From keshetti.mahesh at gmail.com Wed Oct 1 02:07:30 2008 From: keshetti.mahesh at gmail.com (Keshetti Mahesh) Date: Wed, 1 Oct 2008 14:37:30 +0530 Subject: [ofa-general] ***SPAM*** ibdm network topology format In-Reply-To: <20080930121252.GA7396@sashak.voltaire.com> References: <829ded920809290139vf2cc151w4cc8a6fafb49edfe@mail.gmail.com> <829ded920809292304k3ffc78c0m556efbdd7d35c528@mail.gmail.com> <20080930121252.GA7396@sashak.voltaire.com> Message-ID: <829ded920810010207r475d82abu269d47cd3baddb3f@mail.gmail.com> Hi Sasha, > I'm able to run ibdiagnet with ibsim. I need to export SIM_HOST > environment variable so ibdiagnet will start from some host and not a > switch (by default with ibsim application starts running from first > switch in a fabric). I have tried running 'ibdiagnet' on ibsim after exporting SIM_HOST environment variable to some host name. But still 'ibdiagnet' failed to discover the topology. See the below output. ################################################################ [mahesh at n0 ~]$ LD_PRELOAD=/usr/lib64/umad2sim/libumad2sim.so ibdiagnet -v -o . -wt ibdm.topo -r Loading IBDIAGNET from: /home/mahesh/ibutils-1.2/lib/ibdiagnet1.2 -W- Topology file is not specified. Reports regarding cluster links will use direct routes. Loading IBDM from: /home/mahesh/ibutils-1.2/lib/ibdm1.2 -V- IBIS: ibis log file: ./ibdiagnet_ibis.log -V- IBIS: ibis_get_local_ports_info: {0x0000000000100001 0x0002 ACTIVE 1} -I- Using port 1 as the local port. -V--------------------------------------------------- -V- Starting subnet discovery -V--------------------------------------------------- -V- Discovering DirectPath (no. 1) {} -V- running smNodeInfoMad getByDr {} ...status = 6 (after 1 attempts) -V- Searching for bad link(s) on direct route {} ... 
-V- Sending MADs over increments of the direct route {} -V--------------------------------------------------- -V- Starting subnet discovery -V--------------------------------------------------- -V- Discovering DirectPath (no. 1) {} -V- running smNodeInfoMad getByDr {} ...status = 6 (after 1 attempts) -V- Searching for bad link(s) on direct route {} ... -V- Sending MADs over increments of the direct route {} -V--------------------------------------------------- -V- Subnet discovery finished. 0 nodes (0 Switches & 0 CA-s) discovered -V--------------------------------------------------- ################################################################ Please feel free to ask if any more information is required. -Mahesh From sunillp at gmail.com Wed Oct 1 02:44:23 2008 From: sunillp at gmail.com (Sunil Patil) Date: Wed, 1 Oct 2008 15:14:23 +0530 Subject: [ofa-general] ***SPAM*** Tool for changing routes Message-ID: <4fb5e0640810010244j416a0c3ek376252198326ffea@mail.gmail.com> Hi, Is there a tool which shows me all the paths that are currently configured by OpenSM, which also shows all the available paths between a source host and a destination host, allows to delete a path configured by OpenSM and set/configure one or more paths between the hosts? Thanks, Sunil -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From vlad at lists.openfabrics.org Wed Oct 1 03:12:28 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 1 Oct 2008 03:12:28 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20081001-0200 daily build status Message-ID: <20081001101228.4C7ECE609DB@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with 
linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: Build failed on x86_64 with linux-2.6.21.1 Log: /home/vlad/tmp/ofa_1_4_kernel-20081001-0200_linux-2.6.21.1_x86_64_check/include/rdma/ib_verbs.h:1833: error: 'struct scatterlist' has no member named 'dma_address' /home/vlad/tmp/ofa_1_4_kernel-20081001-0200_linux-2.6.21.1_x86_64_check/include/rdma/ib_verbs.h: In function 'ib_sg_dma_len': /home/vlad/tmp/ofa_1_4_kernel-20081001-0200_linux-2.6.21.1_x86_64_check/include/rdma/ib_verbs.h:1846: error: 'struct scatterlist' has no member named 'dma_length' make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081001-0200_linux-2.6.21.1_x86_64_check/drivers/infiniband/hw/ipath/ipath_dma.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081001-0200_linux-2.6.21.1_x86_64_check/drivers/infiniband/hw/ipath] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081001-0200_linux-2.6.21.1_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081001-0200_linux-2.6.21.1_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.21.1' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on ppc64 with linux-2.6.24 Log: /home/vlad/tmp/ofa_1_4_kernel-20081001-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c: In function 'ehca_poll_eqs': /home/vlad/tmp/ofa_1_4_kernel-20081001-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:942: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type /home/vlad/tmp/ofa_1_4_kernel-20081001-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:946: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type make[4]: *** 
[/home/vlad/tmp/ofa_1_4_kernel-20081001-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081001-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081001-0200_linux-2.6.24_ppc64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081001-0200_linux-2.6.24_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.24' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From raq at cttc.upc.edu Wed Oct 1 04:21:20 2008 From: raq at cttc.upc.edu (Ramiro Alba Queipo) Date: Wed, 01 Oct 2008 13:21:20 +0200 Subject: [ofa-general] Catastrophic error on an mthca driver Message-ID: <1222860080.31161.279.camel@mundo> Hi all, I recently had a problem with the server card of an infiniband cluster which in turn made all the fabric down as the opensm daemon had run into problems. 
Running dmesg showed:

--------------------------------------------------------------------
[408188.411258] ib_mthca 0000:0c:00.0: Catastrophic error detected: internal error
[408188.411266] ib_mthca 0000:0c:00.0: buf[00]: 000d0000
[408188.411269] ib_mthca 0000:0c:00.0: buf[01]: 00000000
[408188.411271] ib_mthca 0000:0c:00.0: buf[02]: 00000000
[408188.411274] ib_mthca 0000:0c:00.0: buf[03]: 00000000
[408188.411276] ib_mthca 0000:0c:00.0: buf[04]: 00000000
[408188.411279] ib_mthca 0000:0c:00.0: buf[05]: 00127e9c
[408188.411281] ib_mthca 0000:0c:00.0: buf[06]: ffffffff
[408188.411283] ib_mthca 0000:0c:00.0: buf[07]: 00000000
[408188.411286] ib_mthca 0000:0c:00.0: buf[08]: 00000000
[408188.411288] ib_mthca 0000:0c:00.0: buf[09]: 00000000
[408188.411290] ib_mthca 0000:0c:00.0: buf[0a]: 00000000
[408188.411292] ib_mthca 0000:0c:00.0: buf[0b]: 00000000
[408188.411295] ib_mthca 0000:0c:00.0: buf[0c]: 00000000
[408188.411297] ib_mthca 0000:0c:00.0: buf[0d]: 00000000
[408188.411299] ib_mthca 0000:0c:00.0: buf[0e]: 00000000
[408188.411302] ib_mthca 0000:0c:00.0: buf[0f]: 00000000
------------------------------------------------------------

The problem went away once I restarted networking, that is:

/etc/init.d/networking restart => ifdown -a and then ifup -a

I suspect this was triggered by running 'smpquery', but I do not know whether that makes sense.
Anyway, 'dmesg' now shows the following messages:

---------------------------------------------------------
[417317.088898] ib_mad: Method 1 already in use
[431433.665919] ib_mad: Method 1 already in use
[431533.719671] ib_mad: Method 1 already in use
[438159.301272] ib_mad: Method 1 already in use
[438236.583426] ib_mad: Method 1 already in use
---------------------------------------------------------

I rebooted the server and did a firmware update, which did not seem to be necessary:

flint -d /dev/mst/mt25204_pci_cr0 -i jff202/fw-25204-1_2_0-MHGS18-XTC_A5.bin b

Current FW version on flash: 1.2.0
New FW version: 1.2.0

Note: The new FW version is not newer than the current FW version on flash.

Do you want to continue ? (y/n) [n] : y

Read and verify Invariant Sector - OK
Read and verify PPS/SPS on flash - OK
Burning second FW image without signatures - OK
Restoring second signature - OK

Then I did a verify:

root at jff:~# flint -d /dev/mst/mt25204_pci_cr0 v

Failsafe image:
Invariant /0x00000028-0x00000953 (0x00092c)/ (BOOT2) - OK
Primary Pointer Sector /0x00010000/ - invalid signature (00000000)
Secondary Image /0x00020000-0x00020107 (0x000108)/ (Pointer Sector)- OK
/0x00090028-0x0009086f (0x000848)/ (BOOT2) - OK
/0x00090870-0x000945ff (0x003d90)/ (BOOT2) - OK
/0x00094600-0x0009515f (0x000b60)/ (Configuration) - OK
/0x00095160-0x00095193 (0x000034)/ (GUID) - OK
/0x00095194-0x000951db (0x000048)/ (Image Info) - OK
/0x000951dc-0x0009525b (0x000080)/ (DDR) - OK
/0x0009525c-0x000a8e2f (0x013bd4)/ (DDR) - OK
/0x000a8e30-0x000a8eaf (0x000080)/ (DDR) - OK
/0x000a8eb0-0x000aaebb (0x00200c)/ (DDR) - OK
/0x000aaebc-0x000aaf3b (0x000080)/ (DDR) - OK
/0x000aaf3c-0x000e147b (0x036540)/ (DDR) - OK
/0x000e147c-0x000e148f (0x000014)/ (Configuration) - OK
/0x000e1490-0x000e14d3 (0x000044)/ (Jump addresses) - OK
/0x000e14d4-0x000e16db (0x000208)/ (FW Configuration) - OK

FW image verification succeeded. Image is bootable.
I have now noticed that both the card port and the switch port this card is linked to show 'XmtDiscards' (though the counts do not seem to be growing):

# Port counters: Lid 1 port 1
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrors:....................0
LinkRecovers:....................0
LinkDowned:......................0
RcvErrors:.......................0
RcvRemotePhysErrors:.............0
RcvSwRelayErrors:................0
XmtDiscards:.....................2
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................2
XmtData:.........................1043921
RcvData:.........................123107938
XmtPkts:.........................36932
RcvPkts:.........................249752

# Port counters: Lid 4 port 23
PortSelect:......................23
CounterSelect:...................0x0100
SymbolErrors:....................0
LinkRecovers:....................0
LinkDowned:......................0
RcvErrors:.......................0
RcvRemotePhysErrors:.............0
RcvSwRelayErrors:................142
XmtDiscards:.....................199
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtData:.........................649631134
RcvData:.........................6429228
XmtPkts:.........................1345694
RcvPkts:.........................231549

Is this a hardware problem? Is there a way to check for a hardware problem?

Regards

--
This message has been scanned by MailScanner for viruses and other dangerous content, and is believed to be clean.
For all your IT requirements visit: http://www.transtec.co.uk From hal.rosenstock at gmail.com Wed Oct 1 06:18:47 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 1 Oct 2008 09:18:47 -0400 Subject: [ofa-general] ***SPAM*** Tool for changing routes In-Reply-To: <4fb5e0640810010244j416a0c3ek376252198326ffea@mail.gmail.com> References: <4fb5e0640810010244j416a0c3ek376252198326ffea@mail.gmail.com> Message-ID: On Wed, Oct 1, 2008 at 5:44 AM, Sunil Patil wrote: > Hi, > > Is there a tool which shows me all the paths that are currently configured > by OpenSM, saquery -p obtains all the SA path records from OpenSM. > which also shows all the available paths between a source host > and a destination host, saquery can be used (with various options) to show the relevant path record(s) from src to dest. ibtracert shows the path through the subnet from a src to a dest which can be used with any SM. > allows to delete a path configured by OpenSM and > set/configure one or more paths between the hosts? Paths are setup by the SM based on the routing algorithm selected and the network topology. For multiple paths, LMC needs to be configured. Redundant links can be added or removed from the topology or enabled/disabled via ibportstate. -- Hal > Thanks, > Sunil > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From hnguyen at linux.vnet.ibm.com Wed Oct 1 04:06:09 2008 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Wed, 1 Oct 2008 13:06:09 +0200 Subject: [ofa-general] [PATCH 0/1] IB/ehca: handle creation of UC QP with SRQ Message-ID: <200810011306.09610.hnguyen@linux.vnet.ibm.com> Hi Roland! Sending this patch on behalf of Michael F. It should apply cleanly against your git tree for-2.6.28. 
Thanks Nam Acked-by: Hoang-Nam Nguyen From hnguyen at linux.vnet.ibm.com Wed Oct 1 04:06:31 2008 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Wed, 1 Oct 2008 13:06:31 +0200 Subject: [ofa-general] [PATCH 1/1] IB/ehca: Disallow creating UC QP with SRQ Message-ID: <200810011306.31544.hnguyen@linux.vnet.ibm.com> IB/ehca: Disallow creating QP for UC with SRQ This patch prevents a UC QP to be created with SRQ, since current firmware does not support this feature. Signed-off-by: Michael Faath --- drivers/infiniband/hw/ehca/ehca_qp.c | 6 ++++++ 1 files changed, 6 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index b6bcee0..46897cd 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -460,6 +460,12 @@ static struct ehca_qp *internal_create_qp( struct ehca_qp *my_srq = container_of(init_attr->srq, struct ehca_qp, ib_srq); + if (qp_type == IB_QPT_UC) { + ehca_err(pd->device, "UC with SRQ not supported"); + atomic_dec(&shca->num_qps); + return ERR_PTR(-EINVAL); + } + has_srq = 1; parms.ext_type = EQPT_SRQBASE; parms.srq_qpn = my_srq->real_qp_num; -- 1.5.5 From charr at fusionio.com Wed Oct 1 07:19:34 2008 From: charr at fusionio.com (Cameron Harr) Date: Wed, 01 Oct 2008 08:19:34 -0600 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance Message-ID: <48E386F6.5040502@fusionio.com> Greetings, While trying to maximize IOPs over SRP, I ran into a pretty significant bottleneck. After digging into the matter, I believe the problem is coming from the mlx4 drivers overwhelming one of the cpus, and thus cutting performance to that level (since no more IB traffic means no more data for the disks). My target server (with DAS) contains 8 2.8 GHz CPU cores and can sustain over 200K IOPs locally, but only around 73K IOPs over SRP. Looking at /proc/interrupts, I see that the mlx_core (comp) device is pushing about 135K Int/s on 1 of 2 CPUs. 
All CPUs are enabled for that PCI-E slot, but it only ever uses 2 of the CPUs, and only 1 at a time. None of the other CPUs has an interrupt rate more than about 40-50K/s.

Does anyone know of a trick to spread those interrupts out more (which I realize might be bad due to context switching), or something else that will reduce my interrupts on that cpu? The mlx4 is a MSI-X interrupt. I've changed it to an APIC int, but it seems to give slightly lower performance.

Thanks,
Cameron

CONFIDENTIAL This document and attachments contain information from Fusion-io, Inc. which is confidential and/or legally privileged. The information is intended only for the use of the individual or entity named on this transmission. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution or taking of any action in reliance on the contents of this emailed information is strictly prohibited, and that the documents should be returned to Fusion-io, Inc. immediately. In this regard, if you have received this email in error, please notify us by return email immediately.

From cameron at harr.org Wed Oct 1 07:39:43 2008
From: cameron at harr.org (Cameron Harr)
Date: Wed, 01 Oct 2008 08:39:43 -0600
Subject: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <48E386F6.5040502@fusionio.com>
References: <48E386F6.5040502@fusionio.com>
Message-ID: <48E38BAF.5000801@harr.org>

Alternatively, is there anything in the SCST layer I should tweak? I'm still running rev 245 of that code (kinda old, but works with OFED 1.3.1 w/o hacks).

And sorry about the yucky signature on my past email - I used to be able to send w/o that.

Cameron Harr wrote:
> Greetings,
> While trying to maximize IOPs over SRP, I ran into a pretty
> significant bottleneck.
After digging into the matter, I believe the
> problem is coming from the mlx4 drivers overwhelming one of the cpus,
> and thus cutting performance to that level (since no more IB traffic
> means no more data for the disks).
>
> My target server (with DAS) contains 8 2.8 GHz CPU cores and can
> sustain over 200K IOPs locally, but only around 73K IOPs over SRP.
> Looking at /proc/interrupts, I see that the mlx_core (comp) device is
> pushing about 135K Int/s on 1 of 2 CPUs. All CPUs are enabled for that
> PCI-E slot, but it only ever uses 2 of the CPUs, and only 1 at a time.
> None of the other CPUs has an interrupt rate more than about 40-50K/s.
>
> Does anyone know of a trick to spread those interrupts out more (which
> I realize might be bad due to context switching), or something else
> that will reduce my interrupts on that cpu? The mlx4 is a MSI-X
> interrupt. I've changed it to an APIC int, but it seems to give
> slightly lower performance.
>
> Thanks,
> Cameron
>

From raq at cttc.upc.edu Wed Oct 1 09:08:46 2008
From: raq at cttc.upc.edu (Ramiro Alba Queipo)
Date: Wed, 01 Oct 2008 18:08:46 +0200
Subject: [ofa-general] Infiniband bandwidth
Message-ID: <1222877327.31161.315.camel@mundo>

Hi all,

We have an InfiniBand cluster of 22 nodes with 20 Gbps Mellanox MHGS18-XTC cards, and I ran some network performance tests, both to check the hardware and to clarify some concepts.

Starting from the theoretical peak for the InfiniBand card (in my case 4X DDR => 20 Gbits/s => 2.5 Gbytes/s), we have some limits:

1) Bus type: PCIe 8x => 250 Mbytes/s per lane => 250 * 8 = 2 Gbytes/s

2) According to a thread on the Open MPI users mailing list (???):

The 16 Gbit/s number is the theoretical peak, IB is coded 8/10 so out of the 20 Gbit/s 16 is what you get. On SDR this number is (of course) 8 Gbit/s achievable (which is ~1000 MB/s) and I've seen well above 900 on MPI (this on 8x PCIe, 2x margin)

Is this true?
3) According to another comment in the same thread:

The data throughput limit for 8x PCIe is ~12 Gb/s. The theoretical limit is 16 Gb/s, but each PCIe packet has a whopping 20 byte overhead. If the adapter uses 64 byte packets, then you see 1/3 of the throughput go to overhead.

Could someone explain that to me?

Then I got another comment on the matter:

The best uni-directional performance I have heard of for PCIe 8x IB DDR is ~1,400 MB/s (11.2 Gb/s) with Lustre, which is about 55% of the theoretical 20 Gb/s advertised speed.

---------------------------------------------------------------------

Now, I ran some tests (the MPI used is OpenMPI) with the following results:

a) Using "Performance tests" from OFED 1.3.1
   ib_write_bw -a server -> 1347 MB/s

b) Using hpcc (2 cores on different nodes) -> 1157 MB/s (--mca mpi_leave_pinned 1)

c) Using "OSU Micro-Benchmarks" in "MPItests" from OFED 1.3.1

   1) 2 cores from different nodes
      - mpirun -np 2 --hostfile pool osu_bibw -> 2001.29 MB/s (bidirectional)
      - mpirun -np 2 --hostfile pool osu_bw -> 1311.31 MB/s

   2) 2 cores from the same node
      - mpirun -np 2 osu_bibw -> 2232 MB/s (bidirectional)
      - mpirun -np 2 osu_bw -> 2058 MB/s

The questions are:

- Are those results consistent with what they should be?
- Why are the tests with the two cores in the same node better?
- Should the bidirectional test not be a bit higher?
- Why is hpcc so low?

Thanks in advance

Regards
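
As a rough cross-check of the figures quoted in this message, the arithmetic can be written out as below. This is only a sketch under the assumptions stated in the quoted comments (8b/10b coding on the IB link, 250 MB/s per PCIe lane, and a 20-byte per-packet overhead on 64-byte payloads); it says nothing about what a particular adapter or chipset actually does.

```latex
\begin{align*}
\text{IB 4X DDR data rate} &= 20~\text{Gbit/s} \times \tfrac{8}{10} = 16~\text{Gbit/s} = 2~\text{GB/s} \\
\text{PCIe 8x data rate} &= 8 \times 250~\text{MB/s} = 2~\text{GB/s} = 16~\text{Gbit/s} \\
\text{PCIe payload efficiency} &= \frac{64}{64 + 20} \approx 0.76 \\
\text{usable PCIe throughput} &\approx 16~\text{Gbit/s} \times 0.76 \approx 12.2~\text{Gbit/s} \approx 1.5~\text{GB/s} \\
\text{measured ib\_write\_bw} &= 1347~\text{MB/s} \times 8~\text{bit/byte} \approx 10.8~\text{Gbit/s}
\end{align*}
```

On these assumptions the 1347 MB/s measurement is close to 90% of the ~12 Gb/s PCIe estimate, and the 20 bytes of overhead amount to roughly a third of each 64-byte payload (about a quarter of each 84-byte packet).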
From worleys at gmail.com Wed Oct 1 09:17:08 2008
From: worleys at gmail.com (Chris Worley)
Date: Wed, 1 Oct 2008 10:17:08 -0600
Subject: ***SPAM*** Re: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <48E38BAF.5000801@harr.org>
References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org>
Message-ID:

On Wed, Oct 1, 2008 at 8:39 AM, Cameron Harr wrote:
> Cameron Harr wrote:
>> My target server (with DAS) contains 8 2.8 GHz CPU cores and can sustain
>> over 200K IOPs locally, but only around 73K IOPs over SRP. Looking at
>> /proc/interrupts, I see that the mlx_core (comp) device is pushing about
>> 135K Int/s on 1 of 2 CPUs. All CPUs are enabled for that PCI-E slot, but it
>> only ever uses 2 of the CPUs, and only 1 at a time. None of the other CPUs
>> has an interrupt rate more than about 40-50K/s.

I don't know iSCSI, but in MPI-land we just poll... assuming you don't care if you lose a CPU to polling.

Is there a way to get SRP off its dependence on interrupts, and just poll?

Chris

From ctung at NetEffect.com Wed Oct 1 09:39:09 2008
From: ctung at NetEffect.com (Chien Tung)
Date: Wed, 1 Oct 2008 11:39:09 -0500
Subject: [ofa-general] RE: [PATCH] RDMA/nes: nes_cm.c cleanup
In-Reply-To:
References: <200809151958.m8FJw2sk012367@velma.neteffect.com>
Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC08510A6F@venom2>

> I applied this part:
>
> > -struct nes_cm_node *mini_cm_connect(struct nes_cm_core
> *cm_core,
> +static struct nes_cm_node
> *mini_cm_connect(struct nes_cm_core *cm_core,
>
> since that clearly makes sense, but I dropped:
>
> > - struct nes_qp *nesqp;
> > + struct nes_qp *nesqp = NULL;
>
> and
>
> > - u16 mpa_frame_size = sizeof(struct ietf_mpa_frame) +
> private_data_len;
> > + u16 mpa_frame_size = 0;
>
> > + mpa_frame_size = sizeof(struct ietf_mpa_frame) +
> > + private_data_len;

Thanks for the corrections. Acked.
Chien From chu11 at llnl.gov Wed Oct 1 10:27:17 2008 From: chu11 at llnl.gov (Al Chu) Date: Wed, 01 Oct 2008 10:27:17 -0700 Subject: [ofa-general] Re: [PATCH v2] opensm: routing chaining In-Reply-To: <20081001011920.GJ7396@sashak.voltaire.com> References: <1221506448.6274.32.camel@cardanus.llnl.gov> <20080928202648.GG25831@sashak.voltaire.com> <20080928204244.GH25831@sashak.voltaire.com> <20081001011920.GJ7396@sashak.voltaire.com> Message-ID: <1222882037.1197.32.camel@cardanus.llnl.gov> Hey Sasha, Patch looks good as a whole. I played with it in ibsim to sanity check it one more time. One comment though. I think the setting of p_osm->routing_engine_used is now different than it was before. With file routing, if both the build_lid_matrices and build_lfts calls default to normal minhop, the routing_engine_used file used to be set to MINHOP, but now it would be set to FILE. I didn't realize it before, but I had the same behavior change in my patch series too :-) This isn't necessarily bad. But it is different behavior. Why don't we throw in something like the attached patch. Al On Wed, 2008-10-01 at 04:19 +0300, Sasha Khapyorsky wrote: > From: Albert Chu > > Routing chaining is the ability to configure the order in which routing > algorithms are applied in opensm, i.e. > > -R ftree,updn,minhop > > Try using ftree routing. If ftree fails, try updn. If updn fails, try > minhop. > > In order to get this done, some rearchitecture of the routing code had > to be done b/c there is no longer an assumption that only one routing > engine can be specified. > > Always setup a routing engine, assume no default "fallthrough" minhop > routing engine. On configured routing engine failure, do minhop as > a last resort. Stick a *next pointer into struct osm_routing_engine. > Rearchitect routing engine usage as a list instead of a single struct. > > Signed-off-by: Sasha Khapyorsky > --- > > The difference with previous version is proper 'is_dor' flag handling > in dor routing engine. 
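
The chaining behaviour described above (try each engine named in '-R ftree,updn,minhop' in order, and end up on minhop if nothing else works) can be illustrated with a small standalone sketch. This is not the OpenSM code: 'struct engine', 'registry', 'build_chain()', 'route_with_chain()' and 'free_chain()' are hypothetical names, and which engines fail is hard-coded purely for the demonstration.

```c
#define _POSIX_C_SOURCE 200809L	/* for strdup() and strtok_r() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical engine: a name, a routing callback (0 = success) and a
 * next pointer so that engines can be kept as a singly linked list. */
struct engine {
	const char *name;
	int (*route)(void);
	struct engine *next;
};

/* Demo callbacks; which engines fail is made up for illustration. */
static int route_fails(void) { return -1; }
static int route_works(void) { return 0; }

static const struct {
	const char *name;
	int (*route)(void);
} registry[] = {
	{"ftree", route_fails},
	{"updn", route_fails},
	{"minhop", route_works},
	{NULL, NULL}
};

/* Look up 'name' in the registry and append it to the list tail.
 * Unknown names and allocation failures leave the list unchanged. */
static void append_engine(struct engine **head, const char *name)
{
	struct engine *e, **tail = head;
	size_t i;

	for (i = 0; registry[i].name; i++)
		if (!strcmp(registry[i].name, name))
			break;
	if (!registry[i].name)
		return;
	e = calloc(1, sizeof(*e));
	if (!e)
		return;
	e->name = registry[i].name;
	e->route = registry[i].route;
	while (*tail)
		tail = &(*tail)->next;
	*tail = e;
}

/* Build the list from a comma-separated string such as
 * "ftree,updn,minhop"; an empty or unusable string yields minhop. */
struct engine *build_chain(const char *names)
{
	struct engine *head = NULL;
	char *copy, *name, *save;

	assert(names != NULL);
	copy = strdup(names);
	if (copy) {
		for (name = strtok_r(copy, ", \t\n", &save); name;
		     name = strtok_r(NULL, ", \t\n", &save))
			append_engine(&head, name);
		free(copy);
	}
	if (!head)
		append_engine(&head, "minhop");
	return head;
}

/* Try each engine in order; return the name of the first one that
 * succeeds, or NULL if every engine in the list failed. */
const char *route_with_chain(const struct engine *head)
{
	for (; head; head = head->next)
		if (head->route() == 0)
			return head->name;
	return NULL;
}

/* Release every node of a list returned by build_chain(). */
void free_chain(struct engine *head)
{
	while (head) {
		struct engine *next = head->next;

		free(head);
		head = next;
	}
}
```

With the string "ftree,updn,minhop", route_with_chain() walks past ftree and updn (both rigged to fail in this sketch) and returns "minhop"; a string containing only unknown names produces a one-element list holding minhop.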
> > opensm/include/opensm/osm_opensm.h | 10 ++- > opensm/include/opensm/osm_subnet.h | 7 +- > opensm/include/opensm/osm_ucast_mgr.h | 2 +- > opensm/man/opensm.8.in | 8 +- > opensm/opensm/main.c | 10 ++- > opensm/opensm/osm_opensm.c | 121 +++++++++++++++++++++--------- > opensm/opensm/osm_subnet.c | 11 ++- > opensm/opensm/osm_ucast_file.c | 19 ++--- > opensm/opensm/osm_ucast_ftree.c | 35 +++------ > opensm/opensm/osm_ucast_lash.c | 16 ++-- > opensm/opensm/osm_ucast_mgr.c | 132 ++++++++++++++++++++++----------- > opensm/opensm/osm_ucast_updn.c | 10 +- > 12 files changed, 239 insertions(+), 142 deletions(-) > > diff --git a/opensm/include/opensm/osm_opensm.h b/opensm/include/opensm/osm_opensm.h > index 5d45724..c121be4 100644 > --- a/opensm/include/opensm/osm_opensm.h > +++ b/opensm/include/opensm/osm_opensm.h > @@ -126,6 +126,7 @@ struct osm_routing_engine { > int (*ucast_build_fwd_tables) (void *context); > void (*ucast_dump_tables) (void *context); > void (*delete) (void *context); > + struct osm_routing_engine *next; > }; > /* > * FIELDS > @@ -148,6 +149,9 @@ struct osm_routing_engine { > * delete > * The delete method, may be used for routing engine > * internals cleanup. > +* > +* next > +* Pointer to next routing engine in the list. > */ > > /****s* OpenSM: OpenSM/osm_opensm_t > @@ -178,7 +182,7 @@ typedef struct osm_opensm { > osm_log_t log; > cl_dispatcher_t disp; > cl_plock_t lock; > - struct osm_routing_engine routing_engine; > + struct osm_routing_engine *routing_engine_list; > osm_routing_engine_type_t routing_engine_used; > osm_stats_t stats; > osm_console_t console; > @@ -221,8 +225,8 @@ typedef struct osm_opensm { > * lock > * Shared lock guarding most OpenSM structures. > * > -* routing_engine > -* Routing engine; will be initialized then used. > +* routing_engine_list > +* List of routing engines that should be tried for use. > * > * routing_engine_used > * Indicates which routing engine was used to route a subnet. 
> diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h > index f90f7ea..0c7f3b9 100644 > --- a/opensm/include/opensm/osm_subnet.h > +++ b/opensm/include/opensm/osm_subnet.h > @@ -182,7 +182,7 @@ typedef struct osm_subn_opt { > char *port_prof_ignore_file; > boolean_t port_profile_switch_nodes; > boolean_t sweep_on_trap; > - char *routing_engine_name; > + char *routing_engine_names; > boolean_t connect_roots; > char *lid_matrix_dump_file; > char *lfts_file; > @@ -353,9 +353,8 @@ typedef struct osm_subn_opt { > * sweep_on_trap > * Received traps will initiate a new sweep. > * > -* routing_engine_name > -* Name of used routing engine > -* (other than default Min Hop Algorithm) > +* routing_engine_names > +* Name of routing engine(s) to use. > * > * connect_roots > * The option which will enforce root to root connectivity with > diff --git a/opensm/include/opensm/osm_ucast_mgr.h b/opensm/include/opensm/osm_ucast_mgr.h > index 1dc9a37..59ba9fa 100644 > --- a/opensm/include/opensm/osm_ucast_mgr.h > +++ b/opensm/include/opensm/osm_ucast_mgr.h > @@ -264,7 +264,7 @@ osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr, > * > * SYNOPSIS > */ > -void osm_ucast_mgr_build_lid_matrices(IN osm_ucast_mgr_t * const p_mgr); > +int osm_ucast_mgr_build_lid_matrices(IN osm_ucast_mgr_t * const p_mgr); > /* > * PARAMETERS > * p_mgr > diff --git a/opensm/man/opensm.8.in b/opensm/man/opensm.8.in > index 13d9a32..c1ea584 100644 > --- a/opensm/man/opensm.8.in > +++ b/opensm/man/opensm.8.in > @@ -9,7 +9,7 @@ opensm \- InfiniBand subnet manager and administration (SM/SA) > [\-F | \-\-config ] [\-c(reate-config) ] > [\-g(uid) ] [\-l(mc) ] > [\-p(riority) ] [\-smkey ] [\-r(eassign_lids)] > -[\-R | \-\-routing_engine ] > +[\-R | \-\-routing_engine ] > [\-z | \-\-connect_roots] > [\-M | \-\-lid_matrix_file ] > [\-U | \-\-lfts_file ] > @@ -116,8 +116,10 @@ Without -r, OpenSM attempts to preserve existing > LID assignments resolving multiple use of same LID. 
> .TP > \fB\-R\fR, \fB\-\-routing_engine\fR > -This option chooses routing engine instead of Min Hop > -algorithm (default). > +This option chooses routing engine(s) to use instead of Min Hop > +algorithm (default). Multiple routing engines can be specified > +separated by commas so that specific ordering of routing algorithms > +will be tried if earlier routing engines fail. > Supported engines: minhop, updn, file, ftree, lash, dor > .TP > \fB\-z\fR, \fB\-\-connect_roots\fR > diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c > index 01bfddf..2f53157 100644 > --- a/opensm/opensm/main.c > +++ b/opensm/opensm/main.c > @@ -177,8 +177,10 @@ static void show_usage(void) > " LID assignments resolving multiple use of same LID.\n\n"); > printf("-R\n" > "--routing_engine \n" > - " This option chooses routing engine instead of Min Hop\n" > - " algorithm (default).\n" > + " This option chooses routing engine(s) to use instead of default\n" > + " Min Hop algorithm. Multiple routing engines can be specified\n" > + " separated by commas so that specific ordering of routing\n" > + " algorithms will be tried if earlier routing engines fail.\n" > " Supported engines: updn, file, ftree, lash, dor\n\n"); > printf("-z\n" > "--connect_roots\n" > @@ -851,8 +853,8 @@ int main(int argc, char *argv[]) > break; > > case 'R': > - opt.routing_engine_name = optarg; > - printf(" Activate \'%s\' routing engine\n", optarg); > + opt.routing_engine_names = optarg; > + printf(" Activate \'%s\' routing engine(s)\n", optarg); > break; > > case 'z': > diff --git a/opensm/opensm/osm_opensm.c b/opensm/opensm/osm_opensm.c > index d17fed3..4970d0c 100644 > --- a/opensm/opensm/osm_opensm.c > +++ b/opensm/opensm/osm_opensm.c > @@ -61,24 +61,23 @@ > > struct routing_engine_module { > const char *name; > - int (*setup) (osm_opensm_t * p_osm); > + int (*setup) (struct osm_routing_engine *, osm_opensm_t *); > }; > > -extern int osm_ucast_updn_setup(osm_opensm_t * p_osm); > -extern int 
osm_ucast_file_setup(osm_opensm_t * p_osm); > -extern int osm_ucast_ftree_setup(osm_opensm_t * p_osm); > -extern int osm_ucast_lash_setup(osm_opensm_t * p_osm); > - > -static int osm_ucast_null_setup(osm_opensm_t * p_osm); > +extern int osm_ucast_minhop_setup(struct osm_routing_engine *, osm_opensm_t *); > +extern int osm_ucast_updn_setup(struct osm_routing_engine *, osm_opensm_t *); > +extern int osm_ucast_file_setup(struct osm_routing_engine *, osm_opensm_t *); > +extern int osm_ucast_ftree_setup(struct osm_routing_engine *, osm_opensm_t *); > +extern int osm_ucast_lash_setup(struct osm_routing_engine *, osm_opensm_t *); > +extern int osm_ucast_dor_setup(struct osm_routing_engine *, osm_opensm_t *); > > const static struct routing_engine_module routing_modules[] = { > - {"null", osm_ucast_null_setup}, > - {"minhop", osm_ucast_null_setup}, > + {"minhop", osm_ucast_minhop_setup}, > {"updn", osm_ucast_updn_setup}, > {"file", osm_ucast_file_setup}, > {"ftree", osm_ucast_ftree_setup}, > {"lash", osm_ucast_lash_setup}, > - {"dor", osm_ucast_null_setup}, > + {"dor", osm_ucast_dor_setup}, > {NULL, NULL} > }; > > @@ -135,33 +134,77 @@ osm_routing_engine_type_t osm_routing_engine_type(IN const char *str) > > /********************************************************************** > **********************************************************************/ > -static int setup_routing_engine(osm_opensm_t * p_osm, const char *name) > +static void append_routing_engine(osm_opensm_t *osm, > + struct osm_routing_engine *routing_engine) > { > - const struct routing_engine_module *r; > + struct osm_routing_engine *r; > + > + routing_engine->next = NULL; > + > + if (!osm->routing_engine_list) { > + osm->routing_engine_list = routing_engine; > + return; > + } > + > + r = osm->routing_engine_list; > + while (r->next) > + r = r->next; > > - for (r = routing_modules; r->name && *r->name; r++) { > - if (!strcmp(r->name, name)) { > - p_osm->routing_engine.name = r->name; > - if 
(r->setup(p_osm)) { > - OSM_LOG(&p_osm->log, OSM_LOG_VERBOSE, > + r->next = routing_engine; > +} > + > +static void setup_routing_engine(osm_opensm_t *osm, const char *name) > +{ > + struct osm_routing_engine *re; > + const struct routing_engine_module *m; > + > + for (m = routing_modules; m->name && *m->name; m++) { > + if (!strcmp(m->name, name)) { > + re = malloc(sizeof(struct osm_routing_engine)); > + if (!re) { > + OSM_LOG(&osm->log, OSM_LOG_VERBOSE, > + "memory allocation failed\n"); > + return; > + } > + memset(re, 0, sizeof(struct osm_routing_engine)); > + > + re->name = m->name; > + if (m->setup(re, osm)) { > + OSM_LOG(&osm->log, OSM_LOG_VERBOSE, > "setup of routing" > " engine \'%s\' failed\n", name); > - return -2; > + return; > } > - OSM_LOG(&p_osm->log, OSM_LOG_DEBUG, > - "\'%s\' routing engine set up\n", > - p_osm->routing_engine.name); > - return 0; > + OSM_LOG(&osm->log, OSM_LOG_DEBUG, > + "\'%s\' routing engine set up\n", re->name); > + append_routing_engine(osm, re); > + return; > } > } > - return -1; > + > + OSM_LOG(&osm->log, OSM_LOG_ERROR, > + "cannot find or setup routing engine \'%s\'", name); > } > > -static int osm_ucast_null_setup(osm_opensm_t * p_osm) > +static void setup_routing_engines(osm_opensm_t *osm, const char *engine_names) > { > - OSM_LOG(&p_osm->log, OSM_LOG_VERBOSE, > - "nothing yet - using default (minhop) routing engine\n"); > - return 0; > + char *name, *str, *p; > + > + if (!engine_names || !*engine_names) { > + setup_routing_engine(osm, "minhop"); > + return; > + } > + > + str = strdup(engine_names); > + name = strtok_r(str, ", \t\n", &p); > + while (name && *name) { > + setup_routing_engine(osm, name); > + name = strtok_r(NULL, ", \t\n", &p); > + } > + free(str); > + > + if (!osm->routing_engine_list) > + setup_routing_engine(osm, "minhop"); > } > > /********************************************************************** > @@ -181,6 +224,20 @@ void osm_opensm_construct(IN osm_opensm_t * const p_osm) > > 
/********************************************************************** > **********************************************************************/ > +static void destroy_routing_engines(osm_opensm_t *osm) > +{ > + struct osm_routing_engine *r, *next; > + > + next = osm->routing_engine_list; > + while (next) { > + r = next; > + next = r->next; > + if (r->delete) > + r->delete(r->context); > + free(r); > + } > +} > + > void osm_opensm_destroy(IN osm_opensm_t * const p_osm) > { > /* in case of shutdown through exit proc - no ^C */ > @@ -218,8 +275,7 @@ void osm_opensm_destroy(IN osm_opensm_t * const p_osm) > osm_sa_db_file_dump(p_osm); > > /* do the destruction in reverse order as init */ > - if (p_osm->routing_engine.delete) > - p_osm->routing_engine.delete(p_osm->routing_engine.context); > + destroy_routing_engines(p_osm); > osm_sa_destroy(&p_osm->sa); > osm_sm_destroy(&p_osm->sm); > #ifdef ENABLE_OSM_PERF_MGR > @@ -371,12 +427,7 @@ osm_opensm_init(IN osm_opensm_t * const p_osm, > goto Exit; > #endif /* ENABLE_OSM_PERF_MGR */ > > - if (p_opt->routing_engine_name && > - setup_routing_engine(p_osm, p_opt->routing_engine_name)) > - OSM_LOG(&p_osm->log, OSM_LOG_VERBOSE, > - "cannot find or setup routing engine" > - " \'%s\'. 
Default will be used instead\n", > - p_opt->routing_engine_name); > + setup_routing_engines(p_osm, p_opt->routing_engine_names); > > p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_NONE; > > diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c > index 278aa3d..a39ce75 100644 > --- a/opensm/opensm/osm_subnet.c > +++ b/opensm/opensm/osm_subnet.c > @@ -442,7 +442,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) > p_opt->port_prof_ignore_file = NULL; > p_opt->port_profile_switch_nodes = FALSE; > p_opt->sweep_on_trap = TRUE; > - p_opt->routing_engine_name = NULL; > + p_opt->routing_engine_names = NULL; > p_opt->connect_roots = FALSE; > p_opt->lid_matrix_dump_file = NULL; > p_opt->lfts_file = NULL; > @@ -1264,7 +1264,7 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts) > p_key, p_val, &p_opts->sweep_on_trap); > > opts_unpack_charp("routing_engine", > - p_key, p_val, &p_opts->routing_engine_name); > + p_key, p_val, &p_opts->routing_engine_names); > > opts_unpack_boolean("connect_roots", > p_key, p_val, &p_opts->connect_roots); > @@ -1521,9 +1521,12 @@ int osm_subn_write_conf_file(char *file_name, IN osm_subn_opt_t *const p_opts) > > fprintf(opts_file, > "# Routing engine\n" > + "# Multiple routing engines can be specified separated by\n" > + "# commas so that specific ordering of routing algorithms will\n" > + "# be tried if earlier routing engines fail.\n" > "# Supported engines: minhop, updn, file, ftree, lash, dor\n" > - "routing_engine %s\n\n", p_opts->routing_engine_name ? > - p_opts->routing_engine_name : null_str); > + "routing_engine %s\n\n", p_opts->routing_engine_names ? 
> + p_opts->routing_engine_names : null_str); > > fprintf(opts_file, > "# Connect roots (use FALSE if unsure)\n" > diff --git a/opensm/opensm/osm_ucast_file.c b/opensm/opensm/osm_ucast_file.c > index 3d00cb2..cbd65c1 100644 > --- a/opensm/opensm/osm_ucast_file.c > +++ b/opensm/opensm/osm_ucast_file.c > @@ -135,14 +135,13 @@ static int do_ucast_file_load(void *context) > OSM_LOG(&p_osm->log, OSM_LOG_VERBOSE, > "LFTs file name is not given; " > "using default routing algorithm\n"); > - return -1; > + return 1; > } > > file = fopen(file_name, "r"); > if (!file) { > OSM_LOG(&p_osm->log, OSM_LOG_ERROR | OSM_LOG_SYS, "ERR 6302: " > - "cannot open ucast dump file \'%s\'; " > - "using default routing algorithm\n", file_name); > + "cannot open ucast dump file \'%s\': %m\n", file_name); > return -1; > } > > @@ -270,15 +269,13 @@ static int do_lid_matrix_file_load(void *context) > OSM_LOG(&p_osm->log, OSM_LOG_VERBOSE, > "lid matrix file name is not given; " > "using default lid matrix generation algorithm\n"); > - return -1; > + return 1; > } > > file = fopen(file_name, "r"); > if (!file) { > OSM_LOG(&p_osm->log, OSM_LOG_ERROR | OSM_LOG_SYS, "ERR 6305: " > - "cannot open lid matrix file \'%s\'; " > - "using default lid matrix generation algorithm\n", > - file_name); > + "cannot open lid matrix file \'%s\': %m\n", file_name); > return -1; > } > > @@ -389,10 +386,10 @@ static int do_lid_matrix_file_load(void *context) > return 0; > } > > -int osm_ucast_file_setup(osm_opensm_t * p_osm) > +int osm_ucast_file_setup(struct osm_routing_engine *r, osm_opensm_t *osm) > { > - p_osm->routing_engine.context = (void *)p_osm; > - p_osm->routing_engine.build_lid_matrices = do_lid_matrix_file_load; > - p_osm->routing_engine.ucast_build_fwd_tables = do_ucast_file_load; > + r->context = osm; > + r->build_lid_matrices = do_lid_matrix_file_load; > + r->ucast_build_fwd_tables = do_ucast_file_load; > return 0; > } > diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c > 
index 1d3233c..15168b7 100644 > --- a/opensm/opensm/osm_ucast_ftree.c > +++ b/opensm/opensm/osm_ucast_ftree.c > @@ -3552,8 +3552,7 @@ static int __osm_ftree_construct_fabric(IN void *context) > OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, "Ranking FatTree\n"); > if (__osm_ftree_fabric_rank(p_ftree) != 0) { > osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, > - "Failed ranking the tree - " > - "fat-tree routing falls back to default routing\n"); > + "Failed ranking the tree\n"); > status = -1; > goto Exit; > } > @@ -3567,14 +3566,12 @@ static int __osm_ftree_construct_fabric(IN void *context) > "Populating CA & switch ports\n"); > if (__osm_ftree_fabric_populate_ports(p_ftree) != 0) { > osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, > - "Fabric topology is not a fat-tree - " > - "routing falls back to default routing\n"); > + "Fabric topology is not a fat-tree\n"); > status = -1; > goto Exit; > } else if (p_ftree->cn_num == 0) { > osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, > - "Fabric has no valid compute nodes - " > - "routing falls back to default routing\n"); > + "Fabric has no valid compute nodes\n"); > status = -1; > goto Exit; > } > @@ -3586,8 +3583,7 @@ static int __osm_ftree_construct_fabric(IN void *context) > if (__osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK || > __osm_ftree_fabric_get_rank(p_ftree) < FAT_TREE_MIN_RANK) { > osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, > - "Fabric rank is %u (should be between %u and %u) - " > - "fat-tree routing falls back to default routing\n", > + "Fabric rank is %u (should be between %u and %u)\n", > __osm_ftree_fabric_get_rank(p_ftree), FAT_TREE_MIN_RANK, > FAT_TREE_MAX_RANK); > status = -1; > @@ -3600,8 +3596,7 @@ static int __osm_ftree_construct_fabric(IN void *context) > validation - it checks that all the CNs are at the same rank. 
*/ > if (__osm_ftree_fabric_mark_leaf_switches(p_ftree)) { > osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, > - "Fabric topology is not a fat-tree - " > - "routing falls back to default routing\n"); > + "Fabric topology is not a fat-tree\n"); > status = -1; > goto Exit; > } > @@ -3619,8 +3614,7 @@ static int __osm_ftree_construct_fabric(IN void *context) > In any case, the first and the last switches in the array are REAL leafs. */ > if (__osm_ftree_fabric_create_leaf_switch_array(p_ftree)) { > osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, > - "Fabric topology is not a fat-tree - " > - "routing falls back to default routing\n"); > + "Fabric topology is not a fat-tree\n"); > status = -1; > goto Exit; > } > @@ -3640,8 +3634,7 @@ static int __osm_ftree_construct_fabric(IN void *context) > if (!__osm_ftree_fabric_roots_provided(p_ftree) && > !__osm_ftree_fabric_validate_topology(p_ftree)) { > osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS, > - "Fabric topology is not a fat-tree - " > - "routing falls back to default routing\n"); > + "Fabric topology is not a fat-tree\n"); > status = -1; > goto Exit; > } > @@ -3726,7 +3719,7 @@ static void __osm_ftree_delete(IN void *context) > /*************************************************** > ***************************************************/ > > -int osm_ucast_ftree_setup(osm_opensm_t * p_osm) > +int osm_ucast_ftree_setup(struct osm_routing_engine *r, osm_opensm_t * p_osm) > { > ftree_fabric_t *p_ftree = __osm_ftree_fabric_create(); > if (!p_ftree) > @@ -3734,12 +3727,10 @@ int osm_ucast_ftree_setup(osm_opensm_t * p_osm) > > p_ftree->p_osm = p_osm; > > - p_osm->routing_engine.context = (void *)p_ftree; > - p_osm->routing_engine.build_lid_matrices = __osm_ftree_construct_fabric; > - p_osm->routing_engine.ucast_build_fwd_tables = __osm_ftree_do_routing; > - p_osm->routing_engine.delete = __osm_ftree_delete; > + r->context = (void *)p_ftree; > + r->build_lid_matrices = __osm_ftree_construct_fabric; > + r->ucast_build_fwd_tables = 
__osm_ftree_do_routing; > + r->delete = __osm_ftree_delete; > + > return 0; > } > - > -/*************************************************** > - ***************************************************/ > diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c > index b985e9a..ce3982f 100644 > --- a/opensm/opensm/osm_ucast_lash.c > +++ b/opensm/opensm/osm_ucast_lash.c > @@ -785,7 +785,7 @@ static int init_lash_structures(lash_t * p_lash) > unsigned vl_min = p_lash->vl_min; > unsigned num_switches = p_lash->num_switches; > osm_log_t *p_log = &p_lash->p_osm->log; > - int status = IB_SUCCESS; > + int status = 0; > unsigned int i, j, k; > > OSM_LOG_ENTER(p_log); > @@ -852,7 +852,7 @@ static int init_lash_structures(lash_t * p_lash) > goto Exit; > > Exit_Mem_Error: > - status = IB_ERROR; > + status = -1; > OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4D01: " > "Could not allocate required memory for LASH errno %d, errno %d for lack of memory\n", > errno, ENOMEM); > @@ -875,7 +875,7 @@ static int lash_core(lash_t * p_lash) > int stop = 0, output_link, i_next_switch; > int output_link2, i_next_switch2; > int cycle_found2 = 0; > - int status = IB_SUCCESS; > + int status = 0; > int *switch_bitmap = NULL; /* Bitmap to check if we have processed this pair */ > > OSM_LOG_ENTER(p_log); > @@ -1028,7 +1028,7 @@ static int lash_core(lash_t * p_lash) > goto Exit; > > Error_Not_Enough_Lanes: > - status = IB_ERROR; > + status = -1; > OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4D02: " > "Lane requirements (%d) exceed available lanes (%d)\n", > p_lash->vl_min, lanes_needed); > @@ -1360,15 +1360,15 @@ uint8_t osm_get_lash_sl(osm_opensm_t * p_osm, osm_port_t * p_src_port, > return (uint8_t) ((switch_t *) p_sw->priv)->routing_table[dst_id].lane; > } > > -int osm_ucast_lash_setup(osm_opensm_t * p_osm) > +int osm_ucast_lash_setup(struct osm_routing_engine *r, osm_opensm_t *p_osm) > { > lash_t *p_lash = lash_create(p_osm); > if (!p_lash) > return -1; > > - p_osm->routing_engine.context = 
p_lash; > - p_osm->routing_engine.ucast_build_fwd_tables = lash_process; > - p_osm->routing_engine.delete = lash_delete; > + r->context = p_lash; > + r->ucast_build_fwd_tables = lash_process; > + r->delete = lash_delete; > > return 0; > } > diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c > index 9d0ad13..a4967fe 100644 > --- a/opensm/opensm/osm_ucast_mgr.c > +++ b/opensm/opensm/osm_ucast_mgr.c > @@ -216,7 +216,6 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr, > uint8_t port; > boolean_t is_ignored_by_port_prof; > ib_net64_t node_guid; > - struct osm_routing_engine *p_routing_eng; > unsigned start_from = 1; > > OSM_LOG_ENTER(p_mgr->p_log); > @@ -253,8 +252,6 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr, > > node_guid = osm_node_get_node_guid(p_sw->p_node); > > - p_routing_eng = &p_mgr->p_subn->p_osm->routing_engine; > - > /* > The lid matrix contains the number of hops to each > lid from each port. From this information we determine > @@ -269,18 +266,9 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr, > /* do not try to overwrite the ppro of non existing port ... */ > is_ignored_by_port_prof = TRUE; > > - /* Up/Down routing can cause unreachable routes between some > - switches so we do not report that as an error in that case */ > - if (!p_routing_eng->build_lid_matrices) { > - OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A08: " > - "No path to get to LID %u from switch 0x%" > - PRIx64 "\n", lid_ho, cl_ntoh64(node_guid)); > - /* trigger a new sweep - try again ... 
*/ > - p_mgr->p_subn->subnet_initialization_error = TRUE; > - } else > - OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, > - "No path to get to LID %u from switch 0x%" > - PRIx64 "\n", lid_ho, cl_ntoh64(node_guid)); > + OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, > + "No path to get to LID %u from switch 0x%" PRIx64 "\n", > + lid_ho, cl_ntoh64(node_guid)); > } else { > osm_physp_t *p = osm_node_get_physp_ptr(p_sw->p_node, port); > > @@ -583,7 +571,7 @@ __osm_ucast_mgr_process_neighbors(IN cl_map_item_t * const p_map_item, > > /********************************************************************** > **********************************************************************/ > -void osm_ucast_mgr_build_lid_matrices(IN osm_ucast_mgr_t * const p_mgr) > +int osm_ucast_mgr_build_lid_matrices(IN osm_ucast_mgr_t * const p_mgr) > { > uint32_t i; > uint32_t iteration_max; > @@ -646,6 +634,8 @@ void osm_ucast_mgr_build_lid_matrices(IN osm_ucast_mgr_t * const p_mgr) > OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, > "Min-hop propagated in %d steps\n", i); > } > + > + return 0; > } > > /********************************************************************** > @@ -752,7 +742,7 @@ static void clear_prof_ignore_flag(cl_map_item_t * const p_map_item, void *ctx) > } > } > > -static void ucast_mgr_build_lfts(osm_ucast_mgr_t *p_mgr) > +static int ucast_mgr_build_lfts(osm_ucast_mgr_t *p_mgr) > { > cl_qlist_init(&p_mgr->port_order_list); > > @@ -786,27 +776,56 @@ static void ucast_mgr_build_lfts(osm_ucast_mgr_t *p_mgr) > __osm_ucast_mgr_process_tbl, p_mgr); > > cl_qlist_remove_all(&p_mgr->port_order_list); > + > + return 0; > } > > /********************************************************************** > **********************************************************************/ > +static int ucast_mgr_route(struct osm_routing_engine *r, osm_opensm_t *osm) > +{ > + int ret; > + > + OSM_LOG(&osm->log, OSM_LOG_VERBOSE, > + "building routing with \'%s\' routing algorithm...\n", r->name); > + > + if (!r->build_lid_matrices || 
> + (ret = r->build_lid_matrices(r->context)) > 0) > + ret = osm_ucast_mgr_build_lid_matrices(&osm->sm.ucast_mgr); > + > + if (ret < 0) { > + OSM_LOG(&osm->log, OSM_LOG_ERROR, > + "%s: cannot build lid matrices.\n", r->name); > + return ret; > + } > + > + if (!r->ucast_build_fwd_tables || > + (ret = r->ucast_build_fwd_tables(r->context)) > 0) > + ret = ucast_mgr_build_lfts(&osm->sm.ucast_mgr); > + > + if (ret < 0) { > + OSM_LOG(&osm->log, OSM_LOG_ERROR, > + "%s: cannot build fwd tables.\n", r->name); > + return ret; > + } > + > + osm->routing_engine_used = osm_routing_engine_type(r->name); > + > + return 0; > +} > + > osm_signal_t osm_ucast_mgr_process(IN osm_ucast_mgr_t * const p_mgr) > { > osm_opensm_t *p_osm; > struct osm_routing_engine *p_routing_eng; > osm_signal_t signal = OSM_SIGNAL_DONE; > cl_qmap_t *p_sw_guid_tbl; > - int blm = 0; > - int ubft = 0; > > OSM_LOG_ENTER(p_mgr->p_log); > > p_sw_guid_tbl = &p_mgr->p_subn->sw_guid_tbl; > p_osm = p_mgr->p_subn->p_osm; > - p_routing_eng = &p_osm->routing_engine; > - > - p_mgr->is_dor = p_routing_eng->name > - && (strcmp(p_routing_eng->name, "dor") == 0); > + p_routing_eng = p_osm->routing_engine_list; > > CL_PLOCK_EXCL_ACQUIRE(p_mgr->p_lock); > > @@ -819,28 +838,19 @@ osm_signal_t osm_ucast_mgr_process(IN osm_ucast_mgr_t * const p_mgr) > > p_mgr->any_change = FALSE; > > - if (!p_routing_eng->build_lid_matrices || > - (blm = p_routing_eng->build_lid_matrices(p_routing_eng->context))) > - osm_ucast_mgr_build_lid_matrices(p_mgr); > + p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_NONE; > + while (p_routing_eng) { > + if (!ucast_mgr_route(p_routing_eng, p_osm)) > + break; > + p_routing_eng = p_routing_eng->next; > + } > > - /* > - Now that the lid matrices have been built, we can > - build and download the switch forwarding tables. 
> - */ > - if (!p_routing_eng->ucast_build_fwd_tables || > - (ubft = > - p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context))) > + if (p_osm->routing_engine_used == OSM_ROUTING_ENGINE_TYPE_NONE) { > + /* If configured routing algorithm failed, use default MinHop */ > + osm_ucast_mgr_build_lid_matrices(p_mgr); > ucast_mgr_build_lfts(p_mgr); > - > - /* 'file' routing engine has one unique logic corner case */ > - if (p_routing_eng->name && (strcmp(p_routing_eng->name, "file") == 0) > - && (!blm || !ubft)) > - p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_FILE; > - else if (!blm && !ubft) > - p_osm->routing_engine_used = > - osm_routing_engine_type(p_routing_eng->name); > - else > p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_MINHOP; > + } > > OSM_LOG(p_mgr->p_log, OSM_LOG_INFO, > "%s tables configured on all switches\n", > @@ -861,3 +871,41 @@ Exit: > OSM_LOG_EXIT(p_mgr->p_log); > return (signal); > } > + > +static int ucast_build_lid_matrices(void *context) > +{ > + return osm_ucast_mgr_build_lid_matrices(context); > +} > + > +static int ucast_build_lfts(void *context) > +{ > + return ucast_mgr_build_lfts(context); > +} > + > +int osm_ucast_minhop_setup(struct osm_routing_engine *r, osm_opensm_t *osm) > +{ > + r->context = &osm->sm.ucast_mgr; > + r->build_lid_matrices = ucast_build_lid_matrices; > + r->ucast_build_fwd_tables = ucast_build_lfts; > + return 0; > +} > + > +static int ucast_dor_build_lfts(void *context) > +{ > + osm_ucast_mgr_t *mgr = context; > + int ret; > + > + mgr->is_dor = 1; > + ret = ucast_mgr_build_lfts(mgr); > + mgr->is_dor = 0; > + > + return ret; > +} > + > +int osm_ucast_dor_setup(struct osm_routing_engine *r, osm_opensm_t *osm) > +{ > + r->context = &osm->sm.ucast_mgr; > + r->build_lid_matrices = ucast_build_lid_matrices; > + r->ucast_build_fwd_tables = ucast_dor_build_lfts; > + return 0; > +} > diff --git a/opensm/opensm/osm_ucast_updn.c b/opensm/opensm/osm_ucast_updn.c > index 90e9af8..4fdcc78 100644 > --- 
a/opensm/opensm/osm_ucast_updn.c > +++ b/opensm/opensm/osm_ucast_updn.c > @@ -643,7 +643,7 @@ static int __osm_updn_call(void *ctx) > } else { > OSM_LOG(&p_updn->p_osm->log, OSM_LOG_INFO, > "disabling UPDN algorithm, no root nodes were found\n"); > - ret = 1; > + ret = -1; > } > > if (osm_log_is_active(&p_updn->p_osm->log, OSM_LOG_ROUTING)) > @@ -669,7 +669,7 @@ static void __osm_updn_delete(void *context) > free(context); > } > > -int osm_ucast_updn_setup(osm_opensm_t * p_osm) > +int osm_ucast_updn_setup(struct osm_routing_engine *r, osm_opensm_t *p_osm) > { > updn_t *p_updn; > > @@ -680,9 +680,9 @@ int osm_ucast_updn_setup(osm_opensm_t * p_osm) > > p_updn->p_osm = p_osm; > > - p_osm->routing_engine.context = p_updn; > - p_osm->routing_engine.delete = __osm_updn_delete; > - p_osm->routing_engine.build_lid_matrices = __osm_updn_call; > + r->context = p_updn; > + r->delete = __osm_updn_delete; > + r->build_lid_matrices = __osm_updn_call; > > return 0; > } -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: 0002-set-routing_engine_used-to-minhop-if-all-defaults-us.patch Type: text/x-patch Size: 1943 bytes Desc: not available URL: From YJia at tmriusa.com Wed Oct 1 10:35:53 2008 From: YJia at tmriusa.com (Yicheng Jia) Date: Wed, 1 Oct 2008 12:35:53 -0500 Subject: [ofa-general] questions about opensm and unmanaged switch Message-ID: Hi Folks, I use opensm with a single unmanaged switch to connect several HCAs. I found that HCAs take much longer time to get LID after opensm restart without power cycle the switch. If the switch is power off/on before opensm restart, then HCAs get their LIDs sooner. I'm wondering if there's minhop table pre-existing in the switch which will prevent HCAs from regain their LIDs soon. Is there any way to clean up the switch's hop table during opensm start? 
Another question: should the HCA that opensm resides on be physically connected to switch port 0? Thanks! Yicheng From cameron at harr.org Wed Oct 1 10:55:33 2008 From: cameron at harr.org (Cameron Harr) Date: Wed, 01 Oct 2008 11:55:33 -0600 Subject: ***SPAM*** Re: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> Message-ID: <48E3B995.2020201@harr.org> Chris Worley wrote: > I don't know iSCSI, but in MPI-land we just poll... assuming you don't > care if you lose a CPU to polling. > > Is there a way to get SRP off its dependence on interrupts, and just poll? > > Chris > I looked for a way to enable polling, but was unsuccessful. The best I could find was to turn off MSI interrupts on the boot cmdline with pci=nomsi. If someone knows of a way to go to polling mode, I'll be sure to try it. Cameron From yosefe at voltaire.com Wed Oct 1 11:55:22 2008 From: yosefe at voltaire.com (Yossi Etigin) Date: Wed, 1 Oct 2008 21:55:22 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH v2] ipoib: fix hang while bringing down uninitialized interface In-Reply-To: References: <48CEA6DC.9000904@gmail.com> Message-ID: <32cb786f0810011155i1b16f83bh23a5432436281e00@mail.gmail.com> You are right. This is not really needed. Please use the previous version of the patch. On Tue, Sep 30, 2008 at 6:24 AM, Roland Dreier wrote: > > - handle a case when ipoib_ib_dev_stop() is called twice on the > > same dev->priv - zero the timer after its deletion.
> > I don't understand why this is an issue and why: > > > + /* Make sure the timer was initialized */ > > + if (priv->poll_timer.function) { > > + del_timer_sync(&priv->poll_timer); > > + memset(&priv->poll_timer, 0, sizeof priv->poll_timer); > > this memset is needed. > > If the timer isn't pending, isn't del_timer_sync() just a no-op? What > am I missing? > > - R. > From Nathan.Dauchy at noaa.gov Wed Oct 1 12:05:55 2008 From: Nathan.Dauchy at noaa.gov (Nathan Dauchy) Date: Wed, 01 Oct 2008 13:05:55 -0600 Subject: [ofa-general] Intermittent: ib0: multicast join failed In-Reply-To: References: <2C7DE72B9BD00F44BAECA5B0CBB873953217F5@hermes.terascala.com> <20080919170622.GI27236@sashak.voltaire.com> <2C7DE72B9BD00F44BAECA5B0CBB87395321855@hermes.terascala.com> <2C7DE72B9BD00F44BAECA5B0CBB873953218BE@hermes.terascala.com> Message-ID: <48E3CA13.3070504@noaa.gov> Hal Rosenstock wrote: > On Mon, Sep 22, 2008 at 2:43 PM, Roger Spellman wrote: >> Thanks, Hal. >> >> Below is the output to ibstat and ibstatus. It shows that the rate is >> 2.5 Gb/sec, rather than 10 Gb/sec. >> >> Is there a way to get it to renegotiate the rate, short of rebooting? > > Try ibportstate reset on the switch peer port. You could also replug > the cable on that link. Hal, Is there an easy way to determine the switch peer port from the node itself? [root at h118 ~]# ibportstate -D 0 1 speed 2 Initial PortInfo: # Port info: DR path 0 port 1 LinkSpeedEnabled:................2.5 Gbps After PortInfo set: # Port info: DR path 0 port 1 LinkSpeedEnabled:................5.0 Gbps [root at h118 ~]# ibportstate -D 0 1 reset ibportstate: iberror: failed: smp query nodeinfo: Node type not switch I guess I am looking for more detailed documentation on how to craft the "direct route path". >>> It's likely a rate issue where the negotiated port rate is not the >>> broadcast group rate. 
> > Yes, it's a rate problem (the link is coming up a 1X SDR which is 2.5 > Gbps whereas I suspect that the group is 10 Gbps so it can't join. > I think we are seeing something similar on our mixed SDR/DDR network. All switches are DDR, but ~390 hosts are SDR, ~260 are SDR. Messages like the following show up in "osm.log": Oct 01 03:31:05 514600 [42803940] 0x01 -> __osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 0xff12601bffff0000 : 0x0000000000000016 from port 0x0002c90200224d91 (MT25218 InfiniHostEx Mellanox Technologies) Is there a way to configure the hosts, switches, or subnet manager to avoid this error? Olga Shern's posting implies it is not a real problem and that subsequent multicast joins succeed. Perhaps an update could be made to only log a "warning" for the first failure and "error" if it doesn't join successfully within some number of tries or some number of seconds? Just a thought. Thanks, Nathan From hal.rosenstock at gmail.com Wed Oct 1 12:57:45 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 1 Oct 2008 15:57:45 -0400 Subject: ***SPAM*** Re: [ofa-general] Intermittent: ib0: multicast join failed In-Reply-To: <48E3CA13.3070504@noaa.gov> References: <2C7DE72B9BD00F44BAECA5B0CBB873953217F5@hermes.terascala.com> <20080919170622.GI27236@sashak.voltaire.com> <2C7DE72B9BD00F44BAECA5B0CBB87395321855@hermes.terascala.com> <2C7DE72B9BD00F44BAECA5B0CBB873953218BE@hermes.terascala.com> <48E3CA13.3070504@noaa.gov> Message-ID: On Wed, Oct 1, 2008 at 3:05 PM, Nathan Dauchy wrote: > Hal Rosenstock wrote: >> On Mon, Sep 22, 2008 at 2:43 PM, Roger Spellman wrote: >>> Thanks, Hal. >>> >>> Below is the output to ibstat and ibstatus. It shows that the rate is >>> 2.5 Gb/sec, rather than 10 Gb/sec. >>> >>> Is there a way to get it to renegotiate the rate, short of rebooting? >> >> Try ibportstate reset on the switch peer port. 
You could also replug >> the cable on that link. > > Hal, > > Is there an easy way to determine the switch peer port from the node itself? Are you hooked to a chassis switch or a simple switch like a 24 porter ? You may be able to tell from the face plate as to which port it is. A command based way from that host for your configuration appears to be: smpquery portinfo -D 0,1 | grep LocalPort > [root at h118 ~]# ibportstate -D 0 1 speed 2 > Initial PortInfo: > # Port info: DR path 0 port 1 > LinkSpeedEnabled:................2.5 Gbps > > After PortInfo set: > # Port info: DR path 0 port 1 > LinkSpeedEnabled:................5.0 Gbps > > [root at h118 ~]# ibportstate -D 0 1 reset > ibportstate: iberror: failed: smp query nodeinfo: Node type not switch > > I guess I am looking for more detailed documentation on how to craft the > "direct route path". IBA 1.2.1 chapter 14.2.2 is the definitive source on directed route SMPs >>>> It's likely a rate issue where the negotiated port rate is not the >>>> broadcast group rate. >> >> Yes, it's a rate problem (the link is coming up a 1X SDR which is 2.5 >> Gbps whereas I suspect that the group is 10 Gbps so it can't join. >> > > I think we are seeing something similar on our mixed SDR/DDR network. > All switches are DDR, but ~390 hosts are SDR, ~260 are SDR. Messages > like the following show up in "osm.log": > > Oct 01 03:31:05 514600 [42803940] 0x01 -> __osm_mcmr_rcv_join_mgrp: ERR > 1B11: method = SubnAdmSet, scope_state = 0x1, component mask = > 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: > 0xff12601bffff0000 : 0x0000000000000016 from port 0x0002c90200224d91 > (MT25218 InfiniHostEx Mellanox Technologies) That's an IPv6 group (as indicated by the 0x601b in the MGID). Are you using IPv6 ? If not, you can ignore this. It's not a rate issue; it's a creation issue. > Is there a way to configure the hosts, switches, or subnet manager to > avoid this error? If you are not using IPv6, turn it off. 
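The signature check mentioned above (0x601b in the MGID marking an IPv6 group) can be sketched in C. This is only an illustration under stated assumptions: the helper names are hypothetical and are not part of OpenSM or any IB library, and the layout (0xff in the top byte, then flags/scope, a 16-bit signature of 0x401b for IPv4 or 0x601b for IPv6, then the P_Key) is taken from the IPoIB multicast mapping as I understand it from RFC 4391.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical classification of an IPoIB multicast GID (not an OpenSM API). */
enum ipoib_family { IPOIB_NOT_IPOIB, IPOIB_IPV4, IPOIB_IPV6 };

/* mgid_hi is the upper 64 bits of the MGID as printed in osm.log,
 * e.g. 0xff12601bffff0000 from the ERR 1B11 message quoted above. */
static enum ipoib_family ipoib_mgid_family(uint64_t mgid_hi)
{
	if ((mgid_hi >> 56) != 0xff)	/* multicast GIDs start with 0xff */
		return IPOIB_NOT_IPOIB;

	switch ((mgid_hi >> 32) & 0xffff) {	/* 16-bit IPoIB signature */
	case 0x401b:
		return IPOIB_IPV4;
	case 0x601b:
		return IPOIB_IPV6;
	default:
		return IPOIB_NOT_IPOIB;
	}
}

/* The P_Key follows the signature in the same 64-bit half (assumed layout). */
static uint16_t ipoib_mgid_pkey(uint64_t mgid_hi)
{
	return (uint16_t)((mgid_hi >> 16) & 0xffff);
}

/* Print a one-line description of the upper MGID half. */
static void ipoib_mgid_describe(uint64_t mgid_hi)
{
	static const char *names[] = { "not IPoIB", "IPv4", "IPv6" };

	printf("MGID prefix 0x%016llx: %s, P_Key 0x%04x\n",
	       (unsigned long long)mgid_hi,
	       names[ipoib_mgid_family(mgid_hi)],
	       (unsigned)ipoib_mgid_pkey(mgid_hi));
}
```

Calling ipoib_mgid_describe(0xff12601bffff0000ULL) on the prefix from the log line classifies it as an IPv6 group on the default partition key, which matches the reading given in the reply.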
> Olga Shern's posting implies it is not a real problem and that > subsequent multicast joins succeed. Perhaps an update could be made to > only log a "warning" for the first failure and "error" if it doesn't > join successfully within some number of tries or some number of seconds? This is not easy IMO as there's useful and different information in those messages (group, port, etc.) which mean different things to the network admin. It ends up being a tradeoff of too much in the log v. too little. Some people just want one message and others want to see all the failures. It isn't easy to track whether a certain message has already been logged. -- Hal > Just a thought. > > Thanks, > Nathan > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From haven.hash at isilon.com Wed Oct 1 13:21:22 2008 From: haven.hash at isilon.com (Haven Hash) Date: Wed, 01 Oct 2008 13:21:22 -0700 Subject: [ofa-general] [PATCH][TRIVIAL]mad.c: Need parens to kmalloc correct amount of memory Message-ID: <1222892482.5926.35.camel@hhash-dev> >I assume this has never been a problem because the malloc will probably >word align the allocation, but maybe it was desired? > >Potential patch attached. Signed-off-by: Hal Rosenstock Haven Hash haven.hash at isilon.com- -------------- next part -------------- A non-text attachment was scrubbed... 
Name: mad.c.diff Type: text/x-patch Size: 703 bytes Desc: not available URL: From sashak at voltaire.com Wed Oct 1 13:35:36 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 1 Oct 2008 23:35:36 +0300 Subject: [ofa-general] Re: [PATCH v2] opensm: routing chaining In-Reply-To: <1222882037.1197.32.camel@cardanus.llnl.gov> References: <1221506448.6274.32.camel@cardanus.llnl.gov> <20080928202648.GG25831@sashak.voltaire.com> <20080928204244.GH25831@sashak.voltaire.com> <20081001011920.GJ7396@sashak.voltaire.com> <1222882037.1197.32.camel@cardanus.llnl.gov> Message-ID: <20081001203536.GK7396@sashak.voltaire.com> Hi Al, On 10:27 Wed 01 Oct , Al Chu wrote: > > One comment though. I think the setting of p_osm->routing_engine_used > is now different than it was before. With file routing, if both the > build_lid_matrices and build_lfts calls default to normal minhop, the > routing_engine_used file used to be set to MINHOP, but now it would be > set to FILE. I didn't realize it before, but I had the same behavior > change in my patch series too :-) Yes, and I like this more than how it was - "file" routing engine is requested without files, this could be done for a reason (preserve config, etc)., then I would expect from OpenSM doing something like this - to setup "file" routing engine with fallback methods. 
Sasha From sashak at voltaire.com Wed Oct 1 13:38:13 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 1 Oct 2008 23:38:13 +0300 Subject: [ofa-general] ***SPAM*** ibdm network topology format In-Reply-To: <829ded920810010207r475d82abu269d47cd3baddb3f@mail.gmail.com> References: <829ded920809290139vf2cc151w4cc8a6fafb49edfe@mail.gmail.com> <829ded920809292304k3ffc78c0m556efbdd7d35c528@mail.gmail.com> <20080930121252.GA7396@sashak.voltaire.com> <829ded920810010207r475d82abu269d47cd3baddb3f@mail.gmail.com> Message-ID: <20081001203813.GL7396@sashak.voltaire.com> Hi Manesh, On 14:37 Wed 01 Oct , Keshetti Mahesh wrote: > > I have tried running 'ibdiagnet' on ibsim after exporting SIM_HOST > environment variable to some host name. But still 'ibdiagnet' failed to > discover the topology. See the below output. Hmm, I have similar results now. Will try to investigate deeper... Sasha From rdreier at cisco.com Wed Oct 1 13:42:14 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 01 Oct 2008 13:42:14 -0700 Subject: [ofa-general] [PATCH][TRIVIAL]mad.c: Need parens to kmalloc correct amount of memory In-Reply-To: <1222892482.5926.35.camel@hhash-dev> (Haven Hash's message of "Wed, 01 Oct 2008 13:21:22 -0700") References: <1222892482.5926.35.camel@hhash-dev> Message-ID: > Signed-off-by: Hal Rosenstock No -- the signed-off-by line should come from you. Please read Documentation/SubmittingPatches in the kernel source tree, specifically section 12, and if you are able to release your work under those terms, please send an appropriate Signed-off-by line. From mkrause at hp.com Wed Oct 1 14:21:05 2008 From: mkrause at hp.com (Michael Krause) Date: Wed, 01 Oct 2008 14:21:05 -0700 Subject: [ofa-general] Infiniband bandwidth In-Reply-To: <1222877327.31161.315.camel@mundo> References: <1222877327.31161.315.camel@mundo> Message-ID: <6.2.0.14.2.20081001140617.07bede78@esmail.cup.hp.com> An HTML attachment was scrubbed... 
URL:

From sashak at voltaire.com Wed Oct 1 17:58:39 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 2 Oct 2008 03:58:39 +0300
Subject: [ofa-general] ***SPAM*** Tool for changing routes
In-Reply-To:
References: <4fb5e0640810010244j416a0c3ek376252198326ffea@mail.gmail.com>
Message-ID: <20081002005839.GM7396@sashak.voltaire.com>

On 09:18 Wed 01 Oct , Hal Rosenstock wrote:
>
> > allows to delete a path configured by OpenSM and
> > set/configure one or more paths between the hosts?
>
> Paths are set up by the SM based on the routing algorithm selected and
> the network topology. For multiple paths, LMC needs to be configured.
> Redundant links can be added or removed from the topology or
> enabled/disabled via ibportstate.

Also, OpenSM has a "file" routing engine in which switch forwarding tables
may be configured from a file (edited by hand, etc.). Details are in the
"routing" section of the OpenSM man page.

Sasha

From sashak at voltaire.com Wed Oct 1 18:06:37 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 2 Oct 2008 04:06:37 +0300
Subject: [ofa-general] questions about opensm and unmanaged switch
In-Reply-To:
References:
Message-ID: <20081002010637.GN7396@sashak.voltaire.com>

Hi Yicheng,

On 12:35 Wed 01 Oct , Yicheng Jia wrote:
>
> I use opensm with a single unmanaged switch to connect several HCAs. I
> found that HCAs take much longer time to get LID after opensm restart
> without power cycle the switch. If the switch is power off/on before
> opensm restart, then HCAs get their LIDs sooner. I'm wondering if there's
> minhop table pre-existing in the switch which will prevent HCAs from
> regain their LIDs soon. Is there any way to clean up the switch's hop
> table during opensm start?

On its first run OpenSM does not fetch the existing LFTs from the switch;
it sets them up from scratch. So I think the reason for the delay must be
something else. Also, this is a single-switch subnet, so its setup should
be reasonably fast in both cases.
Do you see any errors in OpenSM log file? > my another question, should the HCA that opensm resides on be physically > connected to switch port 0? No. Port 0 is switch's management port and normally it doesn't have physical connection. Sasha From sashak at voltaire.com Wed Oct 1 19:24:30 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 2 Oct 2008 05:24:30 +0300 Subject: [ofa-general] ***SPAM*** ibdm network topology format In-Reply-To: <20081001203813.GL7396@sashak.voltaire.com> References: <829ded920809290139vf2cc151w4cc8a6fafb49edfe@mail.gmail.com> <829ded920809292304k3ffc78c0m556efbdd7d35c528@mail.gmail.com> <20080930121252.GA7396@sashak.voltaire.com> <829ded920810010207r475d82abu269d47cd3baddb3f@mail.gmail.com> <20081001203813.GL7396@sashak.voltaire.com> Message-ID: <20081002022430.GQ7396@sashak.voltaire.com> Hi Manesh, On 23:38 Wed 01 Oct , Sasha Khapyorsky wrote: > > On 14:37 Wed 01 Oct , Keshetti Mahesh wrote: > > > > I have tried running 'ibdiagnet' on ibsim after exporting SIM_HOST > > environment variable to some host name. But still 'ibdiagnet' failed to > > discover the topology. See the below output. > > Hmm, I have similar results now. Will try to investigate deeper... Right now I found two issues: 1. ibis is linked statically with libosmvendor libibumad. libumad2sim.so is loaded (preloaded) dynamically and requires libibumad.so as well. As result we have two instances of libibumad - only one is initialized properly so only one instance of libibumad has flag new_user_mad_api set. As result libibumad(s) used by libosmvendor and by libumad2sim use different user_mad header formats. It may be solved if we will not use static linkage with ibis. 
Patch like this seems work for me: diff --git a/ibis/src/Makefile.am b/ibis/src/Makefile.am index e0b512f..a52e67f 100644 --- a/ibis/src/Makefile.am +++ b/ibis/src/Makefile.am @@ -74,8 +74,6 @@ LDADD = $(OSM_LDFLAGS) ibis_SOURCES = ibissh_wrap.cpp -ibis_LDFLAGS = -static -# note the order of the libraries does matter as we static link ibis_LDADD = -libiscom $(OSM_LDFLAGS) $(TCL_LIBS) @@ -153,7 +151,8 @@ EXTRA_DIST = swig_extended_obj.c fixSwigWrapper pkgIndex.tcl \ git_version.h # we do not want the temporary and the archive libs installed: -install-libLTLIBRARIES: +# then objects should be linked into program +#install-libLTLIBRARIES: # this actually over write the lib install install-exec-am: install-binPROGRAMS 2. ibis doesn't register class 0x81 - SM direct routed, only SM lid routed (0x1). In comment in ibutils/ibis/src/ibsm.c line 118 is stated: /* no need to bind the Directed Route class as it will automatically be handled by the osm_vendor_bind if asked for LID route */ As far as I can see in osm_vendor_bind() it is not (but it is in opposite order - when class 0x81 is registered class 0x1 will be registered too). Somehow it works without ibsim - so I suspect user_mad handles it. (Hal, could you clarify?) Assuming this is the case, then patch to umad2sim (ibsim) like this can help (helps for me): diff --git a/umad2sim/umad2sim.c b/umad2sim/umad2sim.c index 646cde2..a6196f1 100644 --- a/umad2sim/umad2sim.c +++ b/umad2sim/umad2sim.c @@ -407,7 +408,7 @@ static ssize_t umad2sim_read(struct umad2sim_dev *dev, void *buf, size_t count) mgmt_class = 0; } - umad->agent_id = dev->agent_idx[mgmt_class]; + umad->agent_id = dev->agent_idx[mgmt_class&(~0x80)]; umad->status = ntohl(req.status); umad->timeout_ms = 0; umad->retries = 0; But I'm not sure yet that it is proper solution (up to class registration issue clarification). 
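To make the effect of the `mgmt_class&(~0x80)` change in the umad2sim patch concrete, here is a small sketch of the lookup. It is only an illustration: the dictionary stands in for the `agent_idx` table, and the function name and agent id are hypothetical, not taken from ibsim.

```python
SMP_LID_ROUTED = 0x01      # SM class, LID routed
SMP_DIRECTED_ROUTE = 0x81  # SM class, directed route


def lookup_agent(agent_idx, mgmt_class, fold_directed_route=True):
    """Return the agent id registered for mgmt_class, or None.

    agent_idx is a hypothetical stand-in for the per-class agent table.
    With fold_directed_route set, bit 0x80 is cleared before the lookup,
    so class 0x81 resolves to whatever was registered for class 0x01.
    """
    key = mgmt_class & ~0x80 if fold_directed_route else mgmt_class
    return agent_idx.get(key)


# Only the LID routed SM class has been registered, as ibis does.
registered = {SMP_LID_ROUTED: 3}
print(lookup_agent(registered, SMP_DIRECTED_ROUTE, fold_directed_route=False))
print(lookup_agent(registered, SMP_DIRECTED_ROUTE))
```

Without the mask the directed route class finds no agent; with it, the MAD is delivered to the agent registered for class 0x1. Note that the mask clears bit 0x80 for every class value, not only for 0x81.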
Sasha From Jesse.Butler at Sun.COM Wed Oct 1 21:02:32 2008 From: Jesse.Butler at Sun.COM (Jesse Butler) Date: Wed, 01 Oct 2008 22:02:32 -0600 Subject: [ofa-general] CM REQ, src and dst IP addrs in host byte order Message-ID: <238D803C-1196-456E-9224-3835DD57C793@sun.com> [ Sorry if this is a duplicate mail. I attempted to send once already. ] I am working on an iSER implementation and am having some issues with connection establishment. Dumping out the contents of inbound MADs, I'm seeing the OFED initiator (running OFED 1.3 on RHEL 5.2) sending CM REQs w/ the source and destination IP addresses in host byte order, rather than the network byte order as expected. Is this a known issue, or intentional for some reason? Thanks Jesse From rdreier at cisco.com Wed Oct 1 22:16:09 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 01 Oct 2008 22:16:09 -0700 Subject: [ofa-general] CM REQ, src and dst IP addrs in host byte order In-Reply-To: <238D803C-1196-456E-9224-3835DD57C793@sun.com> (Jesse Butler's message of "Wed, 01 Oct 2008 22:02:32 -0600") References: <238D803C-1196-456E-9224-3835DD57C793@sun.com> Message-ID: > I am working on an iSER implementation and am having some issues with > connection establishment. Dumping out the contents of inbound MADs, > I'm seeing the OFED initiator (running OFED 1.3 on RHEL 5.2) sending > CM REQs w/ the source and destination IP addresses in host byte order, > rather than the network byte order as expected. Are you positive this is the case? Looking at the code quickly I don't see any place that looks like it is byte-swapping the IP addresses. - R. 
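When reading a MAD dump for this kind of problem, it can help to see what the same IPv4 address looks like in each byte order. The sketch below uses only the Python standard library; the address is an arbitrary example, and "host order" is shown as little-endian on the assumption of an x86 host.

```python
import socket
import struct


def ipv4_words(addr):
    """Return (network_order, little_endian) 32-bit readings of addr."""
    packed = socket.inet_aton(addr)  # 4 bytes in network (big-endian) order
    network = struct.unpack("!I", packed)[0]
    little = struct.unpack("<I", packed)[0]
    return network, little


net, le = ipv4_words("192.168.1.10")
print(hex(net), hex(le))
```

If the address bytes in the dump appear reversed relative to the dotted-quad form, the sender most likely stored a host-order integer on a little-endian machine without converting it.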
From olga.shern at gmail.com Thu Oct 2 01:04:41 2008
From: olga.shern at gmail.com (Olga Shern (Voltaire))
Date: Thu, 2 Oct 2008 11:04:41 +0300
Subject: [ofa-general] ***SPAM*** Re: [ewg] Re: Continue of "defer skb_orphan() until irqs enabled"
In-Reply-To:
References: <48DA643E.9040605@Voltaire.COM> <20080924162034.GE15133@sgi.com> <20080924171135.GF15133@sgi.com> <20080924191623.GJ15133@sgi.com> <20080925114414.GA25044@mtls03> <20080928113945.GA32630@mtls03>
Message-ID:

We ran the regression tests and they passed.
We will continue testing and report back if we see any issues.

Olga

On Sun, Sep 28, 2008 at 2:40 PM, Olga Shern (Voltaire) wrote:
> Hi Eli,
>
> We also want to run regression tests with this patch.
> Please let me know when the OFED daily build will include it.
>
> Thanks
> Olga
>
> On Sun, Sep 28, 2008 at 2:39 PM, Eli Cohen wrote:
>> On Fri, Sep 26, 2008 at 01:19:00PM -0700, Roland Dreier wrote:
>>> How about this? Instead of trying to rely on some complicated and
>>> fragile reasoning about when some race might occur, let's just do what
>>> we want to do anyway and get rid of LLTX. We change from priv->tx_lock
>>> (taken with IRQ disabling) to netif_tx_lock (taken with
>>> BH-disabling). And then we can keep the skb_orphan in the place it is,
>>> since our xmit routine runs with IRQs enabled.
>>>
>>
>> We'll integrate this into ofed 1.4 and monitor this through our
>> regression system.
>> _______________________________________________
>> ewg mailing list
>> ewg at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
>>
>

From raq at cttc.upc.edu Thu Oct 2 01:43:11 2008
From: raq at cttc.upc.edu (Ramiro Alba Queipo)
Date: Thu, 02 Oct 2008 10:43:11 +0200
Subject: [ofa-general] Infiniband bandwidth
In-Reply-To: <6.2.0.14.2.20081001140617.07bede78@esmail.cup.hp.com>
References: <1222877327.31161.315.camel@mundo> <6.2.0.14.2.20081001140617.07bede78@esmail.cup.hp.com>
Message-ID: <1222936991.31161.339.camel@mundo>

On Wed, 2008-10-01 at 14:21 -0700, Michael Krause wrote:
> At 09:08 AM 10/1/2008, Ramiro Alba Queipo wrote:
> > Hi all,
> >
> > We have an InfiniBand cluster of 22 nodes with 20 Gbps Mellanox
> > MHGS18-XTC cards, and I tried to run network performance tests both
> > to check the hardware and to clarify concepts.
> >
> > Starting from the theoretical peak of the InfiniBand card (in my
> > case 4X DDR => 20 Gbits/s => 2.5 Gbytes/s) we have some limits:
> >
> > 1) Bus type: PCIe 8x => 250 Mbytes/lane => 250 * 8 = 2 Gbytes/s
> >
> > 2) According to a thread on the Open MPI users mailing list (???):
> >
> > The 16 Gbit/s number is the theoretical peak, IB is coded 8/10 so
> > out of the 20 Gbit/s 16 is what you get. On SDR this number is
> > (of course) 8 Gbit/s achievable (which is ~1000 MB/s) and I've
> > seen well above 900 on MPI (this on 8x PCIe, 2x margin)
> >
> > Is this true?
>
> IB uses 8b/10b encoding. This results in a 20% overhead on every
> frame. Further, IB protocol - header, CRC, flow control credits, etc.
> will consume additional bandwidth - the amount will vary with workload
> and traffic patterns. Also, any fabric can experience congestion which
> may reduce throughput for any given data flow.
>
> PCIe uses 8b/10b encoding for both 2.5GT/s and 5.0 GT/s signaling (the
> next generation signaling is scrambling-based, so it provides 2x the
> data bandwidth with significantly less encoding overhead).
> It also has protocol overheads conceptually similar to IB which will
> consume additional bandwidth (keep in mind many volume chipsets only
> support a 256B transaction size, so a single IB frame may require 8-16
> PCIe transactions to process). There will also be application / device
> driver control messages between the host and the I/O device which will
> consume additional bandwidth.
>
> Also keep in mind that the actual application bandwidth may be further
> gated by the memory subsystem, the I/O-to-memory latency, etc. so
> while the theoretical bandwidths may be quite high, they will be
> constrained by the interactions and the limitations within the overall
> hardware and software stacks.
> >
> > 3) According to another comment in the same thread:
> >
> > The data throughput limit for 8x PCIe is ~12 Gb/s. The theoretical
> > limit is 16 Gb/s, but each PCIe packet has a whopping 20 byte
> > overhead. If the adapter uses 64 byte packets, then you see 1/3 of
> > the throughput go to overhead.
> >
> > Could someone explain that to me?
>
> DMA Read completions are often returned one cache line at a time while
> DMA Writes are often transmitted at the Max_Payload_Size of 256B (some
> chipsets do coalesce completions allowing up to the Max_Payload_Size
> to be returned). Depending upon the mix of transactions required to
> move an IB frame, the overheads may seem excessive.
>
> PCIe overheads vary with the transaction type, the flow control credit
> exchanges, CRC, etc. It is important to keep these in mind when
> evaluating the solution.
>
> > Then I got another comment about the matter:
> >
> > The best uni-directional performance I have heard of for PCIe 8x IB
> > DDR is ~1,400 MB/s (11.2 Gb/s) with Lustre, which is about 55% of
> > the theoretical 20 Gb/s advertised speed.
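The figures quoted in this exchange can be reproduced with a few lines of arithmetic. The sketch below is a rough model only: it applies the 8b/10b factor and the 20-byte per-packet PCIe overhead mentioned above, and ignores flow control, completions, and everything else Mike lists. The function names are illustrative.

```python
def after_8b10b(raw_gbps):
    """Data rate left after 8b/10b line coding (8 data bits per 10 line bits)."""
    return raw_gbps * 8 / 10


def gbps_to_mbytes_per_s(gbps):
    """Convert Gbit/s to MB/s using decimal units."""
    return gbps * 1000 / 8


def payload_fraction(payload_bytes, overhead_bytes):
    """Share of each packet that carries payload."""
    return payload_bytes / (payload_bytes + overhead_bytes)


ib_4x_ddr = after_8b10b(4 * 5.0)    # 4 lanes at 5.0 Gbit/s signaling
pcie_x8 = after_8b10b(8 * 2.5)      # 8 lanes at 2.5 GT/s
pcie_64b = pcie_x8 * payload_fraction(64, 20)
pcie_256b = pcie_x8 * payload_fraction(256, 20)

print(ib_4x_ddr, gbps_to_mbytes_per_s(ib_4x_ddr))
print(round(pcie_64b, 2), round(pcie_256b, 2))
print(round(1400 * 8 / 1000 / 20, 2))  # 1,400 MB/s relative to 20 Gbit/s
```

Under these assumptions both the 4X DDR link and the x8 PCIe slot top out at 16 Gbit/s (2000 MB/s) of data, 64-byte PCIe payloads bring the slot down to roughly 12 Gbit/s, and 256-byte payloads to a little under 15 Gbit/s, which is consistent with the "~12 Gb/s" remark quoted above.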
> >
> >
> > ---------------------------------------------------------------------
> >
> >
> > Now, I did some tests (the MPI used is Open MPI) with the following
> > results:
> >
> > a) Using "Performance tests" from OFED 1.3.1
> >
> >    ib_write_bw -a server -> 1347 MB/s
> >
> > b) Using hpcc (2 cores on different nodes) -> 1157 MB/s (--mca
> >    mpi_leave_pinned 1)
> >
> > c) Using "OSU Micro-Benchmarks" in "MPItests" from OFED 1.3.1
> >
> >    1) 2 cores from different nodes
> >
> >       - mpirun -np 2 --hostfile pool osu_bibw -> 2001.29 MB/s (bidirectional)
> >       - mpirun -np 2 --hostfile pool osu_bw -> 1311.31 MB/s
> >
> >    2) 2 cores from the same node
> >
> >       - mpirun -np 2 osu_bibw -> 2232 MB/s (bidirectional)
> >       - mpirun -np 2 osu_bw -> 2058 MB/s
> >
> > The questions are:
> >
> > - Are those results consistent with what they should be?
> > - Why are the tests with the two cores on the same node better?
> > - Shouldn't the bidirectional test be a bit higher?
> > - Why is hpcc so low?
>
> You would need to provide more information about the system hardware,
> the fabrics, etc. to make any rational response. There are many

We have DELL PowerEdge SC1435 nodes with two AMD 2350 processors (2.0 GHz
processor frequency and 1.8 GHz Integrated Memory Controller speed). The
fabric is built from 20 Gbps Mellanox MHGS18-XTC cards and a Flextronics
24-port 4X DDR switch, with 3-meter cables from Mellanox (MCC4L30-003 4X
microGiGaCN latch, 30 AWG).

> variables here and as I noted above, one cannot just derate the
> hardware by a fixed percentage and conclude there is a real problem in
> the solution stack. It is more complex. The question you should
> ask is whether the micro-benchmarks you are executing are a realistic
> reflection of the real workload. If not, then do any of these numbers

No, I don't think they are. My main intention is to understand what I
really have and why, and to check for link degradation.
Keep in mind that this is my first contact with InfiniBand issues, and
before the end of this year we will have 76 nodes (608 cores) with an
InfiniBand network that will be used for both computation and data, using
NFS-RDMA. Apart from our own tests, which tests would you use to check
that a cluster is ready?

> matter at the end of the day especially if the total time spent
> within the interconnect stacks is relatively small or bursty.
>
> Mike

From vlad at lists.openfabrics.org Thu Oct 2 03:12:00 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Thu, 2 Oct 2008 03:12:00 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20081002-0200 daily build status
Message-ID: <20081002101200.CEDD3E60E62@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with
linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on ppc64 with linux-2.6.24
Log:
/home/vlad/tmp/ofa_1_4_kernel-20081002-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c: In function 'ehca_poll_eqs':
/home/vlad/tmp/ofa_1_4_kernel-20081002-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:942: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type
/home/vlad/tmp/ofa_1_4_kernel-20081002-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:946: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type
make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081002-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081002-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081002-0200_linux-2.6.24_ppc64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081002-0200_linux-2.6.24_ppc64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.24'
make: *** [kernel] Error 2
---------------------------------------------------------------------------------- From ogerlitz at voltaire.com Thu Oct 2 04:48:57 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 02 Oct 2008 14:48:57 +0300 Subject: [ofa-general] CM REQ, src and dst IP addrs in host byte order In-Reply-To: References: <238D803C-1196-456E-9224-3835DD57C793@sun.com> Message-ID: <48E4B529.6020706@voltaire.com> Roland Dreier wrote: > Are you positive this is the case? Looking at the code quickly I don't see any place that looks like it is byte-swapping the IP addresses. Indeed, maybe you want to send the dump of the CM REQ mad? Or. From Jesse.Butler at Sun.COM Thu Oct 2 06:55:43 2008 From: Jesse.Butler at Sun.COM (Jesse Butler) Date: Thu, 02 Oct 2008 07:55:43 -0600 Subject: [ofa-general] CM REQ, src and dst IP addrs in host byte order In-Reply-To: References: <238D803C-1196-456E-9224-3835DD57C793@sun.com> Message-ID: On Oct 1, 2008, at 11:16 PM, Roland Dreier wrote: >> I am working on an iSER implementation and am having some issues with >> connection establishment. Dumping out the contents of inbound MADs, >> I'm seeing the OFED initiator (running OFED 1.3 on RHEL 5.2) sending >> CM REQs w/ the source and destination IP addresses in host byte >> order, >> rather than the network byte order as expected. > > Are you positive this is the case? Looking at the code quickly I > don't > see any place that looks like it is byte-swapping the IP addresses. > > - R. My error. I am looking at the MAD dumps that I did and am now seeing that my initiator is the one that is incorrect, and I have inadvertently fixed it up on the target side so that I work, but OFED fails login. I'll fix up my code and all should be well. 
Thanks Jesse From hal.rosenstock at gmail.com Thu Oct 2 07:18:54 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 2 Oct 2008 10:18:54 -0400 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** ibdm network topology format In-Reply-To: <20081002022430.GQ7396@sashak.voltaire.com> References: <829ded920809290139vf2cc151w4cc8a6fafb49edfe@mail.gmail.com> <829ded920809292304k3ffc78c0m556efbdd7d35c528@mail.gmail.com> <20080930121252.GA7396@sashak.voltaire.com> <829ded920810010207r475d82abu269d47cd3baddb3f@mail.gmail.com> <20081001203813.GL7396@sashak.voltaire.com> <20081002022430.GQ7396@sashak.voltaire.com> Message-ID: Sasha, On Wed, Oct 1, 2008 at 10:24 PM, Sasha Khapyorsky wrote: > Hi Manesh, > > On 23:38 Wed 01 Oct , Sasha Khapyorsky wrote: >> >> On 14:37 Wed 01 Oct , Keshetti Mahesh wrote: >> > >> > I have tried running 'ibdiagnet' on ibsim after exporting SIM_HOST >> > environment variable to some host name. But still 'ibdiagnet' failed to >> > discover the topology. See the below output. >> >> Hmm, I have similar results now. Will try to investigate deeper... > > Right now I found two issues: > > 1. ibis is linked statically with libosmvendor libibumad. libumad2sim.so > is loaded (preloaded) dynamically and requires libibumad.so as well. As > result we have two instances of libibumad - only one is initialized > properly so only one instance of libibumad has flag new_user_mad_api set. > As result libibumad(s) used by libosmvendor and by libumad2sim use > different user_mad header formats. > > It may be solved if we will not use static linkage with ibis. 
> Patch like this seems to work for me:
>
> diff --git a/ibis/src/Makefile.am b/ibis/src/Makefile.am
> index e0b512f..a52e67f 100644
> --- a/ibis/src/Makefile.am
> +++ b/ibis/src/Makefile.am
> @@ -74,8 +74,6 @@ LDADD = $(OSM_LDFLAGS)
>
>  ibis_SOURCES = ibissh_wrap.cpp
>
> -ibis_LDFLAGS = -static
> -# note the order of the libraries does matter as we static link
>  ibis_LDADD = -libiscom $(OSM_LDFLAGS) $(TCL_LIBS)
>
>
> @@ -153,7 +151,8 @@ EXTRA_DIST = swig_extended_obj.c fixSwigWrapper pkgIndex.tcl \
>  git_version.h
>
>  # we do not want the temporary and the archive libs installed:
> -install-libLTLIBRARIES:
> +# then objects should be linked into program
> +#install-libLTLIBRARIES:
>
>  # this actually over write the lib install
>  install-exec-am: install-binPROGRAMS
>
>
> 2. ibis doesn't register class 0x81 - SM direct routed, only SM lid
> routed (0x1). In comment in ibutils/ibis/src/ibsm.c line 118 is stated:
>
> /* no need to bind the Directed Route class as it will automatically
> be handled by the osm_vendor_bind if asked for LID route */
>
> As far as I can see in osm_vendor_bind() it is not (but it is in
> opposite order - when class 0x81 is registered class 0x1 will be
> registered too).

Yes, that is what osm_vendor_ibumad.c:osm_vendor_bind does. So either
ibdiagnet needs to register 0x81 r.t.1 or
osm_vendor_ibumad.c:osm_vendor_bind needs to be "symmetric" in terms of
registering the other SM class when only one is requested. This is a
minor change in the underlying semantics.

[Popping up a level in terms of this, (other than applications taking
advantage of this "feature",) I'm not sure why the vendor layer should
assume that just because one SM class is requested, the other should be
too]. I just looked and the latter appears to be consistent with the
other vendor layers.

I think either solution will work. Your solution below also looks like it
would work, but I don't think that should be done in a sim layer.
> Somehow it works without ibsim - so I suspect user_mad handles it. > > (Hal, could you clarify?) The kernel (user_mad/mad) does not change the requested registrations but I'm not sure I understand the question you are asking to be clarified. Is that what you're asking ? -- Hal > Assuming this is the case, then patch to umad2sim (ibsim) like this can > help (helps for me): > > diff --git a/umad2sim/umad2sim.c b/umad2sim/umad2sim.c > index 646cde2..a6196f1 100644 > --- a/umad2sim/umad2sim.c > +++ b/umad2sim/umad2sim.c > @@ -407,7 +408,7 @@ static ssize_t umad2sim_read(struct umad2sim_dev *dev, void *buf, size_t count) > mgmt_class = 0; > } > > - umad->agent_id = dev->agent_idx[mgmt_class]; > + umad->agent_id = dev->agent_idx[mgmt_class&(~0x80)]; > umad->status = ntohl(req.status); > umad->timeout_ms = 0; > umad->retries = 0; > > > But I'm not sure yet that it is proper solution (up to class registration > issue clarification). > > Sasha > From mkrause at hp.com Thu Oct 2 07:41:27 2008 From: mkrause at hp.com (Michael Krause) Date: Thu, 02 Oct 2008 07:41:27 -0700 Subject: [ofa-general] Infiniband bandwidth In-Reply-To: <1222936991.31161.339.camel@mundo> References: <1222877327.31161.315.camel@mundo> <6.2.0.14.2.20081001140617.07bede78@esmail.cup.hp.com> <1222936991.31161.339.camel@mundo> Message-ID: <6.2.0.14.2.20081002072938.0760e8b0@esmail.cup.hp.com> An HTML attachment was scrubbed... URL: From truelove at array.ca Thu Oct 2 07:41:53 2008 From: truelove at array.ca (Steven Truelove) Date: Thu, 02 Oct 2008 10:41:53 -0400 Subject: [ofa-general] Status of NFS over RDMA and SRP? Message-ID: <48E4DDB1.8010303@array.ca> Hi, I am considering using our existing Infiniband interconnect to provide high-speed storage access to our compute cluster. It looks like the two ways to do this are NFS over RDMA and SRP. I have found downloads for NFS over RDMA, but it is for an older kernel, and doesn't appear to be maintained at the moment. Similarly, I have found little about SRP. 
What is the current status of these features? Is anyone currently maintaining these projects? Are they part of OFED? Do they have separate mailing lists? Thanks, Steven Truelove -- Steven Truelove Array Systems Computing, Inc. 1120 Finch Avenue West, 7th Floor Toronto, Ontario M3J 3H7 CANADA http://www.array.ca truelove at array.ca Phone: (416) 736-0900 x307 Fax: (416) 736-4715 From ctung at neteffect.com Thu Oct 2 07:39:30 2008 From: ctung at neteffect.com (Chien Tung) Date: Thu, 2 Oct 2008 09:39:30 -0500 Subject: [ofa-general] [PATCH 1/4] RDMA/nes: Fix 4K PBL accounting Message-ID: <200810021439.m92EdUsK005499@velma.neteffect.com> From: Vishal Thanki Properly account for freed 4K PBL. Signed-off-by: Vishal Thanki Signed-off-by: Sweta Bhatt Signed-off-by: Chien Tung -- Roland, Please consider these 4 patches for 2.6.28. drivers/infiniband/hw/nes/nes_verbs.c | 19 ++++++++++++------- 1 files changed, 12 insertions(+), 7 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c index 932e56f..cd09493 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.c +++ b/drivers/infiniband/hw/nes/nes_verbs.c @@ -538,14 +538,9 @@ static int nes_dealloc_fmr(struct ib_fmr *ibfmr) struct nes_fmr *nesfmr = to_nesfmr(nesmr); struct nes_vnic *nesvnic = to_nesvnic(ibfmr->device); struct nes_device *nesdev = nesvnic->nesdev; - struct nes_mr temp_nesmr = *nesmr; + struct nes_adapter *nesadapter = nesdev->nesadapter; int i = 0; - temp_nesmr.ibmw.device = ibfmr->device; - temp_nesmr.ibmw.pd = ibfmr->pd; - temp_nesmr.ibmw.rkey = ibfmr->rkey; - temp_nesmr.ibmw.uobject = NULL; - /* free the resources */ if (nesfmr->leaf_pbl_cnt == 0) { /* single PBL case */ @@ -562,7 +557,17 @@ static int nes_dealloc_fmr(struct ib_fmr *ibfmr) nesfmr->root_vpbl.pbl_pbase); } - return nes_dealloc_mw(&temp_nesmr.ibmw); + nesmr->ibmw.device = ibfmr->device; + nesmr->ibmw.pd = ibfmr->pd; + nesmr->ibmw.rkey = ibfmr->rkey; + nesmr->ibmw.uobject = NULL; + + if 
(nesfmr->nesmr.pbl_4k) { + nesadapter->free_4kpbl += nesfmr->nesmr.pbls_used; + BUG_ON(nesadapter->free_4kpbl > nesadapter->max_4kpbl); + } + + return nes_dealloc_mw(&nesmr->ibmw); } From ctung at neteffect.com Thu Oct 2 07:39:30 2008 From: ctung at neteffect.com (Chien Tung) Date: Thu, 2 Oct 2008 09:39:30 -0500 Subject: [ofa-general] [PATCH 3/4] RDMA/nes: Fix routed RDMA connection Message-ID: <200810021439.m92EdUY3005503@velma.neteffect.com> From: Bob Sharp Fix routed RDMA connections. Use neigh_*() to properly locate neighbor. Signed-off-by: Bob Sharp Signed-off-by: Sweta Bhatt Signed-off-by: Chien Tung -- drivers/infiniband/hw/nes/nes_cm.c | 38 ++++++++++++++++++++++++++++------- 1 files changed, 30 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c index 499d3cf..a782f49 100644 --- a/drivers/infiniband/hw/nes/nes_cm.c +++ b/drivers/infiniband/hw/nes/nes_cm.c @@ -52,7 +52,7 @@ #include #include #include - +#include #include #include #include @@ -1019,23 +1019,43 @@ static inline int mini_cm_accelerated(struct nes_cm_core *cm_core, /** - * nes_addr_send_arp + * nes_addr_resolve_neigh */ -static void nes_addr_send_arp(u32 dst_ip) +static int nes_addr_resolve_neigh(struct nes_vnic *nesvnic, u32 dst_ip) { struct rtable *rt; struct flowi fl; + struct neighbour *neigh; + int rc = -1; + DECLARE_MAC_BUF(mac); memset(&fl, 0, sizeof fl); fl.nl_u.ip4_u.daddr = htonl(dst_ip); if (ip_route_output_key(&init_net, &rt, &fl)) { printk("%s: ip_route_output_key failed for 0x%08X\n", __func__, dst_ip); - return; + return rc; + } + + neigh = neigh_lookup(&arp_tbl, &rt->rt_gateway, nesvnic->netdev); + if (neigh) { + if (neigh->nud_state & NUD_VALID) { + nes_debug(NES_DBG_CM, "Neighbor MAC address for 0x%08X" + " is %s, Gateway is 0x%08X \n", dst_ip, + print_mac(mac, neigh->ha), ntohl(rt->rt_gateway)); + nes_manage_arp_cache(nesvnic->netdev, neigh->ha, + dst_ip, NES_ARP_ADD); + rc = nes_arp_table(nesvnic->nesdev, dst_ip, 
NULL, + NES_ARP_RESOLVE); + } + neigh_release(neigh); } - neigh_event_send(rt->u.dst.neighbour, NULL); + if ((neigh == NULL) || (!(neigh->nud_state & NUD_VALID))) + neigh_event_send(rt->u.dst.neighbour, NULL); + ip_rt_put(rt); + return rc; } @@ -1108,9 +1128,11 @@ static struct nes_cm_node *make_cm_node(struct nes_cm_core *cm_core, /* get the mac addr for the remote node */ arpindex = nes_arp_table(nesdev, cm_node->rem_addr, NULL, NES_ARP_RESOLVE); if (arpindex < 0) { - kfree(cm_node); - nes_addr_send_arp(cm_info->rem_addr); - return NULL; + arpindex = nes_addr_resolve_neigh(nesvnic, cm_info->rem_addr); + if (arpindex < 0) { + kfree(cm_node); + return NULL; + } } /* copy the mac addr to node context */ From ctung at neteffect.com Thu Oct 2 07:39:30 2008 From: ctung at neteffect.com (Chien Tung) Date: Thu, 2 Oct 2008 09:39:30 -0500 Subject: [ofa-general] [PATCH 2/4] RDMA/nes: Change CQ allocation for performance applications Message-ID: <200810021439.m92EdUlB005501@velma.neteffect.com> From: Vadim Makhervaks Change CQ allocation scheme for performance applications. 
Signed-off-by: Vadim Makhervaks Signed-off-by: Sweta Bhatt Signed-off-by: Chien Tung -- drivers/infiniband/hw/nes/nes_verbs.c | 4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c index cd09493..cd79780 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.c +++ b/drivers/infiniband/hw/nes/nes_verbs.c @@ -1600,7 +1600,9 @@ static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries, nes_ucontext->mcrqf = req.mcrqf; if (nes_ucontext->mcrqf) { if (nes_ucontext->mcrqf & 0x80000000) - nescq->hw_cq.cq_number = nesvnic->nic.qp_id + 12 + (nes_ucontext->mcrqf & 0xf) - 1; + nescq->hw_cq.cq_number = + nesvnic->nic.qp_id + 28 + + 2*((nes_ucontext->mcrqf & 0xf) - 1); else if (nes_ucontext->mcrqf & 0x40000000) nescq->hw_cq.cq_number = nes_ucontext->mcrqf & 0xffff; else From ctung at neteffect.com Thu Oct 2 07:39:30 2008 From: ctung at neteffect.com (Chien Tung) Date: Thu, 2 Oct 2008 09:39:30 -0500 Subject: [ofa-general] [PATCH 4/4] RDMA/nes: Clear cm_id only when done with cm_node Message-ID: <200810021439.m92EdUfG005505@velma.neteffect.com> From: Faisal Latif Clear cm_node->cm_id only when we are really done with it. 
Signed-off-by: Faisal Latif Signed-off-by: Sweta Bhatt Signed-off-by: Chien Tung -- drivers/infiniband/hw/nes/nes_cm.c | 4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c index 3bf90fb..896297b 100644 --- a/drivers/infiniband/hw/nes/nes_cm.c +++ b/drivers/infiniband/hw/nes/nes_cm.c @@ -2008,6 +2008,7 @@ static int mini_cm_close(struct nes_cm_core *cm_core, struct nes_cm_node *cm_nod case NES_CM_STATE_CLOSE_WAIT: cm_node->state = NES_CM_STATE_LAST_ACK; send_fin(cm_node, NULL); + cm_node->cm_id = NULL; break; case NES_CM_STATE_FIN_WAIT1: case NES_CM_STATE_FIN_WAIT2: @@ -2015,6 +2016,7 @@ static int mini_cm_close(struct nes_cm_core *cm_core, struct nes_cm_node *cm_nod case NES_CM_STATE_TIME_WAIT: case NES_CM_STATE_CLOSING: ret = -1; + cm_node->cm_id = NULL; break; case NES_CM_STATE_LISTENING: case NES_CM_STATE_UNKNOWN: @@ -2029,7 +2031,7 @@ static int mini_cm_close(struct nes_cm_core *cm_core, struct nes_cm_node *cm_nod ret = rem_ref_cm_node(cm_core, cm_node); break; } - cm_node->cm_id = NULL; + return ret; } From hal.rosenstock at gmail.com Thu Oct 2 07:50:40 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 2 Oct 2008 10:50:40 -0400 Subject: [ofa-general] ***SPAM*** ibutils/ibis linking error Message-ID: Hi Oren, When building using the master branch of ~orenk/ibutils.git, I get the following error: if g++ -DHAVE_CONFIG_H -I. -I. -I.. 
-I/usr/include/tcl8.4 -I/usr/local/include/infiniband -I/usr/local/include -DOSM_VENDOR_INTF_OPENIB -DOSM_BUILD_OPENIB -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -O2 -Wall -fno-strict-aliasing -fPIC -DIBIS_VERSION=\"1.2\" -g -O2 -MT ibissh_wrap.o -MD -MP -MF ".deps/ibissh_wrap.Tpo" -c -o ibissh_wrap.o ibissh_wrap.cpp; \ then mv -f ".deps/ibissh_wrap.Tpo" ".deps/ibissh_wrap.Po"; else rm -f ".deps/ibissh_wrap.Tpo"; exit 1; fi /bin/sh ../libtool --tag=CXX --mode=link g++ -I/usr/include/tcl8.4 -I/usr/local/include/infiniband -I/usr/local/include -DOSM_VENDOR_INTF_OPENIB -DOSM_BUILD_OPENIB -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -O2 -Wall -fno-strict-aliasing -fPIC -DIBIS_VERSION=\"1.2\" -g -O2 -o ibis -static ibissh_wrap.o -libiscom -Wl,-rpath -Wl,/usr/local/lib -L/usr/local/lib -lopensm -losmvendor -losmcomp -libumad -libcommon -L/usr/lib64 -ltcl8.4 -ldl -lpthread -lm g++ -I/usr/include/tcl8.4 -I/usr/local/include/infiniband -I/usr/local/include -DOSM_VENDOR_INTF_OPENIB -DOSM_BUILD_OPENIB -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -O2 -Wall -fno-strict-aliasing -fPIC -DIBIS_VERSION=\"1.2\" -g -O2 -o ibis ibissh_wrap.o -Wl,-rpath -Wl,/usr/local/lib /home/halr/ibutils/ibis/src/.libs/libibiscom.a -L/usr/local/lib /usr/local/lib/libopensm.a /usr/local/lib/libosmvendor.a -L/home/halr/management/opensm/complib /usr/local/lib/libosmcomp.a -lwrap /usr/local/lib/libibumad.a /usr/local/lib/libibcommon.a -L/usr/lib64 -ltcl8.4 -ldl -lpthread -lm /usr/lib/libc_nonshared.a(elf-init.oS)(.text.__i686.get_pc_thunk.bx+0x0): In function `__i686.get_pc_thunk.bx': : multiple definition of `__i686.get_pc_thunk.bx' ibissh_wrap.o(.gnu.linkonce.t.__i686.get_pc_thunk.bx+0x0):/home/halr/ibutils/ibis/src/ibissh_wrap.cpp:186: first defined here collect2: ld returned 1 exit status Any idea on what is causing this ? Thanks. 
-- Hal From Thomas.Talpey at netapp.com Thu Oct 2 08:49:12 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Thu, 02 Oct 2008 11:49:12 -0400 Subject: [ofa-general] Status of NFS over RDMA and SRP? In-Reply-To: <48E4DDB1.8010303@array.ca> References: <48E4DDB1.8010303@array.ca> Message-ID: At 10:41 AM 10/2/2008, Steven Truelove wrote: >Hi, > > I am considering using our existing Infiniband interconnect to >provide high-speed storage access to our compute cluster. It looks like >the two ways to do this are NFS over RDMA and SRP. > > I have found downloads for NFS over RDMA, but it is for an older >kernel, and doesn't appear to be maintained at the moment. NFS/RDMA is present in the mainline kernel starting with 2.6.25 for the client and 2.6.26 for the server. Just configure it in File Systems->Network Filesystems->NFS etc, the NFS/RDMA options appear whenever the RDMA layer is available. > Similarly, I >have found little about SRP. > > What is the current status of these features? Is anyone currently >maintaining these projects? Are they part of OFED? Do they have >separate mailing lists? NFS/RDMA will also be part of OFED starting with 1.4, but whether that is your best choice depends on what kernel you're planning to run. If you're running RHEL5 and want to stay on it, then OFED1.4 would be your best choice. NFS overall is a very active development area, and NFS/RDMA especially is usually best from the top of the tree in kernel.org. In fact I have a number of client patches almost ready to go for (hopefully) 2.6.28. There are also a large number of server patches already in the 2.6.27 rc's. We have a mailing list for NFS/RDMA support at sourceforge, you can find it from . Or, simply post to the Linux NFS list at linux-nfs at vger.kernel.org, or to me and Tom Tucker (cc'd, server NFS/RDMA) directly. Tom. 
From sashak at voltaire.com Thu Oct 2 10:00:33 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 2 Oct 2008 20:00:33 +0300 Subject: [ofa-general] ***SPAM*** ibdm network topology format In-Reply-To: References: <829ded920809290139vf2cc151w4cc8a6fafb49edfe@mail.gmail.com> <829ded920809292304k3ffc78c0m556efbdd7d35c528@mail.gmail.com> <20080930121252.GA7396@sashak.voltaire.com> <829ded920810010207r475d82abu269d47cd3baddb3f@mail.gmail.com> <20081001203813.GL7396@sashak.voltaire.com> <20081002022430.GQ7396@sashak.voltaire.com> Message-ID: <20081002170033.GI25831@sashak.voltaire.com> Hi Hal, On 10:18 Thu 02 Oct , Hal Rosenstock wrote: > > > > 2. ibis doesn't register class 0x81 - SM direct routed, only SM lid > > routed (0x1). In comment in ibutils/ibis/src/ibsm.c line 118 is stated: > > > > /* no need to bind the Directed Route class as it will automatically > > be handled by the osm_vendor_bind if asked for LID route */ > > > > As far as I can see in osm_vendor_bind() it is not (but it is in > > opposite order - when class 0x81 is registered class 0x1 will be > > registered too). > > Yes that is what osm_vendor_ibumad.c:osm_vendor_bind does. > > So either ibdiagnet needs to register 0x81 r.t.1 or > osm_vendor_ibumad.c:osm_vendor_bind needs to be "symmetric" in terms > of registering the other SM class when only one is requested. This is > a minor change in the underlying semantics. [Popping up a level in > terms of this, (other than applications taking advantage of this > "feature",) I'm not sure why the vendor layer should assume that just > because one SM class is requested, the other should be too]. I just > looked and the latter appears to be consistent with the other vendor > layers. I think either solution will work. Your solution below also > looks like it would work but don't that should be done in a sim layer. I'm not like this "solution" too, but the fact that ibis works with real stack without registering 0x81 class is unclear for me. 
> > Somehow it works without ibsim - so I suspect user_mad handles it. > > > > (Hal, could you clarify?) > > The kernel (user_mad/mad) does not change the requested registrations > but I'm not sure I understand the question you are asking to be > clarified. Is that what you're asking ? ibis works somehow with real stack. It registers 0x1 class only and uses direct routing SMPs. Do you have any idea about why (osm_vendor_idumad and/or libibumad don't help)? Sasha From Thomas.Talpey at netapp.com Thu Oct 2 10:39:56 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Thu, 02 Oct 2008 13:39:56 -0400 Subject: [ofa-general] rdma_resolve_route() returning -EINVAL Message-ID: I'm debugging a reconnect problem in the NFS/RDMA client and am seeing something rather odd. The context is that if a client mount point goes idle for 5 minutes, the Linux RPC layer closes the associated connection. When a new request needs to be sent, the RPC layer then performs a reconnect. At this point, the NFS/RDMA client code will call rdma_create_id() to create a new rdma_cm_id, then rdma_resolve_addr() and finally rdma_resolve_route(). In the reconnect scenario, that last step however returns -EINVAL. Looking at the code, I think the only reasons for this return are 1) calling rdma_resolve_route() in the wrong state (which I'm not), and 2) way down in the ib_post_send_mad() function, if there is a timeout passed-in (which there is) and there's no receive handler registered for the MAD (no clue but it worked the first time). This is using the ib_mthca driver, and 2.6.27-rc7 btw. Any clues to help figure out what might be wrong? Thanks, Tom. 
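For reference, the first of the two causes listed above (calling the route-resolution step in the wrong state) can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the enum, struct, and function names below are hypothetical and are not the kernel's rdma_cm code. It shows the general pattern of a call that refuses to run unless address resolution has already completed, which is one way an out-of-order event can surface as -EINVAL.

```c
#include <errno.h>

/* Hypothetical connection states; a simplified stand-in, not the kernel's enum. */
enum model_state {
    MODEL_IDLE,
    MODEL_ADDR_QUERY,
    MODEL_ADDR_RESOLVED,
    MODEL_ROUTE_QUERY
};

/* Hypothetical connection identifier holding only the current state. */
struct model_id {
    enum model_state state;
};

/* Move to 'next' only if the id is currently in 'expected'; returns 1 on success. */
int model_comp_exch(struct model_id *id, enum model_state expected,
                    enum model_state next)
{
    if (id->state != expected)
        return 0;
    id->state = next;
    return 1;
}

/* Start address resolution; only legal from the idle state. */
int model_resolve_addr(struct model_id *id)
{
    return model_comp_exch(id, MODEL_IDLE, MODEL_ADDR_QUERY) ? 0 : -EINVAL;
}

/* Simulated "address resolved" event; ignored if it arrives in any other state. */
void model_addr_resolved_event(struct model_id *id)
{
    model_comp_exch(id, MODEL_ADDR_QUERY, MODEL_ADDR_RESOLVED);
}

/* Start route resolution; fails with -EINVAL unless the address is resolved. */
int model_resolve_route(struct model_id *id)
{
    return model_comp_exch(id, MODEL_ADDR_RESOLVED, MODEL_ROUTE_QUERY) ? 0 : -EINVAL;
}
```

In this model, a caller that proceeds to model_resolve_route() before the address-resolved event has been processed gets -EINVAL, even though the same sequence succeeds when each step waits for its own event.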
From christopher.tanner at gatech.edu Thu Oct 2 14:06:30 2008 From: christopher.tanner at gatech.edu (Christopher Tanner) Date: Thu, 2 Oct 2008 17:06:30 -0400 Subject: [ofa-general] Bare minimum install Message-ID: <0A4A10CC-F78C-4C0D-ACF5-41E09D151030@gatech.edu> All - I have been handed a cluster that has Mellanox Infiniband cards and Ubuntu 8.04 and I've been trying to install the necessary software to get it to communicate. I think I need to start from ground zero as I have no idea how Infiniband works and what is / isn't necessary for my system. Assuming I have absolutely no Infiniband software on my system (i.e. fresh OS install): a) What OFED packages are absolutely necessary for my MPI applications (i.e. LS-DYNA) to use IB using MVAPICH2 or OpenMPI? b) Will installing the Infiniband software on an NFS mount decrease its usefulness? (assuming IPoIB is not setup) Would it be best to install the Infiniband software on the local drive of each node? c) What OFED packages are necessary to setup IPoIB? For a small cluster (16 nodes), is IPoIB even worth the trouble? d) Is there a document that describes, in simple terms, how Infiniband works so that I know why I'm installing certain packages and how to manually configure the system? (Automated install doesn't work with Ubuntu) For those who have helped me in the past, thank you. I'm starting again from ground zero since everything I tried beforehand didn't work. 
------------------------------------------- Chris Tanner Space Systems Design Lab Georgia Institute of Technology christopher.tanner at gatech.edu ------------------------------------------- From rdreier at cisco.com Thu Oct 2 14:34:45 2008 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 02 Oct 2008 14:34:45 -0700 Subject: [ofa-general] Bare minimum install In-Reply-To: <0A4A10CC-F78C-4C0D-ACF5-41E09D151030@gatech.edu> (Christopher Tanner's message of "Thu, 2 Oct 2008 17:06:30 -0400") References: <0A4A10CC-F78C-4C0D-ACF5-41E09D151030@gatech.edu> Message-ID: > a) What OFED packages are absolutely necessary for my MPI applications > (i.e. LS-DYNA) to use IB using MVAPICH2 or OpenMPI? None, with Ubuntu 8.04. Just install the openmpi-bin package. > b) Will installing the Infiniband software on an NFS mount decrease > its usefulness? (assuming IPoIB is not setup) Would it be best to > install the Infiniband software on the local drive of each node? Either way is fine. > c) What OFED packages are necessary to setup IPoIB? For a small > cluster (16 nodes), is IPoIB even worth the trouble? None. Just modprobe ib_ipoib and ifconfig your ibX interfaces however you want. > d) Is there a document that describes, in simple terms, how Infiniband > works so that I know why I'm installing certain packages and how to > manually configure the system? (Automated install doesn't work with > Ubuntu) Not really, unfortunately. - R. From YJia at tmriusa.com Thu Oct 2 15:14:15 2008 From: YJia at tmriusa.com (Yicheng Jia) Date: Thu, 2 Oct 2008 17:14:15 -0500 Subject: [ofa-general] questions about opensm and unmanaged switch In-Reply-To: <20081002010637.GN7396@sashak.voltaire.com> Message-ID: Hi Sasha, The error I got are "osm_db_store: ERR 6109: Failed to remove file:/tmp//guid2lid" and "osm_db_store: ERR 6108: Failed to rename the db file to:/tmp//guid2lid". I set "OSM_DEFAULT_CACHE_DIR" to "/tmp/". Could it be the reason of slow? Thanks! 
Yicheng Sasha Khapyorsky 10/01/2008 08:05 PM To Yicheng Jia cc general at lists.openfabrics.org Subject Re: [ofa-general] questions about opensm and unmanaged switch Hi Yicheng, On 12:35 Wed 01 Oct , Yicheng Jia wrote: > > I use opensm with a single unmanaged switch to connect several HCAs. I > found that HCAs take much longer time to get LID after opensm restart > without power cycle the switch. If the switch is power off/on before > opensm restart, then HCAs get their LIDs sooner. I'm wondering if there's > minhop table pre-existing in the switch which will prevent HCAs from > regain their LIDs soon. Is there any way to clean up the switch's hop > table during opensm start? In first run OpenSM will not fetch existing LFTs from the switch and will setup this from "scratch". I think that the reason for delay should be different. Also it is single-switch subnet, its setup should be reasonably fast in both cases. Do you see any errors in OpenSM log file? > my another question, should the HCA that opensm resides on be physically > connected to switch port 0? No. Port 0 is switch's management port and normally it doesn't have physical connection. Sasha _____________________________________________________________________________ Scanned by IBM Email Security Management Services powered by MessageLabs. For more information please visit http://www.ers.ibm.com _____________________________________________________________________________ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hal.rosenstock at gmail.com Thu Oct 2 15:22:03 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 2 Oct 2008 18:22:03 -0400 Subject: [ofa-general] ***SPAM*** ibdm network topology format In-Reply-To: <20081002170033.GI25831@sashak.voltaire.com> References: <829ded920809290139vf2cc151w4cc8a6fafb49edfe@mail.gmail.com> <829ded920809292304k3ffc78c0m556efbdd7d35c528@mail.gmail.com> <20080930121252.GA7396@sashak.voltaire.com> <829ded920810010207r475d82abu269d47cd3baddb3f@mail.gmail.com> <20081001203813.GL7396@sashak.voltaire.com> <20081002022430.GQ7396@sashak.voltaire.com> <20081002170033.GI25831@sashak.voltaire.com> Message-ID: Sasha, On Thu, Oct 2, 2008 at 1:00 PM, Sasha Khapyorsky wrote: > Hi Hal, > > On 10:18 Thu 02 Oct , Hal Rosenstock wrote: >> > >> > 2. ibis doesn't register class 0x81 - SM direct routed, only SM lid >> > routed (0x1). In comment in ibutils/ibis/src/ibsm.c line 118 is stated: >> > >> > /* no need to bind the Directed Route class as it will automatically >> > be handled by the osm_vendor_bind if asked for LID route */ >> > >> > As far as I can see in osm_vendor_bind() it is not (but it is in >> > opposite order - when class 0x81 is registered class 0x1 will be >> > registered too). >> >> Yes that is what osm_vendor_ibumad.c:osm_vendor_bind does. >> >> So either ibdiagnet needs to register 0x81 r.t.1 or >> osm_vendor_ibumad.c:osm_vendor_bind needs to be "symmetric" in terms >> of registering the other SM class when only one is requested. This is >> a minor change in the underlying semantics. [Popping up a level in >> terms of this, (other than applications taking advantage of this >> "feature",) I'm not sure why the vendor layer should assume that just >> because one SM class is requested, the other should be too]. I just >> looked and the latter appears to be consistent with the other vendor >> layers. I think either solution will work. 
Your solution below also >> looks like it would work but don't that should be done in a sim layer. > > I'm not like this "solution" too, but the fact that ibis works with real > stack without registering 0x81 class is unclear for me. Me too. See below. >> > Somehow it works without ibsim - so I suspect user_mad handles it. >> > >> > (Hal, could you clarify?) >> >> The kernel (user_mad/mad) does not change the requested registrations >> but I'm not sure I understand the question you are asking to be >> clarified. Is that what you're asking ? > > ibis works somehow with real stack. It registers 0x1 class only and > uses direct routing SMPs. Do you have any idea about why > (osm_vendor_idumad and/or libibumad don't help)? libibumad umad_register does not do anything that would affect this either. I can only conclude there must be something in ibutils that fixes this if it does work with the real stack. It shouldn't be too hard to track down where that registration for class 0x81 comes from. -- Hal > Sasha > From hal.rosenstock at gmail.com Thu Oct 2 15:29:39 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 2 Oct 2008 18:29:39 -0400 Subject: [ofa-general] rdma_resolve_route() returning -EINVAL In-Reply-To: References: Message-ID: Tom, On Thu, Oct 2, 2008 at 1:39 PM, Talpey, Thomas wrote: > I'm debugging a reconnect problem in the NFS/RDMA client and > am seeing something rather odd. The context is that if a client > mount point goes idle for 5 minutes, the Linux RPC layer closes > the associated connection. When a new request needs to be > sent, the RPC layer then performs a reconnect. > > At this point, the NFS/RDMA client code will call rdma_create_id() > to create a new rdma_cm_id, then rdma_resolve_addr() and > finally rdma_resolve_route(). In the reconnect scenario, that > last step however returns -EINVAL. 
> > Looking at the code, I think the only reasons for this return are > 1) calling rdma_resolve_route() in the wrong state (which I'm not), > and 2) way down in the ib_post_send_mad() function, if there is > a timeout passed-in (which there is) and there's no receive handler > registered for the MAD (no clue but it worked the first time). Are you saying you're suspecting reason 2 above ? FWIW, my read relative to ib_post_send_mad is that CM does register a receive handler so I don't think -EINVAL comes from there. Are you actually seeing the lack of a receive handler or is it from reviewing the code looking from where -EINVAL could possibly come ? -- Hal > This is using the ib_mthca driver, and 2.6.27-rc7 btw. Any clues to > help figure out what might be wrong? > > Thanks, > Tom. > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From alex.estrin at qlogic.com Thu Oct 2 17:19:24 2008 From: alex.estrin at qlogic.com (Alex Estrin) Date: Thu, 2 Oct 2008 19:19:24 -0500 Subject: [ofa-general] IPoIB CM connectivity issue. In-Reply-To: References: <20080923083956.GA14288@mtls03> Message-ID: Hello, It seems there is a problem with the connection algorithm in IPoIB CM. In case if connection initiated by remote node(B), OFED node(A) accepts connection and then sends it's own connect request to the same node(B) using different RC QP! Here is what I see on the wire: A -> ARP REQ ->B B -> CREQ ->A A -> CREP (local RC QPN1) ->B B -> RTU ->A B -> ARP REP ->A unicast packet delivered over RC QP A -> CREQ (local RC QPN2) ->B !!!!!!!!!!!!!!!!!!! How many RC queue pairs can OFED nodes use to connect to each other? If the connection is initiated from OFED (A) FIRST, node (B) accepts the request and the connection stays alive until the ARP table entry for (B) expires and (A) sends DREQ to (B). 
Only one RC QP is used to communicate. B -> ARP REQ ->A A -> ARP REP ->B !!(questionable. Please see note2 below) A -> CREQ (local RC QPN1) ->B B -> CREP ->A A -> RTU ->B ....... .... followed data delivered over RC QP .... Note1: Initial assumption - in the beginning both hosts have clean ARP tables. Note2: It looks like violation of RFC4755 convention to send unicast data over connected QP. Please let me know if I missed anything. Thanks, Alex. From rdreier at cisco.com Thu Oct 2 17:27:36 2008 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 02 Oct 2008 17:27:36 -0700 Subject: [ofa-general] Re: IPoIB CM connectivity issue. In-Reply-To: (Alex Estrin's message of "Thu, 2 Oct 2008 19:19:24 -0500") References: <20080923083956.GA14288@mtls03> Message-ID: > In case if connection initiated by remote node(B), OFED node(A) > accepts connection and then sends it's own connect request to the same > node(B) using different RC QP! Yes, in the Linux implementation of IPoIB CM, each QP is used for traffic in only a single direction. This simplifies things quite a bit in the implementation, and as far as I know is a fully compliant thing to do. > It looks like violation of RFC4755 convention to send unicast data over > connected QP. I assume you mean sending ARP replies. Yes, you are correct. I never noticed before but RFC 4755 does say: Additionally, all address resolution responses (ARP or Neighbor Discovery) MUST always be encapsulated in a UD mode packet. So we should fix the Linux implementation to respect this. However I'm somewhat surprised that this is an issue, given that unicast_arp_send() seems like it never would issue a CM send. - R. 
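The RFC 4755 requirement quoted above could be checked at transmit time roughly as follows. This is a minimal sketch, assuming the 4-byte IPoIB encapsulation header described in RFC 4391 (a 16-bit type field followed by 16 reserved bits). The function and constant names are hypothetical, and this is not the Linux ipoib driver's code. It is also deliberately conservative: it routes every ARP or RARP packet to UD rather than only the responses, and it does not attempt to detect IPv6 Neighbor Discovery, which would require inspecting the ICMPv6 payload.

```c
#include <stddef.h>
#include <stdint.h>

/* Type values carried in the IPoIB encapsulation header (same as Ethernet). */
#define SKETCH_ETHERTYPE_IPV4 0x0800u
#define SKETCH_ETHERTYPE_ARP  0x0806u
#define SKETCH_ETHERTYPE_RARP 0x8035u

/* Assumed header layout: 2 bytes of type followed by 2 reserved bytes. */
#define SKETCH_IPOIB_HDR_LEN  4u

/* Read the 16-bit type field, stored in network byte order, from the header. */
uint16_t sketch_ipoib_type(const uint8_t *hdr)
{
    return (uint16_t)(((uint16_t)hdr[0] << 8) | hdr[1]);
}

/*
 * Returns 1 if the packet should be sent on the UD QP, or 0 if it may be
 * sent on a connected-mode QP. Packets too short to classify fall back to UD.
 */
int sketch_must_use_ud(const uint8_t *pkt, size_t len)
{
    uint16_t type;

    if (pkt == NULL || len < SKETCH_IPOIB_HDR_LEN)
        return 1;

    type = sketch_ipoib_type(pkt);
    return type == SKETCH_ETHERTYPE_ARP || type == SKETCH_ETHERTYPE_RARP;
}
```

A transmit path could call sketch_must_use_ud() on each outgoing packet before choosing between the datagram QP and a connected QP; ordinary IPv4 traffic would return 0 and remain eligible for the connected path.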
From Thomas.Talpey at netapp.com Thu Oct 2 18:07:31 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Thu, 02 Oct 2008 21:07:31 -0400 Subject: [ofa-general] rdma_resolve_route() returning -EINVAL In-Reply-To: References: Message-ID: At 06:29 PM 10/2/2008, Hal Rosenstock wrote: >Tom, > >On Thu, Oct 2, 2008 at 1:39 PM, Talpey, Thomas > wrote: >> I'm debugging a reconnect problem in the NFS/RDMA client and >> am seeing something rather odd. The context is that if a client >> mount point goes idle for 5 minutes, the Linux RPC layer closes >> the associated connection. When a new request needs to be >> sent, the RPC layer then performs a reconnect. >> >> At this point, the NFS/RDMA client code will call rdma_create_id() >> to create a new rdma_cm_id, then rdma_resolve_addr() and >> finally rdma_resolve_route(). In the reconnect scenario, that >> last step however returns -EINVAL. >> >> Looking at the code, I think the only reasons for this return are >> 1) calling rdma_resolve_route() in the wrong state (which I'm not), >> and 2) way down in the ib_post_send_mad() function, if there is >> a timeout passed-in (which there is) and there's no receive handler >> registered for the MAD (no clue but it worked the first time). > >Are you saying you're suspecting reason 2 above ? FWIW, my read >relative to ib_post_send_mad is that CM does register a receive Hi Hal, thanks for looking at it. As it turns out I've determined it's actually 1) above, but for a new reason. It turns out that the CM has a new upcall enum called RDMA_CM_EVENT_TIMEWAIT_EXIT which is emitted shortly after any disconnect. This upcall arrives either before or during my connection recovery and signals a completion in my code that causes the re-binding to skip a step. What's the purpose of this new upcall, do you know? It's not used by anything I see. Tom. >handler so I don't think -EINVAL comes from there. 
Are you actually >seeing the lack of a receive handler or is it from reviewing the code >looking from where -EINVAL could possibly come ? > >-- Hal > >> This is using the ib_mthca driver, and 2.6.27-rc7 btw. Any clues to >> help figure out what might be wrong? >> >> Thanks, >> Tom. From rdreier at cisco.com Thu Oct 2 18:09:27 2008 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 02 Oct 2008 18:09:27 -0700 Subject: [ofa-general] rdma_resolve_route() returning -EINVAL In-Reply-To: (Thomas Talpey's message of "Thu, 02 Oct 2008 21:07:31 -0400") References: Message-ID: It was added in: commit 38ca83a588662f0af684ba2567dd910a564268ab Author: Amir Vadai Date: Tue Jul 22 14:14:23 2008 -0700 RDMA/cma: Add RDMA_CM_EVENT_TIMEWAIT_EXIT event Consumers that want to re-use their QPs in new connections need to know when the QP has exited the timewait state. Report the timewait event through the rdma_cm. basically you can't really free an IB QP until it fully leaves the CM state machine, ie when it finishes with timewait. - R. From Thomas.Talpey at netapp.com Thu Oct 2 18:18:25 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Thu, 02 Oct 2008 21:18:25 -0400 Subject: [ofa-general] rdma_resolve_route() returning -EINVAL In-Reply-To: References: Message-ID: At 09:09 PM 10/2/2008, Roland Dreier wrote: >It was added in: > >commit 38ca83a588662f0af684ba2567dd910a564268ab >Author: Amir Vadai >Date: Tue Jul 22 14:14:23 2008 -0700 > > RDMA/cma: Add RDMA_CM_EVENT_TIMEWAIT_EXIT event > > Consumers that want to re-use their QPs in new connections need to > know when the QP has exited the timewait state. Report the timewait > event through the rdma_cm. > >basically you can't really free an IB QP until it fully leaves the CM >state machine, ie when it finishes with timewait. Ok,. I guess that's not obvious from the name, but in particular if it's so important, then why doesn't anyone implement it? 
It happens pretty much immediately on my IB adapter QPs, under what conditions would it be delayed? And since rdma_destroy_qp() is void, how would I know if I call it too early? I'm thinking maybe I should be waiting for it, and not the disconnect event, before beginning my connection recovery... Tom. From rdreier at cisco.com Thu Oct 2 18:24:50 2008 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 02 Oct 2008 18:24:50 -0700 Subject: [ofa-general] rdma_resolve_route() returning -EINVAL In-Reply-To: (Thomas Talpey's message of "Thu, 02 Oct 2008 21:18:25 -0400") References: Message-ID: > Ok,. I guess that's not obvious from the name, but in particular if it's > so important, then why doesn't anyone implement it? It happens pretty > much immediately on my IB adapter QPs, under what conditions would > it be delayed? And since rdma_destroy_qp() is void, how would I know > if I call it too early? The timewait state is just long enough for any packets to drain out of the IB fabric -- the IB spec CM chapter has the exact formula, but it's going to be very short. You're allowed to destroy a QP earlier, but you have a remote chance of getting into trouble if you reuse the same QP number before any stale packets have drained from the fabric. The issue is more of spec compliance than a likely real-life scenario... and as for why no one else is worrying about it, I think it's because the only other user of rdma_connect() in the tree is iSER, and I guess no one worried too much there. SRP uses the IB CM directly, and waits for timewait exit before calling a connection closed. - R. From Thomas.Talpey at netapp.com Thu Oct 2 19:11:30 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Thu, 02 Oct 2008 22:11:30 -0400 Subject: [ofa-general] rdma_resolve_route() returning -EINVAL In-Reply-To: References: Message-ID: At 09:24 PM 10/2/2008, Roland Dreier wrote: > > > Ok,. 
I guess that's not obvious from the name, but in particular if it's > > so important, then why doesn't anyone implement it? It happens pretty > > much immediately on my IB adapter QPs, under what conditions would > > it be delayed? And since rdma_destroy_qp() is void, how would I know > > if I call it too early? > >The timewait state is just long enough for any packets to drain out of >the IB fabric -- the IB spec CM chapter has the exact formula, but it's >going to be very short. > >You're allowed to destroy a QP earlier, but you have a remote chance of >getting into trouble if you reuse the same QP number before any stale >packets have drained from the fabric. Makes sense. I am currently using rdma_create_qp(), so the qpnum isn't in my control I think. I'm a bit more worried about this being more likely in a wide-area iWARP connection, which some of the grid folks are considering. OTOH, I don't want to delay connection recovery - I just found and fixed the source of a 5-second pause! :-) > >The issue is more of spec compliance than a likely real-life >scenario... and as for why no one else is worrying about it, I think >it's because the only other user of rdma_connect() in the tree is iSER, >and I guess no one worried too much there. SRP uses the IB CM directly, >and waits for timewait exit before calling a connection closed. I didn't see the SRP timewait behavior, I'll go look again. Thanks, Tom. > > - R. From mashirle at us.ibm.com Thu Oct 2 21:03:24 2008 From: mashirle at us.ibm.com (Shirley Ma) Date: Thu, 02 Oct 2008 21:03:24 -0700 Subject: [ofa-general] Re: IPoIB CM connectivity issue. In-Reply-To: References: <20080923083956.GA14288@mtls03> Message-ID: <1223006604.8387.7.camel@IBM-29AB850785D.beaverton.ibm.com> Hello, On Thu, 2008-10-02 at 17:27 -0700, Roland Dreier wrote: > So we should fix the Linux implementation to respect this. However > I'm > somewhat surprised that this is an issue, given that > unicast_arp_send() > seems like it never would issue a CM send. 
I remember seeing the problem when the ARP entry was in stale status (15 - 45 secs if no traffic, by default base_reachable_time = 30), but the router entry was still there (5 mins expiration). The neighbour is then still there to be used, so IPoIB used CM mode instead of unicast_arp_send. Shirley From mashirle at us.ibm.com Thu Oct 2 21:20:39 2008 From: mashirle at us.ibm.com (Shirley Ma) Date: Thu, 02 Oct 2008 21:20:39 -0700 Subject: [ofa-general] openSM for supporting IPv6 SNM MGIDs consolidation Message-ID: <1223007639.8387.22.camel@IBM-29AB850785D.beaverton.ibm.com> Hello Sasha, We have had several customers hit the issue of exceeding the multicast group limit when IPv6 was in place. My question is: has the IPv6 SNM (Solicited Node Multicast) MGIDs consolidation feature been included in openSM 3.2.X? If so, can the customer just update openSM without updating the whole fabric? Has any backward compatibility testing been done between openSM 3.2.X and other OFED releases, like OFED-1.3? Were there any issues? If not, is there any other way to increase MGIDs to 2K/4K to avoid this problem in a 1K node cluster, like a parameter setting in openSM, regardless of how many MGIDs the switch can support? Thanks Shirley From alex.estrin at qlogic.com Thu Oct 2 22:20:32 2008 From: alex.estrin at qlogic.com (Alex Estrin) Date: Fri, 3 Oct 2008 00:20:32 -0500 Subject: [ofa-general] RE: IPoIB CM connectivity issue. References: <20080923083956.GA14288@mtls03> Message-ID: > > In case if connection initiated by remote node(B), OFED node(A) > > accepts connection and then sends it's own connect request to the same > > node(B) using different RC QP! > Yes, in the Linux implementation of IPoIB CM, each QP is used for > traffic in only a single direction. This simplifies things quite a bit > in the implementation, and as far as I know is a fully compliant thing > to do. 
In the second case (when OFED is the CREQ initiator) only one RC QP was
used to establish a connection, and apparently bidirectional traffic was
able to go through that one QP.

> > It looks like violation of RFC4755 convention to send unicast data over
> > connected QP.
> I assume you mean sending ARP replies. Yes, you are correct. I never
> noticed before but RFC 4755 does say:
> Additionally, all address resolution responses (ARP or Neighbor
> Discovery) MUST always be encapsulated in a UD mode packet.

Yes, you are right. Please discard my note regarding ARP reply.

Thanks,
Alex

From rdreier at cisco.com Thu Oct 2 22:38:07 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 02 Oct 2008 22:38:07 -0700
Subject: [ofa-general] Re: IPoIB CM connectivity issue.
In-Reply-To: (Alex Estrin's message of "Fri, 3 Oct 2008 00:20:32 -0500")
References: <20080923083956.GA14288@mtls03>
Message-ID:

> In the second case (when OFED is the CREQ initiator) only one RC QP was
> used to establish a connection, and apparently bidirectional traffic was
> able to go through that one QP.

Yes, at least in the case where you have an SRQ-capable adapter, it
doesn't really matter which QP has incoming traffic. However, it was
much simpler in the IPoIB implementation to simply open a QP to send
traffic rather than searching through all passive connections for a
connection to the same peer.

Is this behavior causing problems for you?

> > I assume you mean sending ARP replies. Yes, you are correct. I never
> > noticed before but RFC 4755 does say:
> > Additionally, all address resolution responses (ARP or Neighbor
> > Discovery) MUST always be encapsulated in a UD mode packet.
> Yes, you are right. Please discard my note regarding ARP reply.

Not sure what you mean -- if Linux is sending ARP replies on a connected
QP, then that is not allowed according to the RFC.
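
As a rough illustration of the kind of check the RFC 4755 requirement implies, here is a minimal Python sketch (not the Linux IPoIB code). It assumes the 4-byte IPoIB encapsulation header of RFC 4391 (16-bit ethertype followed by 16 reserved bits). The function name `must_use_ud` is hypothetical, and the sketch is deliberately conservative: it flags every ARP and every IPv6 Neighbor Discovery message, not only responses, and it only recognises ICMPv6 when it directly follows the fixed IPv6 header (no extension headers).

```python
import struct

ETH_P_ARP = 0x0806         # ARP ethertype
ETH_P_IPV6 = 0x86DD        # IPv6 ethertype
IPPROTO_ICMPV6 = 58        # IPv6 next-header value for ICMPv6
ND_TYPES = range(133, 138) # RS, RA, NS, NA, Redirect

IPOIB_ENCAP_LEN = 4        # 16-bit type + 16-bit reserved (assumed, RFC 4391)
IPV6_HDR_LEN = 40          # fixed IPv6 header length


def must_use_ud(frame: bytes) -> bool:
    """Hypothetical helper: True if the frame looks like ARP or IPv6 ND."""
    if len(frame) < IPOIB_ENCAP_LEN:
        return False
    (ethertype,) = struct.unpack_from("!H", frame, 0)
    if ethertype == ETH_P_ARP:
        # Cheap case: the ethertype alone identifies ARP.
        return True
    if ethertype == ETH_P_IPV6:
        ip6 = frame[IPOIB_ENCAP_LEN:]
        # Expensive case: look past the IPv6 header at the ICMPv6 type.
        # Simplification: extension headers are not walked.
        if len(ip6) > IPV6_HDR_LEN and ip6[6] == IPPROTO_ICMPV6:
            return ip6[IPV6_HDR_LEN] in ND_TYPES
    return False
```

The ARP branch needs only the first two bytes of the frame, while the Neighbor Discovery branch has to parse into the IPv6 payload, which is the asymmetry in implementation cost between the two cases.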
However, looking at this quote again I see that the RFC's requirement
rather unfortunately includes neighbour discovery too. It's not *too*
bad to look at the ethertype in the IPoIB pseudo-header to check for an
ARP packet, but sending all neighbour discovery messages seems very ugly
-- even just sending all ICMP6 messages via UD wouldn't be very nice to
implement, and it would eg break ping6 with large messages, so we would
have to look deep deep into packets to see which were ND messages.

I wonder what the rationale behind that part of the RFC was?

 - R.

From vlad at lists.openfabrics.org Fri Oct 3 03:16:47 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri, 3 Oct 2008 03:16:47 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20081003-0200 daily build status
Message-ID: <20081003101647.7B43BE609D6@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on ppc64 with linux-2.6.24
Log:
/home/vlad/tmp/ofa_1_4_kernel-20081003-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c: In function 'ehca_poll_eqs':
/home/vlad/tmp/ofa_1_4_kernel-20081003-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:942: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type
/home/vlad/tmp/ofa_1_4_kernel-20081003-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:946: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type
make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081003-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081003-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081003-0200_linux-2.6.24_ppc64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081003-0200_linux-2.6.24_ppc64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.24'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------

From sashak at voltaire.com Fri
Oct 3 08:39:00 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 3 Oct 2008 18:39:00 +0300
Subject: [ofa-general] Re: openSM for supporting IPv6 SNM MGIDs consolidation
In-Reply-To: <1223007639.8387.22.camel@IBM-29AB850785D.beaverton.ibm.com>
References: <1223007639.8387.22.camel@IBM-29AB850785D.beaverton.ibm.com>
Message-ID: <20081003153900.GC6566@sashak.voltaire.com>

Hi Shirley,

On 21:20 Thu 02 Oct , Shirley Ma wrote:
>
> We had several customers hit the multicast groups exceeded issue when
> IPv6 was in place in the past. My question is: has the IPv6 SNM
> (Solicited Node Multicast) MGIDs consolidation feature been in openSM
> 3.2.X?

Yes, it is the '--consolidate_ipv6_snm_req' option.

> If so, can the customer just update openSM without updating the whole
> fabric? Has any backward compatibility test been done between this
> version 3.2.X openSM and other OFED releases, like OFED-1.3? Any issue
> or no problem?

AFAIK there should not be a problem with the upgrade. The management
packages do not really depend on the rest of OFED.

> If not, is there any other way to increase MGIDs to 2K/4K to avoid this
> problem in a 1K node cluster, like a parameter setting in openSM
> regardless of how many MGIDs the switch can support?

We don't have this feature before 3.2.x.

Sasha

From sashak at voltaire.com Fri Oct 3 08:50:51 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 3 Oct 2008 18:50:51 +0300
Subject: [ofa-general] questions about opensm and unmanaged switch
In-Reply-To:
References: <20081002010637.GN7396@sashak.voltaire.com>
Message-ID: <20081003155051.GD6566@sashak.voltaire.com>

On 17:14 Thu 02 Oct , Yicheng Jia wrote:
>
> The error I got are "osm_db_store: ERR 6109: Failed to remove
> file:/tmp//guid2lid" and "osm_db_store: ERR 6108: Failed to rename the db
> file to:/tmp//guid2lid". I set "OSM_DEFAULT_CACHE_DIR" to "/tmp/". Could
> it be the reason of slow?

I don't think so. How does the ibnetdiscover output look (after the slow
setup)?
Sasha

From sashak at voltaire.com Fri Oct 3 08:52:47 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 3 Oct 2008 18:52:47 +0300
Subject: [ofa-general] questions about opensm and unmanaged switch
In-Reply-To: <20081003155051.GD6566@sashak.voltaire.com>
References: <20081002010637.GN7396@sashak.voltaire.com> <20081003155051.GD6566@sashak.voltaire.com>
Message-ID: <20081003155247.GE6566@sashak.voltaire.com>

On 18:50 Fri 03 Oct , Sasha Khapyorsky wrote:
> On 17:14 Thu 02 Oct , Yicheng Jia wrote:
> >
> > The error I got are "osm_db_store: ERR 6109: Failed to remove
> > file:/tmp//guid2lid" and "osm_db_store: ERR 6108: Failed to rename the db
> > file to:/tmp//guid2lid". I set "OSM_DEFAULT_CACHE_DIR" to "/tmp/".

And what is the error status printed there?

Sasha

From mashirle at us.ibm.com Fri Oct 3 09:01:56 2008
From: mashirle at us.ibm.com (Shirley Ma)
Date: Fri, 03 Oct 2008 09:01:56 -0700
Subject: [ofa-general] Re: openSM for supporting IPv6 SNM MGIDs consolidation
In-Reply-To: <20081003153900.GC6566@sashak.voltaire.com>
References: <1223007639.8387.22.camel@IBM-29AB850785D.beaverton.ibm.com> <20081003153900.GC6566@sashak.voltaire.com>
Message-ID: <1223049716.8387.24.camel@IBM-29AB850785D.beaverton.ibm.com>

Thanks Sasha for the prompt reply.

On Fri, 2008-10-03 at 18:39 +0300, Sasha Khapyorsky wrote:
> > If so, can the customer just update openSM without updating the whole
> > fabric? Has any backward compatibility test been done between this
> > version 3.2.X openSM and other OFED releases, like OFED-1.3? Any issue
> > or no problem?
>
> AFAIK there should not be a problem with the upgrade. The management
> packages do not really depend on the rest of OFED.

How about ibutils? Are there any issues?
thanks
Shirley

From sashak at voltaire.com Fri Oct 3 09:04:48 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 3 Oct 2008 19:04:48 +0300
Subject: [ofa-general] Re: openSM for supporting IPv6 SNM MGIDs consolidation
In-Reply-To: <1223049716.8387.24.camel@IBM-29AB850785D.beaverton.ibm.com>
References: <1223007639.8387.22.camel@IBM-29AB850785D.beaverton.ibm.com> <20081003153900.GC6566@sashak.voltaire.com> <1223049716.8387.24.camel@IBM-29AB850785D.beaverton.ibm.com>
Message-ID: <20081003160448.GF6566@sashak.voltaire.com>

On 09:01 Fri 03 Oct , Shirley Ma wrote:
>
> How about ibutils? Are there any issues?

I have installed the whole management stack (libibcommon, libibumad,
opensm, libibmad, infiniband-diags) many times, both over various OFED
releases and without OFED. Didn't see any problems.

Sasha

From mashirle at us.ibm.com Fri Oct 3 09:29:18 2008
From: mashirle at us.ibm.com (Shirley Ma)
Date: Fri, 03 Oct 2008 09:29:18 -0700
Subject: [ofa-general] Re: openSM for supporting IPv6 SNM MGIDs consolidation
In-Reply-To: <20081003160448.GF6566@sashak.voltaire.com>
References: <1223007639.8387.22.camel@IBM-29AB850785D.beaverton.ibm.com> <20081003153900.GC6566@sashak.voltaire.com> <1223049716.8387.24.camel@IBM-29AB850785D.beaverton.ibm.com> <20081003160448.GF6566@sashak.voltaire.com>
Message-ID: <1223051358.8387.26.camel@IBM-29AB850785D.beaverton.ibm.com>

Thanks Sasha for confirming this. I will suggest that the customer
upgrade openSM.

On Fri, 2008-10-03 at 19:04 +0300, Sasha Khapyorsky wrote:
> On 09:01 Fri 03 Oct , Shirley Ma wrote:
> >
> > How about ibutils? Are there any issues?
>
> I have installed the whole management stack (libibcommon, libibumad,
> opensm, libibmad, infiniband-diags) many times, both over various OFED
> releases and without OFED. Didn't see any problems.
>
> Sasha

From vuhuong at mellanox.com Fri Oct 3 09:34:18 2008
From: vuhuong at mellanox.com (Vu Pham)
Date: Fri, 03 Oct 2008 09:34:18 -0700
Subject: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <48E38BAF.5000801@harr.org>
References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org>
Message-ID: <48E6498A.3070002@mellanox.com>

> Alternatively, is there anything in the SCST layer I should tweak. I'm
> still running rev 245 of that code (kinda old, but works with OFED 1.3.1
> w/o hacks).
>

What is the mode (pass thru, blockio...)?
What is the scst_threads= parameter?

>
>>
>>
>> My target server (with DAS) contains 8 2.8 GHz CPU cores and can
>> sustain over 200K IOPs locally, but only around 73K IOPs over SRP.

Is this number from one initiator or multiple?

>> Looking at /proc/interrupts, I see that the mlx_core (comp) device is
>> pushing about 135K Int/s on 1 of 2 CPUs. All CPUs are enabled for
>> that PCI-E slot, but it only ever uses 2 of the CPUs, and only 1 at a
>> time. None of the other CPUs has an interrupt rate more than about
>> 40-50K/s.
>>

The number of interrupts can be cut down if there are more completions
to be processed by sw, i.e. please test with multiple QPs between one
initiator vs. your target and multiple initiators vs. your target.

>> Does anyone know of a trick to spread those interrupts out more
>> (which I realize might be bad due to context switching), or something
>> else that will reduce my interrupts on that cpu? The mlx4 is a MSI-X
>> interrupt. I've changed it to an APIC int, but it seems to give
>> slightly lower performance.
>>

There is a userspace daemon, irqbalance, that dynamically directs IRQs
to different CPUs. You can define which CPUs CAN handle an IRQ but you
cannot control how it is done. You can look at
Documentation/IRQ-affinity.txt for details on how to configure it.
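
To make the affinity setting concrete, here is a minimal Python sketch of the hex CPU bitmask that the kernel's IRQ-affinity documentation describes for `/proc/irq/<N>/smp_affinity`. The helper name `smp_affinity_mask` is hypothetical, the IRQ number has to be read from /proc/interrupts on the actual machine, and the sketch assumes fewer than 32 CPUs (larger systems use comma-separated 32-bit groups, which are not handled here).

```python
def smp_affinity_mask(cpus):
    """Hypothetical helper: hex bitmask string with one bit per allowed CPU."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU index must be non-negative")
        mask |= 1 << cpu  # bit n set means CPU n may service the IRQ
    return format(mask, "x")


if __name__ == "__main__":
    # Example: restrict an IRQ to CPUs 4-7 of an 8-core box.
    print(smp_affinity_mask([4, 5, 6, 7]))
```

The resulting string would be written as root to `/proc/irq/<N>/smp_affinity`. A running balancing daemon may later overwrite a manual setting.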
In some cases I found it better performance-wise to shut irqbalance off,
assign the process to one (or more) CPUs, and use a different CPU to
serve interrupts.

-vu

From YJia at tmriusa.com Fri Oct 3 09:45:48 2008
From: YJia at tmriusa.com (Yicheng Jia)
Date: Fri, 3 Oct 2008 11:45:48 -0500
Subject: [ofa-general] questions about opensm and unmanaged switch
In-Reply-To: <20081003155247.GE6566@sashak.voltaire.com>
Message-ID:

err:4294967295.

I'm running it on QNX so ibnetdiscover is not available so far. I attach
the verbose output of opensm during startup. As you can see, it starts
to receive SMPs from other HCAs after several heavy sweeps. It looks
like the switch blocks them at the beginning?

Thanks!
Yicheng

# cat /tmp/opensm.log
Sep 26 16:11:35 362802 [0001] 0x04 -> OpenSM 3.2.1-0bc7db2 Sep 26 16:11:35 362802 [0001] 0x80 -> OpenSM 3.2.1-0bc7db2 Sep 26 16:11:35 363802 [0001] 0x80 -> Entering DISCOVERING state Sep 26 16:11:35 363802 [0001] 0x04 -> osm_report_sm_state: ****************************************************************** ***************** ENTERING SM DISCOVERING STATE ****************** ****************************************************************** Sep 26 16:11:35 415794 [0001] 0x04 -> osm_sm_mad_ctrl_bind: Binding to port 0x2c90200230775 Sep 26 16:11:35 575769 [0001] 0x04 -> osm_sa_mad_ctrl_bind: Binding to port GUID 0x2c90200230775 Sep 26 16:11:35 576769 [0008] 0x04 -> __osm_state_mgr_sweep_hop_0: ****************************************************************** ********************* INITIATING HEAVY SWEEP ********************* ****************************************************************** Sep 26 16:11:35 576769 [0003] 0x04 -> Received SMP on a 0 hop path: Initial path = 0 Return path = 0 Sep 26 16:11:35 576769 [0003] 0x04 -> __osm_ni_rcv_process_new: Discovered new Channel Adapter node, GUID 0x2c90200230774, TID 0x1234 Sep 26 16:11:35 577769 [0004] 0x04 -> __osm_nd_rcv_process_nd: Node 0x2c90200230774 Description = 0 Sep 26 16:11:35 577769
[0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230775 for parent node GUID 0x2c90200230774, TID 0x1236 Sep 26 16:11:35 577769 [0005] 0x04 -> __osm_pi_rcv_process_endport: Setting endport minimal MTU to:4 defined by port:0x2c90200230775 Sep 26 16:11:35 577769 [0005] 0x04 -> __osm_pi_rcv_process_endport: Setting endport minimal rate to:3 defined by port:0x2c90200230775 Sep 26 16:11:35 579768 [0003] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:1 port_num 1 with GUID 0x2c90200230775 for parent node GUID 0x2c90200230774, TID 0x1238 Sep 26 16:11:35 579768 [0006] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 1 with GUID 0x2c90200230775 for parent node GUID 0x2c90200230774, TID 0x1237 Sep 26 16:11:35 579768 [0004] 0x04 -> Received SMP on a 1 hop path: Initial path = 0,1 Return path = 0,1 Sep 26 16:11:35 579768 [0004] 0x04 -> __osm_ni_rcv_process_new: Discovered new Switch node, GUID 0x66a00d900083b, TID 0x1239 Sep 26 16:11:35 580768 [0005] 0x04 -> __osm_nd_rcv_process_nd: Node 0x66a00d900083b Description = SilverStorm 9024 DDR GUID=0x00066a00d900083b Sep 26 16:11:35 580768 [0003] 0x04 -> __osm_si_rcv_process_new: Subnet max multicast lid is 0xC400 Sep 26 16:11:35 581768 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x0 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x123c Sep 26 16:11:35 581768 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x123d Sep 26 16:11:35 581768 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x2 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x123e Sep 26 16:11:35 581768 [0005] 0x04 -> osm_pi_rcv_process: Initializing port number 0x2 Sep 26 16:11:35 582768 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x3 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x123f Sep 26 16:11:35 582768 [0003] 0x04 -> osm_pi_rcv_process: 
Initializing port number 0x3 Sep 26 16:11:35 582768 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x4 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1240 Sep 26 16:11:35 582768 [0006] 0x04 -> osm_pi_rcv_process: Initializing port number 0x4 Sep 26 16:11:35 582768 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x5 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1241 Sep 26 16:11:35 582768 [0004] 0x04 -> osm_pi_rcv_process: Initializing port number 0x5 Sep 26 16:11:35 583768 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x6 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1242 Sep 26 16:11:35 583768 [0005] 0x04 -> osm_pi_rcv_process: Initializing port number 0x6 Sep 26 16:11:35 583768 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x7 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1243 Sep 26 16:11:35 583768 [0003] 0x04 -> osm_pi_rcv_process: Initializing port number 0x7 Sep 26 16:11:35 584768 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x8 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1244 Sep 26 16:11:35 584768 [0006] 0x04 -> osm_pi_rcv_process: Initializing port number 0x8 Sep 26 16:11:35 584768 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x9 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1245 Sep 26 16:11:35 584768 [0004] 0x04 -> osm_pi_rcv_process: Initializing port number 0x9 Sep 26 16:11:35 584768 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0xA with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1246 Sep 26 16:11:35 584768 [0005] 0x04 -> osm_pi_rcv_process: Initializing port number 0xA Sep 26 16:11:35 584768 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0xB with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1247 Sep 26 16:11:35 584768 [0003] 0x04 -> osm_pi_rcv_process: Initializing port number 0xB Sep 26 16:11:35 
585768 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0xC with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1248 Sep 26 16:11:35 585768 [0006] 0x04 -> osm_pi_rcv_process: Initializing port number 0xC Sep 26 16:11:35 585768 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0xD with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1249 Sep 26 16:11:35 585768 [0004] 0x04 -> osm_pi_rcv_process: Initializing port number 0xD Sep 26 16:11:35 585768 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0xE with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x124a Sep 26 16:11:35 585768 [0005] 0x04 -> osm_pi_rcv_process: Initializing port number 0xE Sep 26 16:11:35 586767 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0xF with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x124b Sep 26 16:11:35 586767 [0003] 0x04 -> osm_pi_rcv_process: Initializing port number 0xF Sep 26 16:11:35 586767 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x10 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x124c Sep 26 16:11:35 586767 [0006] 0x04 -> osm_pi_rcv_process: Initializing port number 0x10 Sep 26 16:11:35 586767 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x11 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x124d Sep 26 16:11:35 586767 [0004] 0x04 -> osm_pi_rcv_process: Initializing port number 0x11 Sep 26 16:11:35 587767 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x12 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x124e Sep 26 16:11:35 587767 [0005] 0x04 -> osm_pi_rcv_process: Initializing port number 0x12 Sep 26 16:11:35 587767 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x13 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x124f Sep 26 16:11:35 587767 [0003] 0x04 -> osm_pi_rcv_process: Initializing port number 0x13 Sep 26 16:11:35 587767 [0006] 0x04 -> 
osm_pi_rcv_process: Discovered port num 0x14 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1250 Sep 26 16:11:35 587767 [0006] 0x04 -> osm_pi_rcv_process: Initializing port number 0x14 Sep 26 16:11:35 588767 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x15 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1251 Sep 26 16:11:35 588767 [0004] 0x04 -> osm_pi_rcv_process: Initializing port number 0x15 Sep 26 16:11:35 588767 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x16 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1252 Sep 26 16:11:35 588767 [0005] 0x04 -> osm_pi_rcv_process: Initializing port number 0x16 Sep 26 16:11:35 588767 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x17 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1253 Sep 26 16:11:35 588767 [0003] 0x04 -> osm_pi_rcv_process: Initializing port number 0x17 Sep 26 16:11:35 589767 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x18 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1254 Sep 26 16:11:35 589767 [0006] 0x04 -> osm_pi_rcv_process: Initializing port number 0x18 Sep 26 16:11:35 589767 [0004] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 0 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1255 Sep 26 16:11:35 589767 [0005] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 1 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1256 Sep 26 16:11:35 590767 [0003] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 2 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1257 Sep 26 16:11:35 590767 [0006] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 3 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1258 Sep 26 16:11:35 590767 [0004] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 4 with GUID 
0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1259 Sep 26 16:11:35 590767 [0005] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 5 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x125a Sep 26 16:11:35 591767 [0003] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 6 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x125b Sep 26 16:11:35 591767 [0006] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 7 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x125c Sep 26 16:11:35 592766 [0004] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 8 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x125d Sep 26 16:11:35 592766 [0005] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 9 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x125e Sep 26 16:11:35 592766 [0003] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 10 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x125f Sep 26 16:11:35 592766 [0006] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 11 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1260 Sep 26 16:11:35 593766 [0004] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 12 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1261 Sep 26 16:11:35 593766 [0005] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 13 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1262 Sep 26 16:11:35 594766 [0003] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 14 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1263 Sep 26 16:11:35 594766 [0006] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 15 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1264 Sep 26 16:11:35 594766 [0004] 0x04 
-> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 16 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1265 Sep 26 16:11:35 595766 [0005] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 17 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1266 Sep 26 16:11:35 595766 [0003] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 18 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1267 Sep 26 16:11:35 595766 [0006] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 19 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1268 Sep 26 16:11:35 596766 [0004] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 20 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1269 Sep 26 16:11:35 596766 [0005] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 21 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x126a Sep 26 16:11:35 596766 [0003] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 22 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x126b Sep 26 16:11:35 597766 [0006] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 23 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x126c Sep 26 16:11:35 597766 [0004] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 24 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x126d Sep 26 16:11:35 597766 [0008] 0x04 -> do_sweep: ****************************************************************** ********************** HEAVY SWEEP COMPLETE ********************** ****************************************************************** Sep 26 16:11:35 597766 [0008] 0x80 -> Entering MASTER state Sep 26 16:11:35 597766 [0008] 0x04 -> osm_report_sm_state: ****************************************************************** ******************** ENTERING SM MASTER 
STATE ******************** ****************************************************************** Sep 26 16:11:35 597766 [0008] 0x04 -> osm_prtn_add_port: port 0x2c90200230775 already in partition 'Default' (0x7fff). Will overwrite Sep 26 16:11:35 597766 [0008] 0x04 -> osm_sa_db_file_load: sa db file name is not specifed. Skip restore Sep 26 16:11:35 597766 [0008] 0x04 -> __osm_lid_mgr_process_our_sm_node: Assigning SM's port 0x0002c90200230775 to LID range [1,1] Sep 26 16:11:35 597766 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230775, port 0x1 and port 0x00066a00d900083b, port 0x1. Using lower OP_VLS of 3 Sep 26 16:11:35 598766 [0008] 0x04 -> do_sweep: ****************************************************************** **** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG ***** ****************************************************************** Sep 26 16:11:35 598766 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x00066a00d900083b, LID [6,6] Sep 26 16:11:35 599765 [0008] 0x04 -> do_sweep: ****************************************************************** ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG ***** ****************************************************************** Sep 26 16:11:35 599765 [0008] 0x04 -> osm_ucast_mgr_build_lid_matrices: Starting switches' Min Hop Table Assignment Sep 26 16:11:35 599765 [0008] 0x04 -> osm_ucast_mgr_process: LFT Tables configured on all switches Sep 26 16:11:35 600765 [0008] 0x04 -> do_sweep: ****************************************************************** **************** SWITCHES CONFIGURED FOR UNICAST ***************** ****************************************************************** Sep 26 16:11:35 600765 [0008] 0x04 -> __osm_mcast_mgr_build_spanning_tree: MLID 0xC000 has no members - nothing to do Sep 26 16:11:35 600765 [0008] 0x04 -> __osm_mcast_mgr_build_spanning_tree: MLID 0xC001 has no members - nothing to do Sep 26 16:11:35 600765 [0008] 
0x04 -> do_sweep: ****************************************************************** *************** SWITCHES CONFIGURED FOR MULTICAST **************** ****************************************************************** Sep 26 16:11:35 600765 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x1 and port 0x0002c90200230775, port 0x1. Using lower OP_VLS of 3 Sep 26 16:11:35 600765 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230775, port 0x1 and port 0x00066a00d900083b, port 0x1. Using lower OP_VLS of 3 Sep 26 16:11:35 600765 [0008] 0x04 -> do_sweep: ****************************************************************** ******* LINKS PORTS CONFIGURED - SET LINKS TO ARMED STATE ******** ****************************************************************** Sep 26 16:11:35 601765 [0008] 0x04 -> do_sweep: ****************************************************************** ************ LINKS ARMED - SET LINKS TO ACTIVE STATE ************* ****************************************************************** Sep 26 16:11:35 737744 [0008] 0x80 -> SUBNET UP Sep 26 16:11:35 737744 [0008] 0x04 -> __osm_state_mgr_up_msg: ****************************************************************** *************************** SUBNET UP **************************** ****************************************************************** Sep 26 16:11:35 973708 [0003] 0x04 -> Generic Notice dump: type.....................0x04 prod_type................1 (Channel Adapter) trap_num.................144 lid......................0 new_cap_mask.............0x02500a6a Sep 26 16:11:35 974708 [0003] 0x04 -> __osm_trap_rcv_process_request: Forcing heavy sweep. 
Received trap:144 Sep 26 16:11:35 974708 [0008] 0x04 -> __osm_state_mgr_sweep_hop_0: ****************************************************************** ********************* INITIATING HEAVY SWEEP ********************* ****************************************************************** Sep 26 16:11:35 975708 [0006] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c90200230774 TID 0x1277, discovered 0 times already Sep 26 16:11:35 975708 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230775 for parent node GUID 0x2c90200230774, TID 0x1278 Sep 26 16:11:35 976708 [0005] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Switch node 0x66a00d900083b TID 0x1279, discovered 0 times already Sep 26 16:11:35 976708 [0003] 0x04 -> __osm_si_rcv_process_existing: discovery_count is:1 Sep 26 16:11:35 978707 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x0 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x127b Sep 26 16:11:35 978707 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x127c Sep 26 16:11:35 979707 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x2 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x127d Sep 26 16:11:35 979707 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x3 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x127e Sep 26 16:11:35 980707 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x4 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x127f Sep 26 16:11:35 980707 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x5 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1280 Sep 26 16:11:35 981707 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x6 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1281 Sep 26 16:11:35 982707 [0003] 0x04 -> osm_pi_rcv_process: 
Discovered port num 0x7 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1282 Sep 26 16:11:35 982707 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x8 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1283 Sep 26 16:11:35 983707 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x9 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1284 Sep 26 16:11:35 983707 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0xA with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1285 Sep 26 16:11:35 984706 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0xB with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1286 Sep 26 16:11:35 984706 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0xC with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1287 Sep 26 16:11:35 985706 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0xD with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1288 Sep 26 16:11:35 985706 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0xE with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1289 Sep 26 16:11:35 986706 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0xF with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x128a Sep 26 16:11:35 986706 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x10 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x128b Sep 26 16:11:35 987706 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x11 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x128c Sep 26 16:11:35 988706 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x12 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x128d Sep 26 16:11:35 988706 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x13 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x128e Sep 26 
16:11:35 989706 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x14 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x128f Sep 26 16:11:35 989706 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x15 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1290 Sep 26 16:11:35 990706 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x16 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1291 Sep 26 16:11:35 990706 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x17 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1292 Sep 26 16:11:35 990706 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x18 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1293 Sep 26 16:11:35 990706 [0008] 0x04 -> do_sweep: ****************************************************************** ********************** HEAVY SWEEP COMPLETE ********************** ****************************************************************** Sep 26 16:11:35 991705 [0008] 0x04 -> osm_prtn_make_new: Duplicated partition definition: 'Default' (0x7fff) prev name 'Default'. Will use it Sep 26 16:11:35 991705 [0008] 0x04 -> osm_prtn_add_port: port 0x2c90200230775 already in partition 'Default' (0x7fff). Will overwrite Sep 26 16:11:35 991705 [0008] 0x04 -> osm_sa_db_file_load: sa db file name is not specifed. Skip restore Sep 26 16:11:35 991705 [0008] 0x04 -> __osm_lid_mgr_process_our_sm_node: Assigning SM's port 0x0002c90200230775 to LID range [1,1] Sep 26 16:11:35 991705 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230775, port 0x1 and port 0x00066a00d900083b, port 0x1. 
Using lower OP_VLS of 3 Sep 26 16:11:35 991705 [0008] 0x04 -> do_sweep: ****************************************************************** **** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG ***** ****************************************************************** Sep 26 16:11:35 991705 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x00066a00d900083b, LID [6,6] Sep 26 16:11:35 991705 [0008] 0x04 -> do_sweep: ****************************************************************** ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG ***** ****************************************************************** Sep 26 16:11:35 991705 [0008] 0x04 -> osm_ucast_mgr_build_lid_matrices: Starting switches' Min Hop Table Assignment Sep 26 16:11:35 991705 [0008] 0x04 -> osm_ucast_mgr_process: LFT Tables configured on all switches Sep 26 16:11:35 992705 [0008] 0x04 -> do_sweep: ****************************************************************** **************** SWITCHES CONFIGURED FOR UNICAST ***************** ****************************************************************** Sep 26 16:11:35 992705 [0008] 0x04 -> __osm_mcast_mgr_build_spanning_tree: MLID 0xC000 has no members - nothing to do Sep 26 16:11:35 992705 [0008] 0x04 -> __osm_mcast_mgr_build_spanning_tree: MLID 0xC001 has no members - nothing to do Sep 26 16:11:35 992705 [0008] 0x04 -> do_sweep: ****************************************************************** *************** SWITCHES CONFIGURED FOR MULTICAST **************** ****************************************************************** Sep 26 16:11:35 992705 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x1 and port 0x0002c90200230775, port 0x1. Using lower OP_VLS of 3 Sep 26 16:11:35 992705 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230775, port 0x1 and port 0x00066a00d900083b, port 0x1. 
Using lower OP_VLS of 3 Sep 26 16:11:35 992705 [0008] 0x04 -> do_sweep: ****************************************************************** ******* LINKS PORTS CONFIGURED - SET LINKS TO ARMED STATE ******** ****************************************************************** Sep 26 16:11:35 992705 [0008] 0x04 -> do_sweep: ****************************************************************** ************ LINKS ARMED - SET LINKS TO ACTIVE STATE ************* ****************************************************************** Sep 26 16:11:36 127685 [0008] 0x04 -> __osm_state_mgr_up_msg: ****************************************************************** *************************** SUBNET UP **************************** ****************************************************************** Sep 26 16:11:41 243902 [0005] 0x04 -> Generic Notice dump: type.....................0x01 prod_type................2 (Switch) trap_num.................128 sw_lid...................6 Sep 26 16:11:41 243902 [0005] 0x04 -> __osm_trap_rcv_process_request: Forcing heavy sweep. 
Received trap:128 Sep 26 16:11:41 243902 [0008] 0x04 -> __osm_state_mgr_sweep_hop_0: ****************************************************************** ********************* INITIATING HEAVY SWEEP ********************* ****************************************************************** Sep 26 16:11:41 244902 [0003] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c90200230774 TID 0x1295, discovered 0 times already Sep 26 16:11:41 244902 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230775 for parent node GUID 0x2c90200230774, TID 0x1296 Sep 26 16:11:41 245901 [0004] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Switch node 0x66a00d900083b TID 0x1297, discovered 0 times already Sep 26 16:11:41 245901 [0005] 0x04 -> __osm_si_rcv_process_existing: discovery_count is:1 Sep 26 16:11:41 246901 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x0 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1299 Sep 26 16:11:41 246901 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x129a Sep 26 16:11:41 246901 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x2 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x129b Sep 26 16:11:41 247901 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x3 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x129c Sep 26 16:11:41 247901 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x4 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x129d Sep 26 16:11:41 247901 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x5 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x129e Sep 26 16:11:41 248901 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x6 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x129f Sep 26 16:11:41 248901 [0005] 0x04 -> osm_pi_rcv_process: 
Discovered port num 0x7 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12a0 Sep 26 16:11:41 248901 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x8 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12a1 Sep 26 16:11:41 249901 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x9 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12a2 Sep 26 16:11:41 249901 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0xA with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12a3 Sep 26 16:11:41 249901 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0xB with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12a4 Sep 26 16:11:41 250901 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0xC with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12a5 Sep 26 16:11:41 250901 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0xD with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12a6 Sep 26 16:11:41 250901 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0xE with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12a7 Sep 26 16:11:41 251900 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0xF with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12a8 Sep 26 16:11:41 251900 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x10 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12a9 Sep 26 16:11:41 251900 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x11 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12aa Sep 26 16:11:41 252900 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x12 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12ab Sep 26 16:11:41 252900 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x13 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12ac Sep 26 
16:11:41 252900 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x14 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12ad Sep 26 16:11:41 253900 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x15 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12ae Sep 26 16:11:41 253900 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x16 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12af Sep 26 16:11:41 253900 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x17 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12b0 Sep 26 16:11:41 253900 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x18 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12b1 Sep 26 16:11:41 253900 [0008] 0x04 -> do_sweep: ****************************************************************** ********************** HEAVY SWEEP COMPLETE ********************** ****************************************************************** Sep 26 16:11:41 253900 [0008] 0x04 -> osm_prtn_make_new: Duplicated partition definition: 'Default' (0x7fff) prev name 'Default'. Will use it Sep 26 16:11:41 253900 [0008] 0x04 -> osm_prtn_add_port: port 0x2c90200230775 already in partition 'Default' (0x7fff). Will overwrite Sep 26 16:11:41 253900 [0008] 0x04 -> osm_sa_db_file_load: sa db file name is not specifed. Skip restore Sep 26 16:11:41 253900 [0008] 0x04 -> __osm_lid_mgr_process_our_sm_node: Assigning SM's port 0x0002c90200230775 to LID range [1,1] Sep 26 16:11:41 253900 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230775, port 0x1 and port 0x00066a00d900083b, port 0x1. 
Using lower OP_VLS of 3 Sep 26 16:11:41 254900 [0008] 0x04 -> do_sweep: ****************************************************************** **** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG ***** ****************************************************************** Sep 26 16:11:41 254900 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x00066a00d900083b, LID [6,6] Sep 26 16:11:41 254900 [0008] 0x04 -> do_sweep: ****************************************************************** ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG ***** ****************************************************************** Sep 26 16:11:41 254900 [0008] 0x04 -> osm_ucast_mgr_build_lid_matrices: Starting switches' Min Hop Table Assignment Sep 26 16:11:41 254900 [0008] 0x04 -> osm_ucast_mgr_process: LFT Tables configured on all switches Sep 26 16:11:41 254900 [0008] 0x04 -> do_sweep: ****************************************************************** **************** SWITCHES CONFIGURED FOR UNICAST ***************** ****************************************************************** Sep 26 16:11:41 254900 [0008] 0x04 -> __osm_mcast_mgr_build_spanning_tree: MLID 0xC000 has no members - nothing to do Sep 26 16:11:41 254900 [0008] 0x04 -> __osm_mcast_mgr_build_spanning_tree: MLID 0xC001 has no members - nothing to do Sep 26 16:11:41 254900 [0008] 0x04 -> do_sweep: ****************************************************************** *************** SWITCHES CONFIGURED FOR MULTICAST **************** ****************************************************************** Sep 26 16:11:41 254900 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x1 and port 0x0002c90200230775, port 0x1. Using lower OP_VLS of 3 Sep 26 16:11:41 254900 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230775, port 0x1 and port 0x00066a00d900083b, port 0x1. 
Using lower OP_VLS of 3
Sep 26 16:11:41 254900 [0008] 0x04 -> do_sweep:
******************************************************************
******* LINKS PORTS CONFIGURED - SET LINKS TO ARMED STATE ********
******************************************************************
Sep 26 16:11:41 254900 [0008] 0x04 -> do_sweep:
******************************************************************
************ LINKS ARMED - SET LINKS TO ACTIVE STATE *************
******************************************************************
Sep 26 16:11:41 389879 [0008] 0x04 -> __osm_state_mgr_up_msg:
******************************************************************
*************************** SUBNET UP ****************************
******************************************************************
Sep 26 16:11:45 364271 [0008] 0x04 -> __osm_state_mgr_light_sweep_start:
******************************************************************
********************* INITIATING LIGHT SWEEP *********************
******************************************************************
Sep 26 16:11:45 364271 [0008] 0x04 -> do_sweep:
******************************************************************
********************** LIGHT SWEEP COMPLETE **********************
******************************************************************
Sep 26 16:11:55 364741 [0008] 0x04 -> __osm_state_mgr_light_sweep_start:
******************************************************************
********************* INITIATING LIGHT SWEEP *********************
******************************************************************
Sep 26 16:11:55 364741 [0008] 0x04 -> do_sweep:
******************************************************************
********************** LIGHT SWEEP COMPLETE **********************
******************************************************************
Sep 26 16:12:05 365211 [0008] 0x04 -> __osm_state_mgr_light_sweep_start:
******************************************************************
********************* INITIATING LIGHT SWEEP *********************
******************************************************************
Sep 26 16:12:05 365211 [0008] 0x04 -> do_sweep:
******************************************************************
********************** LIGHT SWEEP COMPLETE **********************
******************************************************************
Sep 26 16:12:15 365680 [0008] 0x04 -> __osm_state_mgr_light_sweep_start:
******************************************************************
********************* INITIATING LIGHT SWEEP *********************
******************************************************************
Sep 26 16:12:15 365680 [0008] 0x04 -> do_sweep:
******************************************************************
********************** LIGHT SWEEP COMPLETE **********************
******************************************************************
Sep 26 16:12:18 691171 [0004] 0x04 -> Generic Notice dump:
type.....................0x01
prod_type................2 (Switch)
trap_num.................128
sw_lid...................6
Sep 26 16:12:18 691171 [0004] 0x04 -> __osm_trap_rcv_process_request: Forcing heavy sweep.
Received trap:128 Sep 26 16:12:18 691171 [0008] 0x04 -> __osm_state_mgr_sweep_hop_0: ****************************************************************** ********************* INITIATING HEAVY SWEEP ********************* ****************************************************************** Sep 26 16:12:18 691171 [0005] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c90200230774 TID 0x12b7, discovered 0 times already Sep 26 16:12:18 691171 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230775 for parent node GUID 0x2c90200230774, TID 0x12b8 Sep 26 16:12:18 692171 [0006] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Switch node 0x66a00d900083b TID 0x12b9, discovered 0 times already Sep 26 16:12:18 692171 [0004] 0x04 -> __osm_si_rcv_process_existing: discovery_count is:1 Sep 26 16:12:18 693171 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x0 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12bb Sep 26 16:12:18 693171 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12bc Sep 26 16:12:18 693171 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x2 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12bd Sep 26 16:12:18 694171 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x3 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12be Sep 26 16:12:18 694171 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x4 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12bf Sep 26 16:12:18 695171 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x5 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12c0 Sep 26 16:12:18 695171 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x6 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12c1 Sep 26 16:12:18 695171 [0004] 0x04 -> osm_pi_rcv_process: 
Discovered port num 0x7 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12c2 Sep 26 16:12:18 695171 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x8 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12c3 Sep 26 16:12:18 696171 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x9 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12c4 Sep 26 16:12:18 696171 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0xA with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12c5 Sep 26 16:12:18 697170 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0xB with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12c6 Sep 26 16:12:18 697170 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0xC with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12c7 Sep 26 16:12:18 697170 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0xD with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12c8 Sep 26 16:12:18 697170 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0xE with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12c9 Sep 26 16:12:18 698170 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0xF with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12ca Sep 26 16:12:18 698170 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x10 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12cb Sep 26 16:12:18 698170 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x11 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12cc Sep 26 16:12:18 699170 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x12 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12cd Sep 26 16:12:18 699170 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x13 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12ce Sep 26 
16:12:18 699170 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x14 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12cf Sep 26 16:12:18 700170 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x15 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12d0 Sep 26 16:12:18 700170 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x16 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12d1 Sep 26 16:12:18 700170 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x17 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12d2 Sep 26 16:12:18 701170 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x18 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12d3 Sep 26 16:12:18 701170 [0003] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 5 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12d5 Sep 26 16:12:18 701170 [0006] 0x04 -> Received SMP on a 2 hop path: Initial path = 0,1,5 Return path = 0,1,1 Sep 26 16:12:18 701170 [0006] 0x04 -> __osm_ni_rcv_process_new: Discovered new Channel Adapter node, GUID 0x2c902002307a0, TID 0x12d4 Sep 26 16:12:18 702170 [0004] 0x04 -> __osm_nd_rcv_process_nd: Node 0x2c902002307a0 Description = 1 Sep 26 16:12:18 702170 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c902002307a1 for parent node GUID 0x2c902002307a0, TID 0x12d7 Sep 26 16:12:18 703170 [0003] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 1 with GUID 0x2c902002307a1 for parent node GUID 0x2c902002307a0, TID 0x12d8 Sep 26 16:12:18 703170 [0006] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:1 port_num 1 with GUID 0x2c902002307a1 for parent node GUID 0x2c902002307a0, TID 0x12d9 Sep 26 16:12:18 703170 [0008] 0x04 -> do_sweep: ****************************************************************** ********************** HEAVY SWEEP COMPLETE ********************** 
****************************************************************** Sep 26 16:12:18 704169 [0008] 0x04 -> osm_prtn_make_new: Duplicated partition definition: 'Default' (0x7fff) prev name 'Default'. Will use it Sep 26 16:12:18 704169 [0008] 0x04 -> osm_prtn_add_port: port 0x2c90200230775 already in partition 'Default' (0x7fff). Will overwrite Sep 26 16:12:18 704169 [0008] 0x04 -> osm_sa_db_file_load: sa db file name is not specifed. Skip restore Sep 26 16:12:18 704169 [0008] 0x04 -> __osm_lid_mgr_process_our_sm_node: Assigning SM's port 0x0002c90200230775 to LID range [1,1] Sep 26 16:12:18 704169 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230775, port 0x1 and port 0x00066a00d900083b, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:18 704169 [0008] 0x04 -> do_sweep: ****************************************************************** **** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG ***** ****************************************************************** Sep 26 16:12:18 704169 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x00066a00d900083b, LID [6,6] Sep 26 16:12:18 704169 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c902002307a1, LID [2,2] Sep 26 16:12:18 704169 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002307a1, port 0x1 and port 0x00066a00d900083b, port 0x5. 
Using lower OP_VLS of 3 Sep 26 16:12:18 705169 [0008] 0x04 -> do_sweep: ****************************************************************** ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG ***** ****************************************************************** Sep 26 16:12:18 705169 [0008] 0x04 -> osm_ucast_mgr_build_lid_matrices: Starting switches' Min Hop Table Assignment Sep 26 16:12:18 705169 [0008] 0x04 -> osm_ucast_mgr_process: LFT Tables configured on all switches Sep 26 16:12:18 706169 [0008] 0x04 -> do_sweep: ****************************************************************** **************** SWITCHES CONFIGURED FOR UNICAST ***************** ****************************************************************** Sep 26 16:12:18 706169 [0008] 0x04 -> __osm_mcast_mgr_build_spanning_tree: MLID 0xC000 has no members - nothing to do Sep 26 16:12:18 706169 [0008] 0x04 -> __osm_mcast_mgr_build_spanning_tree: MLID 0xC001 has no members - nothing to do Sep 26 16:12:18 706169 [0008] 0x04 -> do_sweep: ****************************************************************** *************** SWITCHES CONFIGURED FOR MULTICAST **************** ****************************************************************** Sep 26 16:12:18 706169 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x1 and port 0x0002c90200230775, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:18 706169 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x5 and port 0x0002c902002307a1, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:18 706169 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230775, port 0x1 and port 0x00066a00d900083b, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:18 706169 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002307a1, port 0x1 and port 0x00066a00d900083b, port 0x5. 
Using lower OP_VLS of 3 Sep 26 16:12:18 706169 [0008] 0x04 -> do_sweep: ****************************************************************** ******* LINKS PORTS CONFIGURED - SET LINKS TO ARMED STATE ******** ****************************************************************** Sep 26 16:12:18 707169 [0008] 0x04 -> do_sweep: ****************************************************************** ************ LINKS ARMED - SET LINKS TO ACTIVE STATE ************* ****************************************************************** Sep 26 16:12:18 844148 [0008] 0x04 -> __osm_state_mgr_up_msg: ****************************************************************** *************************** SUBNET UP **************************** ****************************************************************** Sep 26 16:12:19 731012 [0006] 0x04 -> Generic Notice dump: type.....................0x01 prod_type................2 (Switch) trap_num.................128 sw_lid...................6 Sep 26 16:12:19 731012 [0006] 0x04 -> __osm_trap_rcv_process_request: Forcing heavy sweep. 
Received trap:128 Sep 26 16:12:19 731012 [0008] 0x04 -> __osm_state_mgr_sweep_hop_0: ****************************************************************** ********************* INITIATING HEAVY SWEEP ********************* ****************************************************************** Sep 26 16:12:19 731012 [0004] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c90200230774 TID 0x12e1, discovered 0 times already Sep 26 16:12:19 732012 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230775 for parent node GUID 0x2c90200230774, TID 0x12e2 Sep 26 16:12:19 732012 [0003] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Switch node 0x66a00d900083b TID 0x12e3, discovered 0 times already Sep 26 16:12:19 733012 [0006] 0x04 -> __osm_si_rcv_process_existing: discovery_count is:1 Sep 26 16:12:19 733012 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x0 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12e5 Sep 26 16:12:19 733012 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12e6 Sep 26 16:12:19 734012 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x2 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12e7 Sep 26 16:12:19 734012 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x3 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12e8 Sep 26 16:12:19 734012 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x4 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12e9 Sep 26 16:12:19 735012 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x5 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12ea Sep 26 16:12:19 735012 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x6 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12eb Sep 26 16:12:19 735012 [0006] 0x04 -> osm_pi_rcv_process: 
Discovered port num 0x7 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12ec Sep 26 16:12:19 736012 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x8 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12ed Sep 26 16:12:19 736012 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x9 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12ee Sep 26 16:12:19 736012 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0xA with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12ef Sep 26 16:12:19 737011 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0xB with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12f0 Sep 26 16:12:19 737011 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0xC with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12f1 Sep 26 16:12:19 737011 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0xD with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12f2 Sep 26 16:12:19 738011 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0xE with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12f3 Sep 26 16:12:19 738011 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0xF with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12f4 Sep 26 16:12:19 738011 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x10 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12f5 Sep 26 16:12:19 739011 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x11 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12f6 Sep 26 16:12:19 739011 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x12 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12f7 Sep 26 16:12:19 739011 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x13 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12f8 Sep 26 
16:12:19 740011 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x14 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12f9 Sep 26 16:12:19 740011 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x15 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12fa Sep 26 16:12:19 740011 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x16 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12fb Sep 26 16:12:19 741011 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x17 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12fc Sep 26 16:12:19 741011 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x18 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12fd Sep 26 16:12:19 741011 [0005] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 4 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x12ff Sep 26 16:12:19 741011 [0003] 0x04 -> Received SMP on a 2 hop path: Initial path = 0,1,4 Return path = 0,1,1 Sep 26 16:12:19 741011 [0003] 0x04 -> __osm_ni_rcv_process_new: Discovered new Channel Adapter node, GUID 0x2c9020024951c, TID 0x12fe Sep 26 16:12:19 742011 [0006] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c902002307a0 TID 0x1300, discovered 0 times already Sep 26 16:12:19 742011 [0004] 0x04 -> __osm_nd_rcv_process_nd: Node 0x2c9020024951c Description = 2 Sep 26 16:12:19 743010 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c902002307a1 for parent node GUID 0x2c902002307a0, TID 0x1303 Sep 26 16:12:19 743010 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c9020024951d for parent node GUID 0x2c9020024951c, TID 0x1302 Sep 26 16:12:19 744010 [0006] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 1 with GUID 0x2c9020024951d for parent node GUID 0x2c9020024951c, TID 0x1304 Sep 26 16:12:19 744010 [0004] 0x04 -> 
osm_pkey_rcv_process: Got GetResp(PKey) block:1 port_num 1 with GUID 0x2c9020024951d for parent node GUID 0x2c9020024951c, TID 0x1305 Sep 26 16:12:19 744010 [0008] 0x04 -> do_sweep: ****************************************************************** ********************** HEAVY SWEEP COMPLETE ********************** ****************************************************************** Sep 26 16:12:19 744010 [0008] 0x04 -> osm_prtn_make_new: Duplicated partition definition: 'Default' (0x7fff) prev name 'Default'. Will use it Sep 26 16:12:19 744010 [0008] 0x04 -> osm_prtn_add_port: port 0x2c90200230775 already in partition 'Default' (0x7fff). Will overwrite Sep 26 16:12:19 744010 [0008] 0x04 -> osm_sa_db_file_load: sa db file name is not specifed. Skip restore Sep 26 16:12:19 744010 [0008] 0x04 -> __osm_lid_mgr_process_our_sm_node: Assigning SM's port 0x0002c90200230775 to LID range [1,1] Sep 26 16:12:19 744010 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230775, port 0x1 and port 0x00066a00d900083b, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:19 744010 [0008] 0x04 -> do_sweep: ****************************************************************** **** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG ***** ****************************************************************** Sep 26 16:12:19 744010 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c9020024951d, LID [3,3] Sep 26 16:12:19 744010 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c9020024951d, port 0x1 and port 0x00066a00d900083b, port 0x4. Using lower OP_VLS of 3 Sep 26 16:12:19 744010 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x00066a00d900083b, LID [6,6] Sep 26 16:12:19 744010 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c902002307a1, LID [2,2] Sep 26 16:12:19 744010 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. 
Port 0x0002c902002307a1, port 0x1 and port 0x00066a00d900083b, port 0x5. Using lower OP_VLS of 3 Sep 26 16:12:19 745010 [0008] 0x04 -> do_sweep: ****************************************************************** ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG ***** ****************************************************************** Sep 26 16:12:19 745010 [0008] 0x04 -> osm_ucast_mgr_build_lid_matrices: Starting switches' Min Hop Table Assignment Sep 26 16:12:19 746010 [0008] 0x04 -> osm_ucast_mgr_process: LFT Tables configured on all switches Sep 26 16:12:19 746010 [0008] 0x04 -> do_sweep: ****************************************************************** **************** SWITCHES CONFIGURED FOR UNICAST ***************** ****************************************************************** Sep 26 16:12:19 746010 [0008] 0x04 -> __osm_mcast_mgr_build_spanning_tree: MLID 0xC000 has no members - nothing to do Sep 26 16:12:19 746010 [0008] 0x04 -> __osm_mcast_mgr_build_spanning_tree: MLID 0xC001 has no members - nothing to do Sep 26 16:12:19 746010 [0008] 0x04 -> do_sweep: ****************************************************************** *************** SWITCHES CONFIGURED FOR MULTICAST **************** ****************************************************************** Sep 26 16:12:19 746010 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c9020024951d, port 0x1 and port 0x00066a00d900083b, port 0x4. Using lower OP_VLS of 3 Sep 26 16:12:19 746010 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x1 and port 0x0002c90200230775, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:19 746010 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x4 and port 0x0002c9020024951d, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:19 746010 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. 
Port 0x00066a00d900083b, port 0x5 and port 0x0002c902002307a1, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:19 746010 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230775, port 0x1 and port 0x00066a00d900083b, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:19 746010 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002307a1, port 0x1 and port 0x00066a00d900083b, port 0x5. Using lower OP_VLS of 3 Sep 26 16:12:19 746010 [0008] 0x04 -> do_sweep: ****************************************************************** ******* LINKS PORTS CONFIGURED - SET LINKS TO ARMED STATE ******** ****************************************************************** Sep 26 16:12:19 748010 [0008] 0x04 -> do_sweep: ****************************************************************** ************ LINKS ARMED - SET LINKS TO ACTIVE STATE ************* ****************************************************************** Sep 26 16:12:19 883989 [0008] 0x04 -> __osm_state_mgr_up_msg: ****************************************************************** *************************** SUBNET UP **************************** ****************************************************************** Sep 26 16:12:20 770853 [0004] 0x04 -> Generic Notice dump: type.....................0x01 prod_type................2 (Switch) trap_num.................128 sw_lid...................6 Sep 26 16:12:20 770853 [0004] 0x04 -> __osm_trap_rcv_process_request: Forcing heavy sweep. 
Received trap:128 Sep 26 16:12:20 770853 [0008] 0x04 -> __osm_state_mgr_sweep_hop_0: ****************************************************************** ********************* INITIATING HEAVY SWEEP ********************* ****************************************************************** Sep 26 16:12:20 771853 [0005] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c90200230774 TID 0x130d, discovered 0 times already Sep 26 16:12:20 771853 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230775 for parent node GUID 0x2c90200230774, TID 0x130e Sep 26 16:12:20 772853 [0006] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Switch node 0x66a00d900083b TID 0x130f, discovered 0 times already Sep 26 16:12:20 772853 [0004] 0x04 -> __osm_si_rcv_process_existing: discovery_count is:1 Sep 26 16:12:20 773853 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x0 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1311 Sep 26 16:12:20 773853 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1312 Sep 26 16:12:20 773853 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x2 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1313 Sep 26 16:12:20 774853 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x3 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1314 Sep 26 16:12:20 774853 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x4 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1315 Sep 26 16:12:20 774853 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x5 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1316 Sep 26 16:12:20 775852 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x6 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1317 Sep 26 16:12:20 775852 [0004] 0x04 -> osm_pi_rcv_process: 
Discovered port num 0x7 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1318 Sep 26 16:12:20 775852 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x8 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1319 Sep 26 16:12:20 775852 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x9 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x131a Sep 26 16:12:20 776852 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0xA with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x131b Sep 26 16:12:20 776852 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0xB with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x131c Sep 26 16:12:20 777852 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0xC with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x131d Sep 26 16:12:20 777852 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0xD with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x131e Sep 26 16:12:20 777852 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0xE with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x131f Sep 26 16:12:20 777852 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0xF with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1320 Sep 26 16:12:20 778852 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x10 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1321 Sep 26 16:12:20 778852 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x11 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1322 Sep 26 16:12:20 778852 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x12 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1323 Sep 26 16:12:20 779852 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x13 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1324 Sep 26 
16:12:20 779852 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x14 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1325 Sep 26 16:12:20 779852 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x15 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1326 Sep 26 16:12:20 780852 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x16 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1327 Sep 26 16:12:20 780852 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x17 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1328 Sep 26 16:12:20 780852 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x18 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1329 Sep 26 16:12:20 781851 [0003] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 2 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x132b Sep 26 16:12:20 781851 [0006] 0x04 -> Received SMP on a 2 hop path: Initial path = 0,1,2 Return path = 0,1,1 Sep 26 16:12:20 781851 [0006] 0x04 -> __osm_ni_rcv_process_new: Discovered new Channel Adapter node, GUID 0x2c90200230708, TID 0x132a Sep 26 16:12:20 781851 [0004] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c9020024951c TID 0x132c, discovered 0 times already Sep 26 16:12:20 782851 [0005] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c902002307a0 TID 0x132d, discovered 0 times already Sep 26 16:12:20 782851 [0003] 0x04 -> __osm_nd_rcv_process_nd: Node 0x2c90200230708 Description = 4 Sep 26 16:12:20 782851 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230709 for parent node GUID 0x2c90200230708, TID 0x132f Sep 26 16:12:20 783851 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c9020024951d for parent node GUID 0x2c9020024951c, TID 0x1330 Sep 26 16:12:20 783851 [0005] 0x04 -> osm_pi_rcv_process: Discovered 
port num 0x1 with GUID 0x2c902002307a1 for parent node GUID 0x2c902002307a0, TID 0x1331 Sep 26 16:12:20 783851 [0003] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 1 with GUID 0x2c90200230709 for parent node GUID 0x2c90200230708, TID 0x1332 Sep 26 16:12:20 783851 [0006] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:1 port_num 1 with GUID 0x2c90200230709 for parent node GUID 0x2c90200230708, TID 0x1333 Sep 26 16:12:20 784851 [0008] 0x04 -> do_sweep: ****************************************************************** ********************** HEAVY SWEEP COMPLETE ********************** ****************************************************************** Sep 26 16:12:20 784851 [0008] 0x04 -> osm_prtn_make_new: Duplicated partition definition: 'Default' (0x7fff) prev name 'Default'. Will use it Sep 26 16:12:20 784851 [0008] 0x04 -> osm_prtn_add_port: port 0x2c90200230775 already in partition 'Default' (0x7fff). Will overwrite Sep 26 16:12:20 784851 [0008] 0x04 -> osm_sa_db_file_load: sa db file name is not specifed. Skip restore Sep 26 16:12:20 784851 [0008] 0x04 -> __osm_lid_mgr_process_our_sm_node: Assigning SM's port 0x0002c90200230775 to LID range [1,1] Sep 26 16:12:20 784851 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230775, port 0x1 and port 0x00066a00d900083b, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:20 784851 [0008] 0x04 -> do_sweep: ****************************************************************** **** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG ***** ****************************************************************** Sep 26 16:12:20 784851 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c90200230709, LID [4,4] Sep 26 16:12:20 784851 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230709, port 0x1 and port 0x00066a00d900083b, port 0x2. 
Using lower OP_VLS of 3 Sep 26 16:12:20 784851 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c9020024951d, LID [3,3] Sep 26 16:12:20 784851 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c9020024951d, port 0x1 and port 0x00066a00d900083b, port 0x4. Using lower OP_VLS of 3 Sep 26 16:12:20 784851 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x00066a00d900083b, LID [6,6] Sep 26 16:12:20 784851 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c902002307a1, LID [2,2] Sep 26 16:12:20 784851 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002307a1, port 0x1 and port 0x00066a00d900083b, port 0x5. Using lower OP_VLS of 3 Sep 26 16:12:20 785851 [0008] 0x04 -> do_sweep: ****************************************************************** ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG ***** ****************************************************************** Sep 26 16:12:20 785851 [0008] 0x04 -> osm_ucast_mgr_build_lid_matrices: Starting switches' Min Hop Table Assignment Sep 26 16:12:20 785851 [0008] 0x04 -> osm_ucast_mgr_process: LFT Tables configured on all switches Sep 26 16:12:20 786851 [0008] 0x04 -> do_sweep: ****************************************************************** **************** SWITCHES CONFIGURED FOR UNICAST ***************** ****************************************************************** Sep 26 16:12:20 786851 [0008] 0x04 -> __osm_mcast_mgr_build_spanning_tree: MLID 0xC000 has no members - nothing to do Sep 26 16:12:20 786851 [0008] 0x04 -> __osm_mcast_mgr_build_spanning_tree: MLID 0xC001 has no members - nothing to do Sep 26 16:12:20 786851 [0008] 0x04 -> do_sweep: ****************************************************************** *************** SWITCHES CONFIGURED FOR MULTICAST **************** ****************************************************************** Sep 26 16:12:20 786851 [0008] 0x04 -> 
osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230709, port 0x1 and port 0x00066a00d900083b, port 0x2. Using lower OP_VLS of 3 Sep 26 16:12:20 786851 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c9020024951d, port 0x1 and port 0x00066a00d900083b, port 0x4. Using lower OP_VLS of 3 Sep 26 16:12:20 786851 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x1 and port 0x0002c90200230775, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:20 786851 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x2 and port 0x0002c90200230709, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:20 786851 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x4 and port 0x0002c9020024951d, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:20 786851 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x5 and port 0x0002c902002307a1, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:20 786851 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230775, port 0x1 and port 0x00066a00d900083b, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:20 786851 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002307a1, port 0x1 and port 0x00066a00d900083b, port 0x5. 
Using lower OP_VLS of 3
Sep 26 16:12:20 786851 [0008] 0x04 -> do_sweep:
******************************************************************
******* LINKS PORTS CONFIGURED - SET LINKS TO ARMED STATE ********
******************************************************************
Sep 26 16:12:20 787851 [0008] 0x04 -> do_sweep:
******************************************************************
************ LINKS ARMED - SET LINKS TO ACTIVE STATE *************
******************************************************************
Sep 26 16:12:20 923830 [0008] 0x04 -> __osm_state_mgr_up_msg:
******************************************************************
*************************** SUBNET UP ****************************
******************************************************************
Sep 26 16:12:21 811694 [0006] 0x04 -> Generic Notice dump:
    type.....................0x01
    prod_type................2 (Switch)
    trap_num.................128
    sw_lid...................6
Sep 26 16:12:21 811694 [0006] 0x04 -> __osm_trap_rcv_process_request: Forcing heavy sweep.
Received trap:128 Sep 26 16:12:21 811694 [0008] 0x04 -> __osm_state_mgr_sweep_hop_0: ****************************************************************** ********************* INITIATING HEAVY SWEEP ********************* ****************************************************************** Sep 26 16:12:21 811694 [0004] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c90200230774 TID 0x133b, discovered 0 times already Sep 26 16:12:21 812694 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230775 for parent node GUID 0x2c90200230774, TID 0x133c Sep 26 16:12:21 812694 [0003] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Switch node 0x66a00d900083b TID 0x133d, discovered 0 times already Sep 26 16:12:21 812694 [0006] 0x04 -> __osm_si_rcv_process_existing: discovery_count is:1 Sep 26 16:12:21 813694 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x0 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x133f Sep 26 16:12:21 813694 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1340 Sep 26 16:12:21 814693 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x2 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1341 Sep 26 16:12:21 814693 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x3 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1342 Sep 26 16:12:21 814693 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x4 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1343 Sep 26 16:12:21 815693 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x5 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1344 Sep 26 16:12:21 815693 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x6 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1345 Sep 26 16:12:21 815693 [0006] 0x04 -> osm_pi_rcv_process: 
Discovered port num 0x7 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1346 Sep 26 16:12:21 816693 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x8 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1347 Sep 26 16:12:21 816693 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x9 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1348 Sep 26 16:12:21 816693 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0xA with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1349 Sep 26 16:12:21 817693 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0xB with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x134a Sep 26 16:12:21 817693 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0xC with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x134b Sep 26 16:12:21 817693 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0xD with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x134c Sep 26 16:12:21 818693 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0xE with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x134d Sep 26 16:12:21 818693 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0xF with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x134e Sep 26 16:12:21 818693 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x10 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x134f Sep 26 16:12:21 819693 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x11 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1350 Sep 26 16:12:21 819693 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x12 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1351 Sep 26 16:12:21 819693 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x13 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1352 Sep 26 
16:12:21 820693 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x14 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1353 Sep 26 16:12:21 820693 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x15 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1354 Sep 26 16:12:21 820693 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x16 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1355 Sep 26 16:12:21 821692 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x17 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1356 Sep 26 16:12:21 821692 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x18 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1357 Sep 26 16:12:21 821692 [0005] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c90200230708 TID 0x1358, discovered 0 times already Sep 26 16:12:21 822692 [0003] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 3 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x135a Sep 26 16:12:21 822692 [0006] 0x04 -> Received SMP on a 2 hop path: Initial path = 0,1,3 Return path = 0,1,1 Sep 26 16:12:21 822692 [0006] 0x04 -> __osm_ni_rcv_process_new: Discovered new Channel Adapter node, GUID 0x2c902002306e8, TID 0x1359 Sep 26 16:12:21 822692 [0004] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c9020024951c TID 0x135b, discovered 0 times already Sep 26 16:12:21 823692 [0005] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c902002307a0 TID 0x135c, discovered 0 times already Sep 26 16:12:21 823692 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230709 for parent node GUID 0x2c90200230708, TID 0x135d Sep 26 16:12:21 823692 [0006] 0x04 -> __osm_nd_rcv_process_nd: Node 0x2c902002306e8 Description = 3 Sep 26 16:12:21 824692 [0004] 0x04 -> osm_pi_rcv_process: Discovered 
port num 0x1 with GUID 0x2c902002306e9 for parent node GUID 0x2c902002306e8, TID 0x135f Sep 26 16:12:21 824692 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c9020024951d for parent node GUID 0x2c9020024951c, TID 0x1360 Sep 26 16:12:21 824692 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c902002307a1 for parent node GUID 0x2c902002307a0, TID 0x1361 Sep 26 16:12:21 824692 [0006] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 1 with GUID 0x2c902002306e9 for parent node GUID 0x2c902002306e8, TID 0x1362 Sep 26 16:12:21 825692 [0004] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:1 port_num 1 with GUID 0x2c902002306e9 for parent node GUID 0x2c902002306e8, TID 0x1363 Sep 26 16:12:21 825692 [0008] 0x04 -> do_sweep: ****************************************************************** ********************** HEAVY SWEEP COMPLETE ********************** ****************************************************************** Sep 26 16:12:21 825692 [0008] 0x04 -> osm_prtn_make_new: Duplicated partition definition: 'Default' (0x7fff) prev name 'Default'. Will use it Sep 26 16:12:21 825692 [0008] 0x04 -> osm_prtn_add_port: port 0x2c90200230775 already in partition 'Default' (0x7fff). Will overwrite Sep 26 16:12:21 825692 [0008] 0x04 -> osm_sa_db_file_load: sa db file name is not specifed. Skip restore Sep 26 16:12:21 825692 [0008] 0x04 -> __osm_lid_mgr_process_our_sm_node: Assigning SM's port 0x0002c90200230775 to LID range [1,1] Sep 26 16:12:21 825692 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230775, port 0x1 and port 0x00066a00d900083b, port 0x1. 
Using lower OP_VLS of 3 Sep 26 16:12:21 825692 [0008] 0x04 -> do_sweep: ****************************************************************** **** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG ***** ****************************************************************** Sep 26 16:12:21 825692 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c90200230709, LID [4,4] Sep 26 16:12:21 825692 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230709, port 0x1 and port 0x00066a00d900083b, port 0x2. Using lower OP_VLS of 3 Sep 26 16:12:21 825692 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c9020024951d, LID [3,3] Sep 26 16:12:21 825692 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c9020024951d, port 0x1 and port 0x00066a00d900083b, port 0x4. Using lower OP_VLS of 3 Sep 26 16:12:21 825692 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x00066a00d900083b, LID [6,6] Sep 26 16:12:21 825692 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c902002307a1, LID [2,2] Sep 26 16:12:21 825692 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002307a1, port 0x1 and port 0x00066a00d900083b, port 0x5. Using lower OP_VLS of 3 Sep 26 16:12:21 825692 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c902002306e9, LID [5,5] Sep 26 16:12:21 825692 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002306e9, port 0x1 and port 0x00066a00d900083b, port 0x3. 
Using lower OP_VLS of 3 Sep 26 16:12:21 827691 [0008] 0x04 -> do_sweep: ****************************************************************** ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG ***** ****************************************************************** Sep 26 16:12:21 827691 [0008] 0x04 -> osm_ucast_mgr_build_lid_matrices: Starting switches' Min Hop Table Assignment Sep 26 16:12:21 827691 [0008] 0x04 -> osm_ucast_mgr_process: LFT Tables configured on all switches Sep 26 16:12:21 827691 [0008] 0x04 -> do_sweep: ****************************************************************** **************** SWITCHES CONFIGURED FOR UNICAST ***************** ****************************************************************** Sep 26 16:12:21 827691 [0008] 0x04 -> __osm_mcast_mgr_build_spanning_tree: MLID 0xC000 has no members - nothing to do Sep 26 16:12:21 827691 [0008] 0x04 -> __osm_mcast_mgr_build_spanning_tree: MLID 0xC001 has no members - nothing to do Sep 26 16:12:21 827691 [0008] 0x04 -> do_sweep: ****************************************************************** *************** SWITCHES CONFIGURED FOR MULTICAST **************** ****************************************************************** Sep 26 16:12:21 827691 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230709, port 0x1 and port 0x00066a00d900083b, port 0x2. Using lower OP_VLS of 3 Sep 26 16:12:21 827691 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c9020024951d, port 0x1 and port 0x00066a00d900083b, port 0x4. Using lower OP_VLS of 3 Sep 26 16:12:21 827691 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x1 and port 0x0002c90200230775, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:21 827691 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x2 and port 0x0002c90200230709, port 0x1. 
Using lower OP_VLS of 3 Sep 26 16:12:21 827691 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x3 and port 0x0002c902002306e9, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:21 827691 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x4 and port 0x0002c9020024951d, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:21 828691 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x5 and port 0x0002c902002307a1, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:21 828691 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230775, port 0x1 and port 0x00066a00d900083b, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:21 828691 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002307a1, port 0x1 and port 0x00066a00d900083b, port 0x5. Using lower OP_VLS of 3 Sep 26 16:12:21 828691 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002306e9, port 0x1 and port 0x00066a00d900083b, port 0x3. 
Using lower OP_VLS of 3
Sep 26 16:12:21 828691 [0008] 0x04 -> do_sweep:
******************************************************************
******* LINKS PORTS CONFIGURED - SET LINKS TO ARMED STATE ********
******************************************************************
Sep 26 16:12:21 829691 [0008] 0x04 -> do_sweep:
******************************************************************
************ LINKS ARMED - SET LINKS TO ACTIVE STATE *************
******************************************************************
Sep 26 16:12:21 966670 [0008] 0x04 -> __osm_state_mgr_up_msg:
******************************************************************
*************************** SUBNET UP ****************************
******************************************************************
Sep 26 16:12:22 851535 [0004] 0x04 -> Generic Notice dump:
    type.....................0x01
    prod_type................2 (Switch)
    trap_num.................128
    sw_lid...................6
Sep 26 16:12:22 851535 [0004] 0x04 -> __osm_trap_rcv_process_request: Forcing heavy sweep.
Received trap:128 Sep 26 16:12:22 851535 [0008] 0x04 -> __osm_state_mgr_sweep_hop_0: ****************************************************************** ********************* INITIATING HEAVY SWEEP ********************* ****************************************************************** Sep 26 16:12:22 851535 [0005] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c90200230774 TID 0x136b, discovered 0 times already Sep 26 16:12:22 852535 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230775 for parent node GUID 0x2c90200230774, TID 0x136c Sep 26 16:12:22 852535 [0006] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Switch node 0x66a00d900083b TID 0x136d, discovered 0 times already Sep 26 16:12:22 853534 [0004] 0x04 -> __osm_si_rcv_process_existing: discovery_count is:1 Sep 26 16:12:22 853534 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x0 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x136f Sep 26 16:12:22 854534 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1370 Sep 26 16:12:22 854534 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x2 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1371 Sep 26 16:12:22 854534 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x3 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1372 Sep 26 16:12:22 854534 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x4 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1373 Sep 26 16:12:22 855534 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x5 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1374 Sep 26 16:12:22 855534 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x6 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1375 Sep 26 16:12:22 855534 [0004] 0x04 -> osm_pi_rcv_process: 
Discovered port num 0x7 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1376 Sep 26 16:12:22 856534 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x8 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1377 Sep 26 16:12:22 856534 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x9 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1378 Sep 26 16:12:22 856534 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0xA with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1379 Sep 26 16:12:22 857534 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0xB with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x137a Sep 26 16:12:22 857534 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0xC with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x137b Sep 26 16:12:22 857534 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0xD with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x137c Sep 26 16:12:22 858534 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0xE with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x137d Sep 26 16:12:22 858534 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0xF with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x137e Sep 26 16:12:22 858534 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x10 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x137f Sep 26 16:12:22 859534 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x11 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1380 Sep 26 16:12:22 859534 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x12 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1381 Sep 26 16:12:22 859534 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x13 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1382 Sep 26 
16:12:22 860533 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x14 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1383 Sep 26 16:12:22 860533 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x15 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1384 Sep 26 16:12:22 860533 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x16 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1385 Sep 26 16:12:22 861533 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x17 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1386 Sep 26 16:12:22 861533 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x18 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1387 Sep 26 16:12:22 862533 [0003] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c90200230708 TID 0x1388, discovered 0 times already Sep 26 16:12:22 862533 [0006] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c902002306e8 TID 0x1389, discovered 0 times already Sep 26 16:12:22 862533 [0004] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c9020024951c TID 0x138a, discovered 0 times already Sep 26 16:12:22 862533 [0005] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c902002307a0 TID 0x138b, discovered 0 times already Sep 26 16:12:22 863533 [0003] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 10 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x138d Sep 26 16:12:22 863533 [0006] 0x04 -> Received SMP on a 2 hop path: Initial path = 0,1,10 Return path = 0,1,1 Sep 26 16:12:22 863533 [0006] 0x04 -> __osm_ni_rcv_process_new: Discovered new Channel Adapter node, GUID 0x2c90200230754, TID 0x138c Sep 26 16:12:22 864533 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230709 for parent node GUID 0x2c90200230708, TID 0x138e Sep 26 
16:12:22 864533 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c902002306e9 for parent node GUID 0x2c902002306e8, TID 0x138f Sep 26 16:12:22 864533 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c9020024951d for parent node GUID 0x2c9020024951c, TID 0x1390 Sep 26 16:12:22 864533 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c902002307a1 for parent node GUID 0x2c902002307a0, TID 0x1391 Sep 26 16:12:22 864533 [0004] 0x04 -> __osm_nd_rcv_process_nd: Node 0x2c90200230754 Description = 8 Sep 26 16:12:22 865533 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230755 for parent node GUID 0x2c90200230754, TID 0x1393 Sep 26 16:12:22 866532 [0003] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 1 with GUID 0x2c90200230755 for parent node GUID 0x2c90200230754, TID 0x1394 Sep 26 16:12:22 866532 [0006] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:1 port_num 1 with GUID 0x2c90200230755 for parent node GUID 0x2c90200230754, TID 0x1395 Sep 26 16:12:22 866532 [0008] 0x04 -> do_sweep: ****************************************************************** ********************** HEAVY SWEEP COMPLETE ********************** ****************************************************************** Sep 26 16:12:22 866532 [0008] 0x04 -> osm_prtn_make_new: Duplicated partition definition: 'Default' (0x7fff) prev name 'Default'. Will use it Sep 26 16:12:22 866532 [0008] 0x04 -> osm_prtn_add_port: port 0x2c90200230775 already in partition 'Default' (0x7fff). Will overwrite Sep 26 16:12:22 866532 [0008] 0x04 -> osm_sa_db_file_load: sa db file name is not specifed. Skip restore Sep 26 16:12:22 866532 [0008] 0x04 -> __osm_lid_mgr_process_our_sm_node: Assigning SM's port 0x0002c90200230775 to LID range [1,1] Sep 26 16:12:22 866532 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230775, port 0x1 and port 0x00066a00d900083b, port 0x1. 
Using lower OP_VLS of 3 Sep 26 16:12:22 866532 [0008] 0x04 -> do_sweep: ****************************************************************** **** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG ***** ****************************************************************** Sep 26 16:12:22 866532 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c90200230709, LID [4,4] Sep 26 16:12:22 866532 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230709, port 0x1 and port 0x00066a00d900083b, port 0x2. Using lower OP_VLS of 3 Sep 26 16:12:22 866532 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c9020024951d, LID [3,3] Sep 26 16:12:22 866532 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c9020024951d, port 0x1 and port 0x00066a00d900083b, port 0x4. Using lower OP_VLS of 3 Sep 26 16:12:22 867532 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x00066a00d900083b, LID [6,6] Sep 26 16:12:22 867532 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c90200230755, LID [7,7] Sep 26 16:12:22 867532 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230755, port 0x1 and port 0x00066a00d900083b, port 0xA. Using lower OP_VLS of 3 Sep 26 16:12:22 867532 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c902002307a1, LID [2,2] Sep 26 16:12:22 867532 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002307a1, port 0x1 and port 0x00066a00d900083b, port 0x5. Using lower OP_VLS of 3 Sep 26 16:12:22 867532 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c902002306e9, LID [5,5] Sep 26 16:12:22 867532 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002306e9, port 0x1 and port 0x00066a00d900083b, port 0x3. 
Using lower OP_VLS of 3 Sep 26 16:12:22 868532 [0008] 0x04 -> do_sweep: ****************************************************************** ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG ***** ****************************************************************** Sep 26 16:12:22 868532 [0008] 0x04 -> osm_ucast_mgr_build_lid_matrices: Starting switches' Min Hop Table Assignment Sep 26 16:12:22 868532 [0008] 0x04 -> osm_ucast_mgr_process: LFT Tables configured on all switches Sep 26 16:12:22 868532 [0008] 0x04 -> do_sweep: ****************************************************************** **************** SWITCHES CONFIGURED FOR UNICAST ***************** ****************************************************************** Sep 26 16:12:22 869532 [0008] 0x04 -> __osm_mcast_mgr_build_spanning_tree: MLID 0xC000 has no members - nothing to do Sep 26 16:12:22 869532 [0008] 0x04 -> __osm_mcast_mgr_build_spanning_tree: MLID 0xC001 has no members - nothing to do Sep 26 16:12:22 869532 [0008] 0x04 -> do_sweep: ****************************************************************** *************** SWITCHES CONFIGURED FOR MULTICAST **************** ****************************************************************** Sep 26 16:12:22 869532 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230709, port 0x1 and port 0x00066a00d900083b, port 0x2. Using lower OP_VLS of 3 Sep 26 16:12:22 869532 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c9020024951d, port 0x1 and port 0x00066a00d900083b, port 0x4. Using lower OP_VLS of 3 Sep 26 16:12:22 869532 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x1 and port 0x0002c90200230775, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:22 869532 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x2 and port 0x0002c90200230709, port 0x1. 
Using lower OP_VLS of 3 Sep 26 16:12:22 869532 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x3 and port 0x0002c902002306e9, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:22 869532 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x4 and port 0x0002c9020024951d, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:22 869532 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x5 and port 0x0002c902002307a1, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:22 869532 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0xA and port 0x0002c90200230755, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:22 869532 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230755, port 0x1 and port 0x00066a00d900083b, port 0xA. Using lower OP_VLS of 3 Sep 26 16:12:22 869532 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230775, port 0x1 and port 0x00066a00d900083b, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:22 869532 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002307a1, port 0x1 and port 0x00066a00d900083b, port 0x5. Using lower OP_VLS of 3 Sep 26 16:12:22 869532 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002306e9, port 0x1 and port 0x00066a00d900083b, port 0x3. 
Using lower OP_VLS of 3
Sep 26 16:12:22 869532 [0008] 0x04 -> do_sweep:
******************************************************************
******* LINKS PORTS CONFIGURED - SET LINKS TO ARMED STATE ********
******************************************************************
Sep 26 16:12:22 870532 [0008] 0x04 -> do_sweep:
******************************************************************
************ LINKS ARMED - SET LINKS TO ACTIVE STATE *************
******************************************************************
Sep 26 16:12:23 141490 [0008] 0x04 -> __osm_state_mgr_up_msg:
******************************************************************
*************************** SUBNET UP ****************************
******************************************************************
Sep 26 16:12:23 891376 [0006] 0x04 -> Generic Notice dump:
    type.....................0x01
    prod_type................2 (Switch)
    trap_num.................128
    sw_lid...................6
Sep 26 16:12:23 891376 [0006] 0x04 -> __osm_trap_rcv_process_request: Forcing heavy sweep.
Received trap:128 Sep 26 16:12:23 891376 [0008] 0x04 -> __osm_state_mgr_sweep_hop_0: ****************************************************************** ********************* INITIATING HEAVY SWEEP ********************* ****************************************************************** Sep 26 16:12:23 892375 [0004] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c90200230774 TID 0x139d, discovered 0 times already Sep 26 16:12:23 892375 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230775 for parent node GUID 0x2c90200230774, TID 0x139e Sep 26 16:12:23 893375 [0003] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Switch node 0x66a00d900083b TID 0x139f, discovered 0 times already Sep 26 16:12:23 893375 [0006] 0x04 -> __osm_si_rcv_process_existing: discovery_count is:1 Sep 26 16:12:23 894375 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x0 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13a1 Sep 26 16:12:23 894375 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13a2 Sep 26 16:12:23 894375 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x2 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13a3 Sep 26 16:12:23 895375 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x3 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13a4 Sep 26 16:12:23 895375 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x4 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13a5 Sep 26 16:12:23 895375 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x5 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13a6 Sep 26 16:12:23 895375 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x6 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13a7 Sep 26 16:12:23 896375 [0006] 0x04 -> osm_pi_rcv_process: 
Discovered port num 0x7 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13a8 Sep 26 16:12:23 896375 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x8 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13a9 Sep 26 16:12:23 896375 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x9 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13aa Sep 26 16:12:23 897375 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0xA with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13ab Sep 26 16:12:23 897375 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0xB with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13ac Sep 26 16:12:23 897375 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0xC with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13ad Sep 26 16:12:23 898375 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0xD with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13ae Sep 26 16:12:23 898375 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0xE with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13af Sep 26 16:12:23 898375 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0xF with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13b0 Sep 26 16:12:23 899374 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x10 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13b1 Sep 26 16:12:23 899374 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x11 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13b2 Sep 26 16:12:23 899374 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x12 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13b3 Sep 26 16:12:23 900374 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x13 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13b4 Sep 26 
16:12:23 900374 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x14 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13b5 Sep 26 16:12:23 900374 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x15 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13b6 Sep 26 16:12:23 901374 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x16 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13b7 Sep 26 16:12:23 901374 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x17 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13b8 Sep 26 16:12:23 901374 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x18 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13b9 Sep 26 16:12:23 902374 [0005] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c90200230708 TID 0x13ba, discovered 0 times already Sep 26 16:12:23 902374 [0003] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c902002306e8 TID 0x13bb, discovered 0 times already Sep 26 16:12:23 902374 [0006] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c9020024951c TID 0x13bc, discovered 0 times already Sep 26 16:12:23 903374 [0004] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c902002307a0 TID 0x13bd, discovered 0 times already Sep 26 16:12:23 903374 [0005] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 7 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13bf Sep 26 16:12:23 903374 [0003] 0x04 -> Received SMP on a 2 hop path: Initial path = 0,1,7 Return path = 0,1,1 Sep 26 16:12:23 903374 [0003] 0x04 -> __osm_ni_rcv_process_new: Discovered new Channel Adapter node, GUID 0x2c90200230720, TID 0x13be Sep 26 16:12:23 904374 [0006] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c90200230754 TID 0x13c0, discovered 0 times already Sep 26 
16:12:23 904374 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230709 for parent node GUID 0x2c90200230708, TID 0x13c1 Sep 26 16:12:23 904374 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c902002306e9 for parent node GUID 0x2c902002306e8, TID 0x13c2 Sep 26 16:12:23 905374 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c9020024951d for parent node GUID 0x2c9020024951c, TID 0x13c3 Sep 26 16:12:23 905374 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c902002307a1 for parent node GUID 0x2c902002307a0, TID 0x13c4 Sep 26 16:12:23 905374 [0004] 0x04 -> __osm_nd_rcv_process_nd: Node 0x2c90200230720 Description = 5 Sep 26 16:12:23 905374 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230755 for parent node GUID 0x2c90200230754, TID 0x13c7 Sep 26 16:12:23 906373 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230721 for parent node GUID 0x2c90200230720, TID 0x13c6 Sep 26 16:12:23 906373 [0006] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 1 with GUID 0x2c90200230721 for parent node GUID 0x2c90200230720, TID 0x13c8 Sep 26 16:12:23 907373 [0004] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:1 port_num 1 with GUID 0x2c90200230721 for parent node GUID 0x2c90200230720, TID 0x13c9 Sep 26 16:12:23 907373 [0008] 0x04 -> do_sweep: ****************************************************************** ********************** HEAVY SWEEP COMPLETE ********************** ****************************************************************** Sep 26 16:12:23 907373 [0008] 0x04 -> osm_prtn_make_new: Duplicated partition definition: 'Default' (0x7fff) prev name 'Default'. Will use it Sep 26 16:12:23 907373 [0008] 0x04 -> osm_prtn_add_port: port 0x2c90200230775 already in partition 'Default' (0x7fff). Will overwrite Sep 26 16:12:23 907373 [0008] 0x04 -> osm_sa_db_file_load: sa db file name is not specifed. 
Skip restore Sep 26 16:12:23 907373 [0008] 0x04 -> __osm_lid_mgr_process_our_sm_node: Assigning SM's port 0x0002c90200230775 to LID range [1,1] Sep 26 16:12:23 907373 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230775, port 0x1 and port 0x00066a00d900083b, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:23 907373 [0008] 0x04 -> do_sweep: ****************************************************************** **** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG ***** ****************************************************************** Sep 26 16:12:23 907373 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c90200230709, LID [4,4] Sep 26 16:12:23 907373 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230709, port 0x1 and port 0x00066a00d900083b, port 0x2. Using lower OP_VLS of 3 Sep 26 16:12:23 907373 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c9020024951d, LID [3,3] Sep 26 16:12:23 907373 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c9020024951d, port 0x1 and port 0x00066a00d900083b, port 0x4. Using lower OP_VLS of 3 Sep 26 16:12:23 907373 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c90200230721, LID [8,8] Sep 26 16:12:23 907373 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230721, port 0x1 and port 0x00066a00d900083b, port 0x7. Using lower OP_VLS of 3 Sep 26 16:12:23 907373 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x00066a00d900083b, LID [6,6] Sep 26 16:12:23 907373 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c90200230755, LID [7,7] Sep 26 16:12:23 907373 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230755, port 0x1 and port 0x00066a00d900083b, port 0xA. 
Using lower OP_VLS of 3 Sep 26 16:12:23 907373 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c902002307a1, LID [2,2] Sep 26 16:12:23 907373 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002307a1, port 0x1 and port 0x00066a00d900083b, port 0x5. Using lower OP_VLS of 3 Sep 26 16:12:23 907373 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c902002306e9, LID [5,5] Sep 26 16:12:23 907373 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002306e9, port 0x1 and port 0x00066a00d900083b, port 0x3. Using lower OP_VLS of 3 Sep 26 16:12:23 908373 [0008] 0x04 -> do_sweep: ****************************************************************** ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG ***** ****************************************************************** Sep 26 16:12:23 908373 [0008] 0x04 -> osm_ucast_mgr_build_lid_matrices: Starting switches' Min Hop Table Assignment Sep 26 16:12:23 908373 [0008] 0x04 -> osm_ucast_mgr_process: LFT Tables configured on all switches Sep 26 16:12:23 909373 [0008] 0x04 -> do_sweep: ****************************************************************** **************** SWITCHES CONFIGURED FOR UNICAST ***************** ****************************************************************** Sep 26 16:12:23 909373 [0008] 0x04 -> __osm_mcast_mgr_build_spanning_tree: MLID 0xC000 has no members - nothing to do Sep 26 16:12:23 909373 [0008] 0x04 -> __osm_mcast_mgr_build_spanning_tree: MLID 0xC001 has no members - nothing to do Sep 26 16:12:23 909373 [0008] 0x04 -> do_sweep: ****************************************************************** *************** SWITCHES CONFIGURED FOR MULTICAST **************** ****************************************************************** Sep 26 16:12:23 909373 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230709, port 0x1 and port 0x00066a00d900083b, port 0x2. 
Using lower OP_VLS of 3 Sep 26 16:12:23 909373 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c9020024951d, port 0x1 and port 0x00066a00d900083b, port 0x4. Using lower OP_VLS of 3 Sep 26 16:12:23 909373 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230721, port 0x1 and port 0x00066a00d900083b, port 0x7. Using lower OP_VLS of 3 Sep 26 16:12:23 909373 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x1 and port 0x0002c90200230775, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:23 909373 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x2 and port 0x0002c90200230709, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:23 909373 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x3 and port 0x0002c902002306e9, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:23 909373 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x4 and port 0x0002c9020024951d, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:23 909373 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x5 and port 0x0002c902002307a1, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:23 909373 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x7 and port 0x0002c90200230721, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:23 909373 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0xA and port 0x0002c90200230755, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:23 909373 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230755, port 0x1 and port 0x00066a00d900083b, port 0xA. 
Using lower OP_VLS of 3
Sep 26 16:12:23 909373 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230775, port 0x1 and port 0x00066a00d900083b, port 0x1. Using lower OP_VLS of 3
Sep 26 16:12:23 909373 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002307a1, port 0x1 and port 0x00066a00d900083b, port 0x5. Using lower OP_VLS of 3
Sep 26 16:12:23 909373 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002306e9, port 0x1 and port 0x00066a00d900083b, port 0x3. Using lower OP_VLS of 3
Sep 26 16:12:23 909373 [0008] 0x04 -> do_sweep:
******************************************************************
******* LINKS PORTS CONFIGURED - SET LINKS TO ARMED STATE ********
******************************************************************
Sep 26 16:12:23 911373 [0008] 0x04 -> do_sweep:
******************************************************************
************ LINKS ARMED - SET LINKS TO ACTIVE STATE *************
******************************************************************
Sep 26 16:12:24 181331 [0008] 0x04 -> __osm_state_mgr_up_msg:
******************************************************************
*************************** SUBNET UP ****************************
******************************************************************
Sep 26 16:12:24 932216 [0004] 0x04 -> Generic Notice dump:
    type.....................0x01
    prod_type................2 (Switch)
    trap_num.................128
    sw_lid...................6
Sep 26 16:12:24 932216 [0004] 0x04 -> __osm_trap_rcv_process_request: Forcing heavy sweep.
Received trap:128 Sep 26 16:12:24 932216 [0008] 0x04 -> __osm_state_mgr_sweep_hop_0: ****************************************************************** ********************* INITIATING HEAVY SWEEP ********************* ****************************************************************** Sep 26 16:12:24 932216 [0005] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c90200230774 TID 0x13d1, discovered 0 times already Sep 26 16:12:24 933216 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230775 for parent node GUID 0x2c90200230774, TID 0x13d2 Sep 26 16:12:24 933216 [0006] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Switch node 0x66a00d900083b TID 0x13d3, discovered 0 times already Sep 26 16:12:24 933216 [0004] 0x04 -> __osm_si_rcv_process_existing: discovery_count is:1 Sep 26 16:12:24 934216 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x0 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13d5 Sep 26 16:12:24 934216 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13d6 Sep 26 16:12:24 935216 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x2 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13d7 Sep 26 16:12:24 935216 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x3 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13d8 Sep 26 16:12:24 935216 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x4 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13d9 Sep 26 16:12:24 936216 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x5 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13da Sep 26 16:12:24 936216 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x6 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13db Sep 26 16:12:24 936216 [0004] 0x04 -> osm_pi_rcv_process: 
Discovered port num 0x7 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13dc Sep 26 16:12:24 937216 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x8 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13dd Sep 26 16:12:24 937216 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x9 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13de Sep 26 16:12:24 937216 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0xA with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13df Sep 26 16:12:24 938215 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0xB with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13e0 Sep 26 16:12:24 938215 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0xC with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13e1 Sep 26 16:12:24 938215 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0xD with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13e2 Sep 26 16:12:24 938215 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0xE with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13e3 Sep 26 16:12:24 939215 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0xF with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13e4 Sep 26 16:12:24 939215 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x10 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13e5 Sep 26 16:12:24 940215 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x11 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13e6 Sep 26 16:12:24 940215 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x12 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13e7 Sep 26 16:12:24 940215 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x13 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13e8 Sep 26 
16:12:24 941215 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x14 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13e9 Sep 26 16:12:24 941215 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x15 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13ea Sep 26 16:12:24 941215 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x16 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13eb Sep 26 16:12:24 941215 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x17 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13ec Sep 26 16:12:24 942215 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x18 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13ed Sep 26 16:12:24 942215 [0003] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c90200230708 TID 0x13ee, discovered 0 times already Sep 26 16:12:24 943215 [0006] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c902002306e8 TID 0x13ef, discovered 0 times already Sep 26 16:12:24 943215 [0004] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c9020024951c TID 0x13f0, discovered 0 times already Sep 26 16:12:24 943215 [0005] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c902002307a0 TID 0x13f1, discovered 0 times already Sep 26 16:12:24 944215 [0003] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c90200230720 TID 0x13f2, discovered 0 times already Sep 26 16:12:24 944215 [0006] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 8 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x13f4 Sep 26 16:12:24 944215 [0004] 0x04 -> Received SMP on a 2 hop path: Initial path = 0,1,8 Return path = 0,1,1 Sep 26 16:12:24 944215 [0004] 0x04 -> __osm_ni_rcv_process_new: Discovered new Channel Adapter node, GUID 0x2c90200230714, TID 0x13f3 Sep 26 
16:12:24 945214 [0005] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c90200230754 TID 0x13f5, discovered 0 times already Sep 26 16:12:24 945214 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230709 for parent node GUID 0x2c90200230708, TID 0x13f6 Sep 26 16:12:24 945214 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c902002306e9 for parent node GUID 0x2c902002306e8, TID 0x13f7 Sep 26 16:12:24 945214 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c9020024951d for parent node GUID 0x2c9020024951c, TID 0x13f8 Sep 26 16:12:24 946214 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c902002307a1 for parent node GUID 0x2c902002307a0, TID 0x13f9 Sep 26 16:12:24 946214 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230721 for parent node GUID 0x2c90200230720, TID 0x13fa Sep 26 16:12:24 946214 [0006] 0x04 -> __osm_nd_rcv_process_nd: Node 0x2c90200230714 Description = 6 Sep 26 16:12:24 947214 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230715 for parent node GUID 0x2c90200230714, TID 0x13fc Sep 26 16:12:24 947214 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230755 for parent node GUID 0x2c90200230754, TID 0x13fd Sep 26 16:12:24 947214 [0003] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 1 with GUID 0x2c90200230715 for parent node GUID 0x2c90200230714, TID 0x13fe Sep 26 16:12:24 948214 [0006] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:1 port_num 1 with GUID 0x2c90200230715 for parent node GUID 0x2c90200230714, TID 0x13ff Sep 26 16:12:24 948214 [0008] 0x04 -> do_sweep: ****************************************************************** ********************** HEAVY SWEEP COMPLETE ********************** ****************************************************************** Sep 26 16:12:24 948214 [0008] 0x04 -> osm_prtn_make_new: 
Duplicated partition definition: 'Default' (0x7fff) prev name 'Default'. Will use it Sep 26 16:12:24 948214 [0008] 0x04 -> osm_prtn_add_port: port 0x2c90200230775 already in partition 'Default' (0x7fff). Will overwrite Sep 26 16:12:24 948214 [0008] 0x04 -> osm_sa_db_file_load: sa db file name is not specifed. Skip restore Sep 26 16:12:24 948214 [0008] 0x04 -> __osm_lid_mgr_process_our_sm_node: Assigning SM's port 0x0002c90200230775 to LID range [1,1] Sep 26 16:12:24 948214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230775, port 0x1 and port 0x00066a00d900083b, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:24 948214 [0008] 0x04 -> do_sweep: ****************************************************************** **** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG ***** ****************************************************************** Sep 26 16:12:24 948214 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c90200230709, LID [4,4] Sep 26 16:12:24 948214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230709, port 0x1 and port 0x00066a00d900083b, port 0x2. Using lower OP_VLS of 3 Sep 26 16:12:24 948214 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c90200230715, LID [9,9] Sep 26 16:12:24 948214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230715, port 0x1 and port 0x00066a00d900083b, port 0x8. Using lower OP_VLS of 3 Sep 26 16:12:24 948214 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c9020024951d, LID [3,3] Sep 26 16:12:24 948214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c9020024951d, port 0x1 and port 0x00066a00d900083b, port 0x4. 
Using lower OP_VLS of 3 Sep 26 16:12:24 948214 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c90200230721, LID [8,8] Sep 26 16:12:24 948214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230721, port 0x1 and port 0x00066a00d900083b, port 0x7. Using lower OP_VLS of 3 Sep 26 16:12:24 948214 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x00066a00d900083b, LID [6,6] Sep 26 16:12:24 948214 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c90200230755, LID [7,7] Sep 26 16:12:24 948214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230755, port 0x1 and port 0x00066a00d900083b, port 0xA. Using lower OP_VLS of 3 Sep 26 16:12:24 948214 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c902002307a1, LID [2,2] Sep 26 16:12:24 948214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002307a1, port 0x1 and port 0x00066a00d900083b, port 0x5. Using lower OP_VLS of 3 Sep 26 16:12:24 948214 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c902002306e9, LID [5,5] Sep 26 16:12:24 948214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002306e9, port 0x1 and port 0x00066a00d900083b, port 0x3. 
Using lower OP_VLS of 3 Sep 26 16:12:24 949214 [0008] 0x04 -> do_sweep: ****************************************************************** ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG ***** ****************************************************************** Sep 26 16:12:24 949214 [0008] 0x04 -> osm_ucast_mgr_build_lid_matrices: Starting switches' Min Hop Table Assignment Sep 26 16:12:24 949214 [0008] 0x04 -> osm_ucast_mgr_process: LFT Tables configured on all switches Sep 26 16:12:24 950214 [0008] 0x04 -> do_sweep: ****************************************************************** **************** SWITCHES CONFIGURED FOR UNICAST ***************** ****************************************************************** Sep 26 16:12:24 950214 [0008] 0x04 -> __osm_mcast_mgr_build_spanning_tree: MLID 0xC000 has no members - nothing to do Sep 26 16:12:24 950214 [0008] 0x04 -> __osm_mcast_mgr_build_spanning_tree: MLID 0xC001 has no members - nothing to do Sep 26 16:12:24 950214 [0008] 0x04 -> do_sweep: ****************************************************************** *************** SWITCHES CONFIGURED FOR MULTICAST **************** ****************************************************************** Sep 26 16:12:24 950214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230709, port 0x1 and port 0x00066a00d900083b, port 0x2. Using lower OP_VLS of 3 Sep 26 16:12:24 950214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230715, port 0x1 and port 0x00066a00d900083b, port 0x8. Using lower OP_VLS of 3 Sep 26 16:12:24 950214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c9020024951d, port 0x1 and port 0x00066a00d900083b, port 0x4. Using lower OP_VLS of 3 Sep 26 16:12:24 950214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230721, port 0x1 and port 0x00066a00d900083b, port 0x7. 
Using lower OP_VLS of 3 Sep 26 16:12:24 950214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x1 and port 0x0002c90200230775, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:24 950214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x2 and port 0x0002c90200230709, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:24 950214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x3 and port 0x0002c902002306e9, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:24 950214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x4 and port 0x0002c9020024951d, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:24 950214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x5 and port 0x0002c902002307a1, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:24 950214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x7 and port 0x0002c90200230721, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:24 950214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x8 and port 0x0002c90200230715, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:24 950214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0xA and port 0x0002c90200230755, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:24 950214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230755, port 0x1 and port 0x00066a00d900083b, port 0xA. Using lower OP_VLS of 3 Sep 26 16:12:24 950214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230775, port 0x1 and port 0x00066a00d900083b, port 0x1. 
Using lower OP_VLS of 3 Sep 26 16:12:24 950214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002307a1, port 0x1 and port 0x00066a00d900083b, port 0x5. Using lower OP_VLS of 3 Sep 26 16:12:24 950214 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002306e9, port 0x1 and port 0x00066a00d900083b, port 0x3. Using lower OP_VLS of 3 Sep 26 16:12:24 950214 [0008] 0x04 -> do_sweep: ****************************************************************** ******* LINKS PORTS CONFIGURED - SET LINKS TO ARMED STATE ******** ****************************************************************** Sep 26 16:12:24 952213 [0008] 0x04 -> do_sweep: ****************************************************************** ************ LINKS ARMED - SET LINKS TO ACTIVE STATE ************* ****************************************************************** Sep 26 16:12:25 223172 [0008] 0x04 -> __osm_state_mgr_up_msg: ****************************************************************** *************************** SUBNET UP **************************** ****************************************************************** Sep 26 16:12:25 366150 [0008] 0x04 -> __osm_state_mgr_light_sweep_start: ****************************************************************** ********************* INITIATING LIGHT SWEEP ********************* ****************************************************************** Sep 26 16:12:25 366150 [0008] 0x04 -> do_sweep: ****************************************************************** ********************** LIGHT SWEEP COMPLETE ********************** ****************************************************************** Sep 26 16:12:25 972057 [0004] 0x04 -> Generic Notice dump: type.....................0x01 prod_type................2 (Switch) trap_num.................128 sw_lid...................6 Sep 26 16:12:25 972057 [0004] 0x04 -> __osm_trap_rcv_process_request: Forcing heavy sweep. 
Received trap:128 Sep 26 16:12:25 972057 [0008] 0x04 -> __osm_state_mgr_sweep_hop_0: ****************************************************************** ********************* INITIATING HEAVY SWEEP ********************* ****************************************************************** Sep 26 16:12:25 972057 [0005] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c90200230774 TID 0x1408, discovered 0 times already Sep 26 16:12:25 973057 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230775 for parent node GUID 0x2c90200230774, TID 0x1409 Sep 26 16:12:25 973057 [0006] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Switch node 0x66a00d900083b TID 0x140a, discovered 0 times already Sep 26 16:12:25 974057 [0004] 0x04 -> __osm_si_rcv_process_existing: discovery_count is:1 Sep 26 16:12:25 974057 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x0 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x140c Sep 26 16:12:25 975057 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x140d Sep 26 16:12:25 975057 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x2 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x140e Sep 26 16:12:25 975057 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x3 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x140f Sep 26 16:12:25 976057 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x4 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1410 Sep 26 16:12:25 976057 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x5 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1411 Sep 26 16:12:25 976057 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x6 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1412 Sep 26 16:12:25 977056 [0004] 0x04 -> osm_pi_rcv_process: 
Discovered port num 0x7 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1413 Sep 26 16:12:25 977056 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x8 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1414 Sep 26 16:12:25 977056 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x9 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1415 Sep 26 16:12:25 977056 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0xA with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1416 Sep 26 16:12:25 978056 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0xB with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1417 Sep 26 16:12:25 978056 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0xC with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1418 Sep 26 16:12:25 978056 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0xD with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1419 Sep 26 16:12:25 979056 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0xE with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x141a Sep 26 16:12:25 979056 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0xF with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x141b Sep 26 16:12:25 979056 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x10 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x141c Sep 26 16:12:25 980056 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x11 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x141d Sep 26 16:12:25 980056 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x12 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x141e Sep 26 16:12:25 980056 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x13 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x141f Sep 26 
16:12:25 981056 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x14 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1420 Sep 26 16:12:25 981056 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x15 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1421 Sep 26 16:12:25 981056 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x16 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1422 Sep 26 16:12:25 982056 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x17 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1423 Sep 26 16:12:25 982056 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x18 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x1424 Sep 26 16:12:25 982056 [0003] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c90200230708 TID 0x1425, discovered 0 times already Sep 26 16:12:25 983056 [0006] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c902002306e8 TID 0x1426, discovered 0 times already Sep 26 16:12:25 983056 [0004] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c9020024951c TID 0x1427, discovered 0 times already Sep 26 16:12:25 983056 [0005] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c902002307a0 TID 0x1428, discovered 0 times already Sep 26 16:12:25 984055 [0003] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c90200230720 TID 0x1429, discovered 0 times already Sep 26 16:12:25 984055 [0006] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c90200230714 TID 0x142a, discovered 0 times already Sep 26 16:12:25 984055 [0004] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 9 with GUID 0x66a00d900083b for parent node GUID 0x66a00d900083b, TID 0x142c Sep 26 16:12:25 985055 [0005] 0x04 -> Received SMP on a 2 hop path: Initial path = 0,1,9 Return 
path = 0,1,1 Sep 26 16:12:25 985055 [0005] 0x04 -> __osm_ni_rcv_process_new: Discovered new Channel Adapter node, GUID 0x2c902002306f8, TID 0x142b Sep 26 16:12:25 985055 [0003] 0x04 -> __osm_ni_rcv_process_existing: Rediscovered Channel Adapter node 0x2c90200230754 TID 0x142d, discovered 0 times already Sep 26 16:12:25 985055 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230709 for parent node GUID 0x2c90200230708, TID 0x142e Sep 26 16:12:25 986055 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c902002306e9 for parent node GUID 0x2c902002306e8, TID 0x142f Sep 26 16:12:25 986055 [0005] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c9020024951d for parent node GUID 0x2c9020024951c, TID 0x1430 Sep 26 16:12:25 986055 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c902002307a1 for parent node GUID 0x2c902002307a0, TID 0x1431 Sep 26 16:12:25 987055 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230721 for parent node GUID 0x2c90200230720, TID 0x1432 Sep 26 16:12:25 987055 [0004] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230715 for parent node GUID 0x2c90200230714, TID 0x1433 Sep 26 16:12:25 987055 [0005] 0x04 -> __osm_nd_rcv_process_nd: Node 0x2c902002306f8 Description = 7 Sep 26 16:12:25 987055 [0003] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c902002306f9 for parent node GUID 0x2c902002306f8, TID 0x1435 Sep 26 16:12:25 988055 [0006] 0x04 -> osm_pi_rcv_process: Discovered port num 0x1 with GUID 0x2c90200230755 for parent node GUID 0x2c90200230754, TID 0x1436 Sep 26 16:12:25 988055 [0004] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:0 port_num 1 with GUID 0x2c902002306f9 for parent node GUID 0x2c902002306f8, TID 0x1437 Sep 26 16:12:25 989055 [0005] 0x04 -> osm_pkey_rcv_process: Got GetResp(PKey) block:1 port_num 1 with GUID 0x2c902002306f9 for parent node GUID 0x2c902002306f8, TID 
0x1438 Sep 26 16:12:25 989055 [0008] 0x04 -> do_sweep: ****************************************************************** ********************** HEAVY SWEEP COMPLETE ********************** ****************************************************************** Sep 26 16:12:25 989055 [0008] 0x04 -> osm_prtn_make_new: Duplicated partition definition: 'Default' (0x7fff) prev name 'Default'. Will use it Sep 26 16:12:25 989055 [0008] 0x04 -> osm_prtn_add_port: port 0x2c90200230775 already in partition 'Default' (0x7fff). Will overwrite Sep 26 16:12:25 989055 [0008] 0x04 -> osm_sa_db_file_load: sa db file name is not specifed. Skip restore Sep 26 16:12:25 989055 [0008] 0x04 -> __osm_lid_mgr_process_our_sm_node: Assigning SM's port 0x0002c90200230775 to LID range [1,1] Sep 26 16:12:25 989055 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230775, port 0x1 and port 0x00066a00d900083b, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:25 989055 [0008] 0x04 -> do_sweep: ****************************************************************** **** SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG ***** ****************************************************************** Sep 26 16:12:25 989055 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c90200230709, LID [4,4] Sep 26 16:12:25 989055 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230709, port 0x1 and port 0x00066a00d900083b, port 0x2. Using lower OP_VLS of 3 Sep 26 16:12:25 989055 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c90200230715, LID [9,9] Sep 26 16:12:25 989055 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230715, port 0x1 and port 0x00066a00d900083b, port 0x8. 
Using lower OP_VLS of 3 Sep 26 16:12:25 989055 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c9020024951d, LID [3,3] Sep 26 16:12:25 989055 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c9020024951d, port 0x1 and port 0x00066a00d900083b, port 0x4. Using lower OP_VLS of 3 Sep 26 16:12:25 989055 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c90200230721, LID [8,8] Sep 26 16:12:25 989055 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230721, port 0x1 and port 0x00066a00d900083b, port 0x7. Using lower OP_VLS of 3 Sep 26 16:12:25 989055 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x00066a00d900083b, LID [6,6] Sep 26 16:12:25 989055 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c90200230755, LID [7,7] Sep 26 16:12:25 989055 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230755, port 0x1 and port 0x00066a00d900083b, port 0xA. Using lower OP_VLS of 3 Sep 26 16:12:25 989055 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c902002307a1, LID [2,2] Sep 26 16:12:25 989055 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002307a1, port 0x1 and port 0x00066a00d900083b, port 0x5. Using lower OP_VLS of 3 Sep 26 16:12:25 989055 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c902002306e9, LID [5,5] Sep 26 16:12:25 989055 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002306e9, port 0x1 and port 0x00066a00d900083b, port 0x3. Using lower OP_VLS of 3 Sep 26 16:12:25 989055 [0008] 0x04 -> osm_lid_mgr_process_subnet: Assigned port 0x0002c902002306f9, LID [10,10] Sep 26 16:12:25 989055 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c902002306f9, port 0x1 and port 0x00066a00d900083b, port 0x9. 
Using lower OP_VLS of 3 Sep 26 16:12:25 991054 [0008] 0x04 -> do_sweep: ****************************************************************** ***** LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG ***** ****************************************************************** Sep 26 16:12:25 991054 [0008] 0x04 -> osm_ucast_mgr_build_lid_matrices: Starting switches' Min Hop Table Assignment Sep 26 16:12:25 991054 [0008] 0x04 -> osm_ucast_mgr_process: LFT Tables configured on all switches Sep 26 16:12:25 991054 [0008] 0x04 -> do_sweep: ****************************************************************** **************** SWITCHES CONFIGURED FOR UNICAST ***************** ****************************************************************** Sep 26 16:12:25 991054 [0008] 0x04 -> __osm_mcast_mgr_build_spanning_tree: MLID 0xC000 has no members - nothing to do Sep 26 16:12:25 991054 [0008] 0x04 -> __osm_mcast_mgr_build_spanning_tree: MLID 0xC001 has no members - nothing to do Sep 26 16:12:25 991054 [0008] 0x04 -> do_sweep: ****************************************************************** *************** SWITCHES CONFIGURED FOR MULTICAST **************** ****************************************************************** Sep 26 16:12:25 991054 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230709, port 0x1 and port 0x00066a00d900083b, port 0x2. Using lower OP_VLS of 3 Sep 26 16:12:25 991054 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230715, port 0x1 and port 0x00066a00d900083b, port 0x8. Using lower OP_VLS of 3 Sep 26 16:12:25 991054 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c9020024951d, port 0x1 and port 0x00066a00d900083b, port 0x4. Using lower OP_VLS of 3 Sep 26 16:12:25 991054 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230721, port 0x1 and port 0x00066a00d900083b, port 0x7. 
Using lower OP_VLS of 3 Sep 26 16:12:25 991054 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x1 and port 0x0002c90200230775, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:25 991054 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x2 and port 0x0002c90200230709, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:25 991054 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x3 and port 0x0002c902002306e9, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:25 992054 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x4 and port 0x0002c9020024951d, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:25 992054 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x5 and port 0x0002c902002307a1, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:25 992054 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x7 and port 0x0002c90200230721, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:25 992054 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x8 and port 0x0002c90200230715, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:25 992054 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0x9 and port 0x0002c902002306f9, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:25 992054 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00066a00d900083b, port 0xA and port 0x0002c90200230755, port 0x1. Using lower OP_VLS of 3 Sep 26 16:12:25 992054 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x0002c90200230755, port 0x1 and port 0x00066a00d900083b, port 0xA. 
Using lower OP_VLS of 3
Sep 26 16:12:25 992054 [0008] 0x04 -> osm_physp_calc_link_op_vls: OP_VLS mismatch between ports. Port 0x00 [...]
[...] Port 0x0002c902002306f9, port 0x1 and port 0x00066a00d900083b, port 0x9. Using lower OP_VLS of 3
Sep 26 16:12:25 992054 [0008] 0x04 -> do_sweep:
******************************************************************
******* LINKS PORTS CONFIGURED - SET LINKS TO ARMED STATE ********
******************************************************************
Sep 26 16:12:25 993054 [0008] 0x04 -> do_sweep:
******************************************************************
************ LINKS ARMED - SET LINKS TO ACTIVE STATE *************
******************************************************************
Sep 26 16:12:26 265012 [0008] 0x04 -> __osm_state_mgr_up_msg:
******************************************************************
*************************** SUBNET UP ****************************
******************************************************************
Sep 26 16:12:35 366620 [0008] 0x04 -> __osm_state_mgr_light_sweep_start:
******************************************************************
********************* INITIATING LIGHT SWEEP *********************
******************************************************************
Sep 26 16:12:35 366620 [0008] 0x04 -> do_sweep:
******************************************************************
********************** LIGHT SWEEP COMPLETE **********************
******************************************************************
Sep 26 16:12:45 367089 [0008] 0x04 -> __osm_state_mgr_light_sweep_start:
******************************************************************
********************* INITIATING LIGHT SWEEP *********************
******************************************************************
Sep 26 16:12:45 367089 [0008] 0x04 -> do_sweep:
******************************************************************
********************** LIGHT SWEEP COMPLETE **********************
******************************************************************
Sep 26 16:12:55 367559 [0008] 0x04 -> __osm_state_mgr_light_sweep_start:
******************************************************************
********************* INITIATING LIGHT SWEEP *********************
******************************************************************
Sep 26 16:12:55 367559 [0008] 0x04 -> do_sweep:
******************************************************************
********************** LIGHT SWEEP COMPLETE **********************
******************************************************************
Sep 26 16:13:05 368029 [0008] 0x04 -> __osm_state_mgr_light_sweep_start:
******************************************************************
********************* INITIATING LIGHT SWEEP *********************
******************************************************************
Sep 26 16:13:05 368029 [0008] 0x04 -> do_sweep:
******************************************************************
********************** LIGHT SWEEP COMPLETE **********************
******************************************************************
Sep 26 16:13:15 368498 [0008] 0x04 -> __osm_state_mgr_light_sweep_start:
******************************************************************
********************* INITIATING LIGHT SWEEP *********************
******************************************************************
Sep 26 16:13:15 368498 [0008] 0x04 -> do_sweep:
******************************************************************
********************** LIGHT SWEEP COMPLETE **********************
******************************************************************
Sep 26 16:13:25 368968 [0008] 0x04 -> __osm_state_mgr_light_sweep_start:
******************************************************************
********************* INITIATING LIGHT SWEEP *********************
******************************************************************
Sep 26 16:13:25 368968 [0008] 0x04 -> do_sweep:
******************************************************************
********************** LIGHT SWEEP COMPLETE **********************
******************************************************************
Sep 26 16:13:35 369438 [0008] 0x04 -> __osm_state_mgr_light_sweep_start:
******************************************************************
********************* INITIATING LIGHT SWEEP *********************
******************************************************************
Sep 26 16:13:35 369438 [0008] 0x04 -> do_sweep:
******************************************************************
********************** LIGHT SWEEP COMPLETE **********************
******************************************************************
Sep 26 16:13:45 369908 [0008] 0x04 -> __osm_state_mgr_light_sweep_start:
******************************************************************
********************* INITIATING LIGHT SWEEP *********************
******************************************************************
Sep 26 16:13:45 369908 [0008] 0x04 -> do_sweep:
******************************************************************
********************** LIGHT SWEEP COMPLETE **********************
******************************************************************
Sep 26 16:13:55 370377 [0008] 0x04 -> __osm_state_mgr_light_sweep_start:
******************************************************************
********************* INITIATING LIGHT SWEEP *********************
******************************************************************
Sep 26 16:13:55 370377 [0008] 0x04 -> do_sweep:
******************************************************************
********************** LIGHT SWEEP COMPLETE **********************
******************************************************************
Sep 26 16:14:05 370847 [0008] 0x04 -> __osm_state_mgr_light_sweep_start:
******************************************************************
********************* INITIATING LIGHT SWEEP *********************
******************************************************************
Sep 26 16:14:05 370847 [0008] 0x04 -> do_sweep:
******************************************************************
********************** LIGHT SWEEP COMPLETE **********************
******************************************************************
Sep 26 16:14:15 371317 [0008] 0x04 -> __osm_state_mgr_light_sweep_start:
******************************************************************
********************* INITIATING LIGHT SWEEP *********************
******************************************************************
Sep 26 16:14:15 371317 [0008] 0x04 -> do_sweep:
******************************************************************
********************** LIGHT SWEEP COMPLETE **********************
******************************************************************
Sep 26 16:14:25 371786 [0008] 0x04 -> __osm_state_mgr_light_sweep_start:
******************************************************************
********************* INITIATING LIGHT SWEEP *********************
******************************************************************
Sep 26 16:14:25 371786 [0008] 0x04 -> do_sweep:
******************************************************************
********************** LIGHT SWEEP COMPLETE **********************
******************************************************************
Sep 26 16:14:35 372256 [0008] 0x04 -> __osm_state_mgr_light_sweep_start:
******************************************************************
********************* INITIATING LIGHT SWEEP *********************
******************************************************************
Sep 26 16:14:35 372256 [0008] 0x04 -> do_sweep:
******************************************************************
********************** LIGHT SWEEP COMPLETE **********************
******************************************************************
Sep 26 16:14:45 372726 [0008] 0x04 -> __osm_state_mgr_light_sweep_start:
******************************************************************
********************* INITIATING LIGHT SWEEP *********************
******************************************************************
Sep 26 16:14:45 372726 [0008] 0x04 -> do_sweep:
******************************************************************
********************** LIGHT SWEEP COMPLETE **********************
******************************************************************
Sep 26 16:14:55 373195 [0008] 0x04 -> __osm_state_mgr_light_sweep_start:
******************************************************************
********************* INITIATING LIGHT SWEEP *********************
******************************************************************
Sep 26 16:14:55 373195 [0008] 0x04 -> do_sweep:
******************************************************************
********************** LIGHT SWEEP COMPLETE **********************
******************************************************************
Sep 26 16:15:05 373665 [0008] 0x04 -> __osm_state_mgr_light_sweep_start:
******************************************************************
********************* INITIATING LIGHT SWEEP *********************
******************************************************************
Sep 26 16:15:05 373665 [0008] 0x04 -> do_sweep:
******************************************************************
********************** LIGHT SWEEP COMPLETE **********************
******************************************************************
Sep 26 16:15:15 374135 [0008] 0x04 -> __osm_state_mgr_light_sweep_start:
**************************************************#

Sasha Khapyorsky
10/03/2008 10:51 AM

To Yicheng Jia
cc general at lists.openfabrics.org
Subject Re: [ofa-general] questions about opensm and unmanaged switch

On 18:50 Fri 03 Oct , Sasha Khapyorsky wrote:
> On 17:14 Thu 02 Oct , Yicheng Jia wrote:
> >
> > The error I got are "osm_db_store: ERR 6109: Failed to remove
> > file:/tmp//guid2lid" and "osm_db_store: ERR 6108: Failed to rename the db
> > file to:/tmp//guid2lid". I set "OSM_DEFAULT_CACHE_DIR" to "/tmp/".

And what is the error status printed there?

Sasha

_____________________________________________________________________________
Scanned by IBM Email Security Management Services powered by MessageLabs. For more information please visit http://www.ers.ibm.com
_____________________________________________________________________________
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: opensm_bootup_verbose.txt
URL:

From rdreier at cisco.com Fri Oct 3 09:54:56 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 03 Oct 2008 09:54:56 -0700
Subject: [ofa-general] Re: [PATCH 3/4] RDMA/nes: Fix routed RDMA connection
In-Reply-To: <200810021439.m92EdUY3005503@velma.neteffect.com> (Chien Tung's message of "Thu, 2 Oct 2008 09:39:30 -0500")
References: <200810021439.m92EdUY3005503@velma.neteffect.com>
Message-ID:

thanks, applied

From rdreier at cisco.com Fri Oct 3 09:55:44 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 03 Oct 2008 09:55:44 -0700
Subject: [ofa-general] Re: [PATCH 4/4] RDMA/nes: Clear cm_id only when done with cm_node
In-Reply-To: <200810021439.m92EdUfG005505@velma.neteffect.com> (Chien Tung's message of "Thu, 2 Oct 2008 09:39:30 -0500")
References: <200810021439.m92EdUfG005505@velma.neteffect.com>
Message-ID:

> Clear cm_node->cm_id only when we are really done with it.

I'd like a better changelog for this. What problem is this fixing?
What happens that's bad if we clear cm_node->cm_id the way the current
code does?

- R.

From arlin.r.davis at intel.com Fri Oct 3 11:04:01 2008
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Fri, 3 Oct 2008 11:04:01 -0700
Subject: [ofa-general] [PATCH 1/3][v1.2] dtest: reduce default IOV's during dat_ep_create for iWARP devices
Message-ID: <000001c92582$6b8b0960$d662fe0a@amr.corp.intel.com>

Patch set to fix some minor issues found at the OFA interop event.

iWarp adapters tend to have fewer IOV resources than IB adapters.

Signed-off-by: Arlin Davis
---
 test/dtest/dtest.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c
index a93f878..0255a90 100755
--- a/test/dtest/dtest.c
+++ b/test/dtest/dtest.c
@@ -94,7 +94,7 @@ static DAT_VADDR registered_addr_recv;
 /* Initial msg receive buf, RMR exchange, and Rdma-write notification */
 #define MSG_BUF_COUNT 3
-#define MSG_IOV_COUNT 4
+#define MSG_IOV_COUNT 2
 static DAT_RMR_TRIPLET rmr_recv_msg[MSG_BUF_COUNT];
 static DAT_LMR_HANDLE h_lmr_recv_msg = DAT_HANDLE_NULL;
 static DAT_LMR_CONTEXT lmr_context_recv_msg;
--
1.5.2.5

From arlin.r.davis at intel.com Fri Oct 3 11:04:05 2008
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Fri, 3 Oct 2008 11:04:05 -0700
Subject: [ofa-general] [PATCH 2/3][v1.2] dapl: adjust max_rdma_read_iov to 1 for query on iWARP devices
Message-ID: <000201c92582$6cfee4b0$d662fe0a@amr.corp.intel.com>

iWarp spec allows only one iov on rdma reads

Signed-off-by: Arlin Davis
---
 dapl/openib_cma/dapl_ib_util.c | 17 +++++++++++++++--
 1 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/dapl/openib_cma/dapl_ib_util.c b/dapl/openib_cma/dapl_ib_util.c
index afb7463..598afab 100755
--- a/dapl/openib_cma/dapl_ib_util.c
+++ b/dapl/openib_cma/dapl_ib_util.c
@@ -506,7 +506,14 @@ DAT_RETURN dapls_ib_query_hca(IN DAPL_HCA *hca_ptr,
     ia_attr->transport_attr = NULL;
     ia_attr->num_vendor_attr = 0;
     ia_attr->vendor_attr = NULL;
-    ia_attr->max_iov_segments_per_rdma_read = dev_attr.max_sge;
+    /* iWARP spec. - 1 sge for RDMA reads */
+    if (hca_ptr->ib_hca_handle->device->transport_type
+        == IBV_TRANSPORT_IWARP)
+        ia_attr->max_iov_segments_per_rdma_read = 1;
+    else
+        ia_attr->max_iov_segments_per_rdma_read =
+            dev_attr.max_sge;
+
     ia_attr->max_iov_segments_per_rdma_write = dev_attr.max_sge;
     /* save rd_atom for peer validation during connect requests */
     hca_ptr->ib_trans.max_rdma_rd_in = dev_attr.max_qp_rd_atom;
@@ -538,7 +545,13 @@ DAT_RETURN dapls_ib_query_hca(IN DAPL_HCA *hca_ptr,
     ep_attr->max_request_iov = dev_attr.max_sge;
     ep_attr->max_rdma_read_in = dev_attr.max_qp_rd_atom;
     ep_attr->max_rdma_read_out= dev_attr.max_qp_init_rd_atom;
-    ep_attr->max_rdma_read_iov= dev_attr.max_sge;
+    /* iWARP spec. - 1 sge for RDMA reads */
+    if (hca_ptr->ib_hca_handle->device->transport_type
+        == IBV_TRANSPORT_IWARP)
+        ep_attr->max_rdma_read_iov = 1;
+    else
+        ep_attr->max_rdma_read_iov = dev_attr.max_sge;
+
     ep_attr->max_rdma_write_iov= dev_attr.max_sge;
     dapl_log(DAPL_DBG_TYPE_UTIL,
         "dapl_query_hca: MAX msg %llu dto %d iov %d"
--
1.5.2.5

From arlin.r.davis at intel.com Fri Oct 3 11:04:05 2008
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Fri, 3 Oct 2008 11:04:05 -0700
Subject: [ofa-general] [PATCH 3/3][v1.2] dat.conf: add OpenIB-iwarp entry for iwarp devices
Message-ID: <000101c92582$6c9d15a0$d662fe0a@amr.corp.intel.com>

Signed-off-by: Arlin Davis
---
 Makefile.am | 3 ++-
 dapl.spec.in | 1 +
 2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/Makefile.am b/Makefile.am
index 1dd996c..5f39ece 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -396,7 +396,8 @@ install-exec-hook:
     echo OpenIB-mthca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 '"mthca0 1" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
     echo OpenIB-mthca0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 '"mthca0 2" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
     echo OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 '"mlx4_0 1" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
-    echo OpenIB-mlx4_0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 '"mlx4_0 2" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf;
+    echo OpenIB-mlx4_0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 '"mlx4_0 2" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
+    echo OpenIB-iwarp u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 '"eth2 0" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf;

 uninstall-hook:
     if test -e $(DESTDIR)$(sysconfdir)/dat.conf; then \
diff --git a/dapl.spec.in b/dapl.spec.in
index b3d103e..241fabc 100644
--- a/dapl.spec.in
+++ b/dapl.spec.in
@@ -100,6 +100,7 @@ echo OpenIB-mthca0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 '"mthca
 echo OpenIB-mthca0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 '"mthca0 2" ""' >> %{_sysconfdir}/dat.conf
 echo OpenIB-mlx4_0-1 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 '"mlx4_0 1" ""' >> %{_sysconfdir}/dat.conf
 echo OpenIB-mlx4_0-2 u1.2 nonthreadsafe default libdaplscm.so.1 dapl.1.2 '"mlx4_0 2" ""' >> %{_sysconfdir}/dat.conf
+echo OpenIB-iwarp u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 '"eth2 0" ""' >> %{_sysconfdir}/dat.conf

 %postun
 /sbin/ldconfig
--
1.5.2.5

From arlin.r.davis at intel.com Fri Oct 3 11:09:05 2008
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Fri, 3 Oct 2008 11:09:05 -0700
Subject: [ofa-general] [PATCH 1/4][v2.0] dtest: fix 32-bit build issues in dtest and dtestx examples.
Message-ID: <000301c92583$1ffbd0f0$d662fe0a@amr.corp.intel.com>

Patch set for v2.0 to fix some minor issues found at the OFA interop event.

Signed-off-by: Arlin Davis
---
 test/dtest/dtest.c | 24 ++++++++++++------------
 test/dtest/dtestx.c | 28 ++++++++++++++--------------
 2 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c
index 00d14e3..55f325b 100755
--- a/test/dtest/dtest.c
+++ b/test/dtest/dtest.c
@@ -92,14 +92,14 @@
     (((uint32_t)(x) & 0xFF000000) >> 24))
 #define hton32(x) ntoh32(x)
 #define ntoh64(x) (uint64_t)( \
-    (((uint64_t)x & 0x00000000000000FF) << 56) | \
-    (((uint64_t)x & 0x000000000000FF00) << 40) | \
-    (((uint64_t)x & 0x0000000000FF0000) << 24) | \
-    (((uint64_t)x & 0x00000000FF000000) << 8 ) | \
-    (((uint64_t)x & 0x000000FF00000000) >> 8 ) | \
-    (((uint64_t)x & 0x0000FF0000000000) >> 24) | \
-    (((uint64_t)x & 0x00FF000000000000) >> 40) | \
-    (((uint64_t)x & 0xFF00000000000000) >> 56))
+    (((uint64_t)x & 0x00000000000000FFULL) << 56) | \
+    (((uint64_t)x & 0x000000000000FF00ULL) << 40) | \
+    (((uint64_t)x & 0x0000000000FF0000ULL) << 24) | \
+    (((uint64_t)x & 0x00000000FF000000ULL) << 8 ) | \
+    (((uint64_t)x & 0x000000FF00000000ULL) >> 8 ) | \
+    (((uint64_t)x & 0x0000FF0000000000ULL) >> 24) | \
+    (((uint64_t)x & 0x00FF000000000000ULL) >> 40) | \
+    (((uint64_t)x & 0xFF00000000000000ULL) >> 56))
 #define hton64(x) ntoh64(x)
 #elif __BYTE_ORDER == __BIG_ENDIAN
 #define hton16(x) (x)
@@ -1029,7 +1029,7 @@ connect_ep( char *hostname, DAT_CONN_QUAL conn_id )
     /*
      * Setup our remote memory and tell the other side about it
      */
-    rmr_send_msg.virtual_address = hton64((DAT_VADDR)rbuf);
+    rmr_send_msg.virtual_address = hton64((DAT_VADDR)(uintptr_t)rbuf);
     rmr_send_msg.segment_length = hton32(RDMA_BUFFER_SIZE);
     rmr_send_msg.rmr_context = hton32(rmr_context_recv);

@@ -1230,7 +1230,7 @@ do_rdma_write_with_msg( void )
     for (i=0;i> 24))
 #define hton32(x) ntoh32(x)
 #define ntoh64(x) (uint64_t)( \
-    (((uint64_t)x & 0x00000000000000FF) << 56) | \
-    (((uint64_t)x & 0x000000000000FF00) << 40) | \
-    (((uint64_t)x & 0x0000000000FF0000) << 24) | \
-    (((uint64_t)x & 0x00000000FF000000) << 8 ) | \
-    (((uint64_t)x & 0x000000FF00000000) >> 8 ) | \
-    (((uint64_t)x & 0x0000FF0000000000) >> 24) | \
-    (((uint64_t)x & 0x00FF000000000000) >> 40) | \
-    (((uint64_t)x & 0xFF00000000000000) >> 56))
+    (((uint64_t)x & 0x00000000000000FFULL) << 56) | \
+    (((uint64_t)x & 0x000000000000FF00ULL) << 40) | \
+    (((uint64_t)x & 0x0000000000FF0000ULL) << 24) | \
+    (((uint64_t)x & 0x00000000FF000000ULL) << 8 ) | \
+    (((uint64_t)x & 0x000000FF00000000ULL) >> 8 ) | \
+    (((uint64_t)x & 0x0000FF0000000000ULL) >> 24) | \
+    (((uint64_t)x & 0x00FF000000000000ULL) >> 40) | \
+    (((uint64_t)x & 0xFF00000000000000ULL) >> 56))
 #define hton64(x) ntoh64(x)
 #elif __BYTE_ORDER == __BIG_ENDIAN
 #define hton16(x) (x)
@@ -216,7 +216,7 @@ send_msg(
     &event.event_data.dto_completion_event_data;

     iov.lmr_context = context;
-    iov.virtual_address = (DAT_VADDR)data;
+    iov.virtual_address = (DAT_VADDR)(uintptr_t)data;
     iov.segment_length = (DAT_VLEN)size;

     for (i=0;irmr_context = hton32(rmr_context[RCV_RDMA_BUF_INDEX]);
-    r_iov->virtual_address = hton64((DAT_VADDR)buf[RCV_RDMA_BUF_INDEX]);
+    r_iov->virtual_address = hton64((DAT_VADDR)(uintptr_t)buf[RCV_RDMA_BUF_INDEX]);
     r_iov->segment_length = hton32(buf_size);

     printf("Send RMR message: r_key_ctx=0x%x,va="F64x",len=0x%x\n",
@@ -781,7 +781,7 @@ do_immediate()
     r_iov = *buf[RECV_BUF_INDEX];

     iov.lmr_context = lmr_context[SND_RDMA_BUF_INDEX];
-    iov.virtual_address = (DAT_VADDR) buf[SND_RDMA_BUF_INDEX];
+    iov.virtual_address = (DAT_VADDR)(uintptr_t)buf[SND_RDMA_BUF_INDEX];
     iov.segment_length = buf_size;

     cookie.as_64 = 0x9999;
@@ -939,7 +939,7 @@ do_cmp_swap()
     r_iov = *buf[ RECV_BUF_INDEX ];

     l_iov.lmr_context = lmr_atomic_context;
-    l_iov.virtual_address = (DAT_UINT64)atomic_buf;
+    l_iov.virtual_address = (DAT_UINT64)(uintptr_t)atomic_buf;
     l_iov.segment_length = BUF_SIZE_ATOMIC;

     cookie.as_64 = 3333;
@@ -1040,7 +1040,7 @@ do_fetch_add()
     r_iov = *buf[ RECV_BUF_INDEX ];

     l_iov.lmr_context = lmr_atomic_context;
-    l_iov.virtual_address = (DAT_UINT64)atomic_buf;
+    l_iov.virtual_address = (DAT_UINT64)(uintptr_t)atomic_buf;
     l_iov.segment_length = BUF_SIZE_ATOMIC;

     cookie.as_64 = 0x7777;
--
1.5.2.5

From arlin.r.davis at intel.com Fri Oct 3 11:09:08 2008
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Fri, 3 Oct 2008 11:09:08 -0700
Subject: [ofa-general] [PATCH 3/4][v2.0] dapl: adjust max_rdma_read_iov to 1 for query on iWARP devices
Message-ID: <000501c92583$21639e50$d662fe0a@amr.corp.intel.com>

iWarp spec allows only one iov on rdma reads

Signed-off-by: Arlin Davis
---
 dapl/openib_cma/dapl_ib_util.c | 17 +++++++++++++++--
 1 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/dapl/openib_cma/dapl_ib_util.c b/dapl/openib_cma/dapl_ib_util.c
index 72d8237..36b534e 100755
--- a/dapl/openib_cma/dapl_ib_util.c
+++ b/dapl/openib_cma/dapl_ib_util.c
@@ -502,7 +502,14 @@ DAT_RETURN dapls_ib_query_hca(IN DAPL_HCA *hca_ptr,
     ia_attr->transport_attr = NULL;
     ia_attr->num_vendor_attr = 0;
     ia_attr->vendor_attr = NULL;
-    ia_attr->max_iov_segments_per_rdma_read = dev_attr.max_sge;
+    /* iWARP spec. - 1 sge for RDMA reads */
+    if (hca_ptr->ib_hca_handle->device->transport_type
+        == IBV_TRANSPORT_IWARP)
+        ia_attr->max_iov_segments_per_rdma_read = 1;
+    else
+        ia_attr->max_iov_segments_per_rdma_read =
+            dev_attr.max_sge;
+
     ia_attr->max_iov_segments_per_rdma_write = dev_attr.max_sge;
     /* save rd_atom for peer validation during connect requests */
     hca_ptr->ib_trans.max_rdma_rd_in = dev_attr.max_qp_rd_atom;
@@ -537,7 +544,13 @@ DAT_RETURN dapls_ib_query_hca(IN DAPL_HCA *hca_ptr,
     ep_attr->max_request_iov = dev_attr.max_sge;
     ep_attr->max_rdma_read_in = dev_attr.max_qp_rd_atom;
     ep_attr->max_rdma_read_out= dev_attr.max_qp_init_rd_atom;
-    ep_attr->max_rdma_read_iov= dev_attr.max_sge;
+    /* iWARP spec. - 1 sge for RDMA reads */
+    if (hca_ptr->ib_hca_handle->device->transport_type
+        == IBV_TRANSPORT_IWARP)
+        ep_attr->max_rdma_read_iov = 1;
+    else
+        ep_attr->max_rdma_read_iov = dev_attr.max_sge;
+
     ep_attr->max_rdma_write_iov= dev_attr.max_sge;
     dapl_log(DAPL_DBG_TYPE_UTIL,
         "dapl_query_hca: MAX msg %llu dto %d iov %d"
--
1.5.2.5

From arlin.r.davis at intel.com Fri Oct 3 11:09:06 2008
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Fri, 3 Oct 2008 11:09:06 -0700
Subject: [ofa-general] [PATCH 2/4][v2.0] dtest: reduce default IOV's during dat_ep_create for iWARP devices
Message-ID: <000401c92583$20708bc0$d662fe0a@amr.corp.intel.com>

iWarp adapters tend to have fewer IOV resources than IB adapters.

Signed-off-by: Arlin Davis
---
 test/dtest/dtest.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c
index 55f325b..37edd6c 100755
--- a/test/dtest/dtest.c
+++ b/test/dtest/dtest.c
@@ -156,7 +156,7 @@ static DAT_VADDR registered_addr_recv;
 /* Initial msg receive buf, RMR exchange, and Rdma-write notification */
 #define MSG_BUF_COUNT 3
-#define MSG_IOV_COUNT 4
+#define MSG_IOV_COUNT 2
 static DAT_RMR_TRIPLET rmr_recv_msg[MSG_BUF_COUNT];
 static DAT_LMR_HANDLE h_lmr_recv_msg = DAT_HANDLE_NULL;
 static DAT_LMR_CONTEXT lmr_context_recv_msg;
--
1.5.2.5

From arlin.r.davis at intel.com Fri Oct 3 11:09:10 2008
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Fri, 3 Oct 2008 11:09:10 -0700
Subject: [ofa-general] [PATCH 4/4][v2.0] dat.conf: add ofa-v2-iwarp entry for iwarp devices
Message-ID: <000601c92583$228b2e60$d662fe0a@amr.corp.intel.com>

Changes to install/uninstall hooks for the new ofa-v2-iwarp entry

Signed-off-by: Arlin Davis
---
 Makefile.am | 7 ++++---
 dapl.spec.in | 1 +
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/Makefile.am b/Makefile.am
index 8afa666..4929f83 100755
--- a/Makefile.am
+++ b/Makefile.am
@@ -412,10 +412,11 @@ install-exec-hook:
     echo ofa-v2-mthca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"mthca0 1" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
     echo ofa-v2-mthca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"mthca0 2" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
     echo ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"mlx4_0 1" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
-    echo ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"mlx4_0 2" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf;
+    echo ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"mlx4_0 2" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
     echo ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"ipath0 1" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
-    echo ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"ipath0 2" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf;
-    echo ofa-v2-ehca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"ehca0 1" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf;
+    echo ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"ipath0 2" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
+    echo ofa-v2-ehca0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"ehca0 1" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf; \
+    echo ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 '"eth2 0" ""' >> $(DESTDIR)$(sysconfdir)/dat.conf;

 uninstall-hook:
     if test -e $(DESTDIR)$(sysconfdir)/dat.conf; then \
diff --git a/dapl.spec.in b/dapl.spec.in
index ce39cd9..27a369b 100644
--- a/dapl.spec.in
+++ b/dapl.spec.in
@@ -103,6 +103,7 @@ echo ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"mlx4
 echo ofa-v2-ipath0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"ipath0 1" ""' >> %{_sysconfdir}/dat.conf
 echo ofa-v2-ipath0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"ipath0 2" ""' >> %{_sysconfdir}/dat.conf
 echo ofa-v2-ehca0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 '"ehca0 1" ""' >> %{_sysconfdir}/dat.conf
+echo ofa-v2-iwarp u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 '"eth2 0" ""' >> %{_sysconfdir}/dat.conf

 %postun
 /sbin/ldconfig
--
1.5.2.5

From cameron at harr.org Fri Oct 3 11:09:36 2008
From: cameron at harr.org (Cameron Harr)
Date: Fri, 03 Oct 2008 12:09:36 -0600
Subject: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <48E6498A.3070002@mellanox.com>
References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com>
Message-ID: <48E65FE0.2060602@harr.org>

Vu Pham wrote:
>
>> Alternatively, is there anything in the SCST layer I should tweak. I'm
>> still running rev 245 of that code (kinda old, but works with OFED 1.3.1
>> w/o hacks).
>>
>
> What is the mode (pass thru, blockio...)?
blockio
> What is the scst_threads= parameters?
Default, which I believe is #cpus
>
>>
>>>
>>>
>>> My target server (with DAS) contains 8 2.8 GHz CPU cores and can
>>> sustain over 200K IOPs locally, but only around 73K IOPs over SRP.
>
> Is this number from one initiator or multiple?
One initiator. At first I thought it might be a limitation of the SRP,
and added a second initiator, but the aggregate performance of the two
was about equal to that of a single initiator.
>
>>> Looking at /proc/interrupts, I see that the mlx_core (comp) device
>>> is pushing about 135K Int/s on 1 of 2 CPUs. All CPUs are enabled for
>>> that PCI-E slot, but it only ever uses 2 of the CPUs, and only 1 at
>>> a time. None of the other CPUs has an interrupt rate more than about
>>> 40-50K/s.
>>>
>
> The number of interrupt can be cut down if there are more completions
> to be processed by sw. ie. please test with multiple QPs between one
> initiator vs. your target and multiple initiators vs. your target
>
A couple questions here on my side. How would more QP connections reduce
interrupts? It seems like they'd still need to come through the same mlx
device, causing the same number, or more, of interrupts.
More importantly though, how would one increase the number of QPs between
an initiator and target? I did have my ib_srpt threads up, would that be
comparable?
>>> Does anyone know of a trick to spread those interrupts out more
>>> (which I realize might be bad due to context switching), or
>>> something else that will reduce my interrupts on that cpu? The mlx4
>>> is a MSI-X interrupt. I've changed it to an APIC int, but it seems
>>> to give slightly lower performance.
>>>
> There userspace daemon, irqbalanced, that dynamically directs IRQs to
> different CPUs. You can define which CPUs CAN handle an IRQ but you
> cannot control how it is done. You can look at
> Documentation/IRQ-affinity.txt for details how to configure it. In
> some cases I found better performance-wise to shut the irqbalanced off
> and assign the process to one (ore more) CPU and use a different CPU
> to serve interrupts.
>
Earlier, I did go over that file, and tried playing around with
/sys/class/pci_bus//cpu_affinity and /proc/irq//smp_affinity for the pci
slot I was using, but didn't have much luck. I also tried turning off
irqbalance, but that made no difference.

Additionally, I found that I can load the newer scst code if I use the
kernel-supplied modules and the standalone srpt-1.0.0 package that I
think you provide, Vu. I was about to try it along with dropping a module
param for ib_srpt (I was using a thread count of 32 that had given me
better performance on an earlier test). I'll report back on this.

Thanks for the help,
Cameron

From halr at obsidianresearch.com Fri Oct 3 11:54:42 2008
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Fri, 03 Oct 2008 12:54:42 -0600
Subject: [ofa-general] [PATCH][TRIVIAL] OpenSM: Display port number in decimal in log messages
Message-ID: <48E66A72.6020408@obsidianresearch.com>

Sasha,

Cosmetic patch attached.

-- Hal
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch-osm-portnum1
URL:

From rdreier at cisco.com Fri Oct 3 12:26:31 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 03 Oct 2008 12:26:31 -0700
Subject: [ofa-general] Re: [PATCH 13/13] RDMA/nes: Enhanced PFT management scheme
In-Reply-To: <200809262008.m8QK8A2B011727@velma.neteffect.com> (Chien Tung's message of "Fri, 26 Sep 2008 15:08:10 -0500")
References: <200809262008.m8QK8A2B011727@velma.neteffect.com>
Message-ID:

This patch seems broken:

> @@ -842,7 +845,12 @@ static void nes_netdev_set_multicast_list(struct net_device *netdev)
>
> +    spin_lock_irqsave(&nesadapter->resource_lock, flags);

nes_netdev_set_multicast_list() takes resource lock unconditionally here, but...

>     if (!mc_all_on) {

...

> +        spin_unlock_irqrestore(&nesadapter->resource_lock, flags);
>     }
> }

only drops the lock if mc_all_on is not set.

I rolled the below into the patch to fix things:

diff --git a/drivers/infiniband/hw/nes/nes_nic.c b/drivers/infiniband/hw/nes/nes_nic.c
index 26809f4..12ff698 100644
--- a/drivers/infiniband/hw/nes/nes_nic.c
+++ b/drivers/infiniband/hw/nes/nes_nic.c
@@ -952,8 +952,9 @@ static void nes_netdev_set_multicast_list(struct net_device *netdev)
             nes_write_indexed(nesdev, NES_IDX_NIC_UNICAST_ALL, nic_active);
         }
-        spin_unlock_irqrestore(&nesadapter->resource_lock, flags);
     }
+
+    spin_unlock_irqrestore(&nesadapter->resource_lock, flags);
 }

From ctung at NetEffect.com Fri Oct 3 13:00:35 2008
From: ctung at NetEffect.com (Chien Tung)
Date: Fri, 3 Oct 2008 15:00:35 -0500
Subject: [ofa-general] RE: [PATCH 13/13] RDMA/nes: Enhanced PFT management scheme
In-Reply-To:
References: <200809262008.m8QK8A2B011727@velma.neteffect.com>
Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC08510D00@venom2>

> This patch seems broken:

It is broken.
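
As a minimal sketch of the lock imbalance described in this thread, assuming a toy counter in place of the kernel spinlock: the names toy_lock, set_list_broken and set_list_fixed are hypothetical and are not taken from the nes driver, and the counter only stands in for spin_lock_irqsave()/spin_unlock_irqrestore().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a spinlock (assumption for illustration only):
 * "held" counts outstanding acquisitions. */
struct toy_lock {
    int held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
    assert(l->held == 0);   /* acquiring twice would deadlock a real spinlock */
    l->held++;
}

static void toy_lock_release(struct toy_lock *l)
{
    assert(l->held == 1);   /* releasing a lock that is not held is a bug */
    l->held--;
}

/* Broken shape: the lock is taken unconditionally but only released
 * inside the branch, so the mc_all_on == true path returns with it held. */
static void set_list_broken(struct toy_lock *l, bool mc_all_on)
{
    toy_lock_acquire(l);
    if (!mc_all_on) {
        /* ... program the filters ... */
        toy_lock_release(l);
    }
}

/* Fixed shape: the release sits after the branch, so every path
 * through the function drops the lock exactly once. */
static void set_list_fixed(struct toy_lock *l, bool mc_all_on)
{
    toy_lock_acquire(l);
    if (!mc_all_on) {
        /* ... program the filters ... */
    }
    toy_lock_release(l);
}
```

With a fresh lock, set_list_broken(&l, true) returns with l.held still at 1, while set_list_fixed() returns with l.held at 0 for either value of mc_all_on.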

>
> > @@ -842,7 +845,12 @@ static void nes_netdev_set_multicast_list(struct net_device *netdev)
> >
> > +    spin_lock_irqsave(&nesadapter->resource_lock, flags);
>
> nes_netdev_set_multicast_list() takes resource lock
> unconditionally here, but...
>
> >     if (!mc_all_on) {
>
> ...
>
> > +        spin_unlock_irqrestore(&nesadapter->resource_lock, flags);
> >     }
> > }

spin_unlock_irqrestore should be outside of the if(...).

+       if (!mc_all_on) {
+               ...
+               nes_write_indexed(nesdev, NES_IDX_NIC_UNICAST_ALL,
+                                 nic_active);
+               }
        }
+       spin_unlock_irqrestore(&nesadapter->resource_lock, flags);
 }

If you can fix it again that would be great but let me know if you want
a v2 patch.

Chien

From cameron at harr.org Fri Oct 3 13:04:28 2008
From: cameron at harr.org (Cameron Harr)
Date: Fri, 03 Oct 2008 14:04:28 -0600
Subject: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <48E65FE0.2060602@harr.org>
References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org>
Message-ID: <48E67ACC.1020903@harr.org>

Cameron Harr wrote:
> Additionally, I found that I can load the newer scst code if I use the
> kernel-supplied modules and the standalone srpt-1.0.0 package that I
> think you provide Vu. I was about to try it along with dropping a
> module param for ib_srpt (I was using a thread count of 32 that had
> given me better performance on an earlier test). I'll report back on
> this.

Not much luck using the newer scst code and default kernel modules
(Running CentOS 5.2). If I try using the default kernel modules on the
initiator, I can't get them to see anything (the ofed SM pkg doesn't see
any devices to run on). When using the regular OFED on the initiator, my
target dies when I try to attach to the target on the initiator:
---------------------------------
ib_srpt: Host login i_port_id=0x0:0x2c90300026053 t_port_id=0x2c90300026046:0x2c90300026046 it_iu_len=996
Oct 3 13:44:23 test05 kernel: i[4127]: scst: scst_mgmt_thread:5187:***CRITICAL ERROR*** session ffff8107f3222b88 is in scst_sess_shut_list, but in unknown shut phase 0
BUG at /usr/src/scst.tot/src/scst_targ.c:5188
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at /usr/src/scst.tot/src/scst_targ.c:5188
invalid opcode: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/class
CPU 2
Modules linked in: ib_srpt(U) ib_cm ib_sa scst_vdisk(U) scst(U) fio_driver(PU) fio_port(PU) mlx4_ib ib_mad ib_core ipv6 xfrm_nalgo crypto_api autofs4 hidp rfcomm l2cap bluetooth sunrpc nls_utf8 hfsplus dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport i2c_i801 i5000_edac i2c_core edac_mc pcspkr shpchp mlx4_core e1000e ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 4127, comm: scsi_tgt_mgmt Tainted: P 2.6.18-92.1.13.el5 #1
RIP: 0010:[] [] :scst:scst_mgmt_thread+0x3ff/0x577
---------------------------------

From halr at obsidianresearch.com Fri Oct 3 13:07:13 2008
From: halr at obsidianresearch.com (Hal Rosenstock)
Date: Fri, 03 Oct 2008 14:07:13 -0600
Subject: [ofa-general] [PATCHv2][TRIVIAL] OpenSM: Display port number in decimal in log messages
Message-ID: <48E67B71.8050508@obsidianresearch.com>

Sasha,

Cosmetic patch attached. Found some more cases...

-- Hal
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch-osm-portnum2
URL:

From rdreier at cisco.com Fri Oct 3 13:16:05 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 03 Oct 2008 13:16:05 -0700
Subject: [ofa-general] RE: [PATCH 13/13] RDMA/nes: Enhanced PFT management scheme
In-Reply-To: <5E701717F2B2ED4EA60F87C8AA57B7CC08510D00@venom2> (Chien Tung's message of "Fri, 3 Oct 2008 15:00:35 -0500")
References: <200809262008.m8QK8A2B011727@velma.neteffect.com> <5E701717F2B2ED4EA60F87C8AA57B7CC08510D00@venom2>
Message-ID:

> spin_unlock_irqrestore should be outside of the if(...).
>
>
>
> +       if (!mc_all_on) {
> +               ...
> +               nes_write_indexed(nesdev, NES_IDX_NIC_UNICAST_ALL,
> +                                 nic_active);
> +               }
>         }
> +       spin_unlock_irqrestore(&nesadapter->resource_lock, flags);
>  }
>
>
> If you can fix it again that would be great but let me know if you want
> a v2 patch.

That's what I did in the patch I rolled in. It's working here so I
think I got it right.

From ctung at neteffect.com Fri Oct 3 13:43:21 2008
From: ctung at neteffect.com (Chien Tung)
Date: Fri, 3 Oct 2008 15:43:21 -0500
Subject: [ofa-general] [PATCH] RDMA/nes: Fix slab corruption
Message-ID: <200810032043.m93KhL9P013848@velma.neteffect.com>

From: Chien Tung

RDMA/nes: Fix slab corruption

Referencing cm_node after it is freed via rem_ref_cm_node() caused
a slab corruption. There is no need to set cm_node->cm_id to NULL
in mini_cm_close().

Signed-off-by: Chien Tung
--

Roland,

Please discard "[PATCH 4/4] RDMA/nes: Clear cm_id only when done with cm_node"
and use this patch instead.

The intent of the original patch was to patch a slab corruption caused
by referencing cm_node->cm_id after cm_node is freed. Adding
cm_node->cm_id = NULL; to cases that are not freeing cm_node doesn't make
any sense either as cm_id is needed to free cm_node.
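
A minimal sketch of the hazard described above, assuming a hypothetical reference-counted node: the names node, node_put and node_close are illustrative only, and node_put stands in loosely for rem_ref_cm_node(), whose real implementation is not shown in this thread.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted object; not the driver's struct nes_cm_node. */
struct node {
    int refcount;
    void *cm_id;
};

/* Drop one reference and free the node when the last reference goes away.
 * Returns 1 if the node was freed, 0 otherwise. */
static int node_put(struct node *n)
{
    assert(n->refcount > 0);
    if (--n->refcount == 0) {
        free(n);
        return 1;
    }
    return 0;
}

/* Close path: once node_put() has run, the node may already be freed,
 * so it is not dereferenced again. */
static int node_close(struct node *n)
{
    int freed = node_put(n);

    /* An assignment such as n->cm_id = NULL at this point would write
     * into freed memory whenever freed == 1. */
    return freed;
}
```

The design point is simply that any field access has to happen while the caller still holds a reference; after the final put, the pointer is treated as dead.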
Needless to say, we are working on more fix/cleanup patches for nes_cm.c

 drivers/infiniband/hw/nes/nes_cm.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c
index d69226d..2caf9da 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -2029,7 +2029,6 @@ static int mini_cm_close(struct nes_cm_core *cm_core, struct nes_cm_node *cm_nod
 		ret = rem_ref_cm_node(cm_core, cm_node);
 		break;
 	}
-	cm_node->cm_id = NULL;
 	return ret;
 }

From cameron at harr.org Fri Oct 3 15:00:25 2008
From: cameron at harr.org (Cameron Harr)
Date: Fri, 03 Oct 2008 16:00:25 -0600
Subject: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <48E67ACC.1020903@harr.org>
References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org>
Message-ID: <48E695F9.80703@harr.org>

I was able to get the latest scst code working with Vu's standalone ib_srpt and the kernel IB modules, and dropped my ib_srpt thread count to 2. However, I still get about the same IOP performance on the target although interrupts on the "busy" cpu have gone up to around 140K. Interesting, but now I'm at a bit of a loss as to where the bottleneck could be. I figured it was Interrupts, but if the CPU is handling more right now, perhaps the problem is elsewhere?

Cameron

Cameron Harr wrote:
> Cameron Harr wrote:
>> Additionally, I found that I can load the newer scst code if I use
>> the kernel-supplied modules and the standalone srpt-1.0.0 package
>> that I think you provide Vu. I was about to try it along with
>> dropping a module param for ib_srpt (I was using a thread count of 32
>> that had given me better performance on an earlier test). I'll report
>> back on this.
>
> Not much luck using the newer scst code and default kernel modules
> (Running CentOS 5.2).
If I try using the default kernel modules on the
> initiator, I can't get them to see anything (the ofed SM pkg doesn't
> see any devices to run on). When using the regular OFED on the
> initiator, my target dies when I try to attach to the target on the
> initiator:
> ---------------------------------
> ib_srpt: Host login i_port_id=0x0:0x2c90300026053
> t_port_id=0x2c90300026046:0x2c90300026046 it_iu_len=996
> Oct 3 13:44:23 test05 kernel: i[4127]: scst:
> scst_mgmt_thread:5187:***CRITICAL ERROR*** session ffff8107f3222b88 is
> in scst_sess_shut_list, but in unknown shut phase 0
> BUG at /usr/src/scst.tot/src/scst_targ.c:5188
> ----------- [cut here ] --------- [please bite here ] ---------
> Kernel BUG at /usr/src/scst.tot/src/scst_targ.c:5188
> invalid opcode: 0000 [1] SMP
> last sysfs file: /devices/pci0000:00/0000:00:00.0/class
> CPU 2
> Modules linked in: ib_srpt(U) ib_cm ib_sa scst_vdisk(U) scst(U)
> fio_driver(PU) fio_port(PU) mlx4_ib ib_mad ib_core ipv6 xfrm_nalgo
> crypto_api autofs4 hidp rfcomm l2cap bluetooth sunrpc nls_utf8 hfsplus
> dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button
> battery asus_acpi acpi_memhotplug ac parport_pc lp parport i2c_i801
> i5000_edac i2c_core edac_mc pcspkr shpchp mlx4_core e1000e ata_piix
> libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
> Pid: 4127, comm: scsi_tgt_mgmt Tainted: P 2.6.18-92.1.13.el5 #1
> RIP: 0010:[] []
> :scst:scst_mgmt_thread+0x3ff/0x577
> ---------------------------------
>

From vuhuong at mellanox.com Fri Oct 3 15:57:54 2008
From: vuhuong at mellanox.com (Vu Pham)
Date: Fri, 03 Oct 2008 15:57:54 -0700
Subject: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <48E65FE0.2060602@harr.org>
References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org>
Message-ID: <48E6A372.8000702@mellanox.com>

Cameron Harr wrote:
> Vu Pham wrote:
>>
>>> Alternatively, is there anything in the SCST layer I
should tweak. I'm
>>> still running rev 245 of that code (kinda old, but works with OFED
>>> 1.3.1
>>> w/o hacks).
>>>
>>
>> What is the mode (pass thru, blockio...)?
> blockio
>> What is the scst_threads= parameters?
> Default, which I believe is #cpus

With blockio I get the best performance + stability with scst_threads=1

>>
>>>
>>>>
>>>>
>>>> My target server (with DAS) contains 8 2.8 GHz CPU cores and can
>>>> sustain over 200K IOPs locally, but only around 73K IOPs over SRP.
>>
>> Is this number from one initiator or multiple?
> One initiator. At first I thought it might be a limitation of the SRP,
> and added a second initiator, but the aggregate performance of the two
> was about equal to that of a single initiator.

Try again with scst_threads=1. I expect that you can get ~140K with two initiators

>
>>
>>>> Looking at /proc/interrupts, I see that the mlx_core (comp) device
>>>> is pushing about 135K Int/s on 1 of 2 CPUs. All CPUs are enabled
>>>> for that PCI-E slot, but it only ever uses 2 of the CPUs, and only
>>>> 1 at a time. None of the other CPUs has an interrupt rate more than
>>>> about 40-50K/s.
>>>>
>>
>> The number of interrupts can be cut down if there are more completions
>> to be processed by sw. ie. please test with multiple QPs between one
>> initiator vs. your target and multiple initiators vs. your target
>>
> A couple questions here on my side. How would more QP connections
> reduce interrupts? It seems like they'd still need to come through the
> same mlx device, causing the same number or more, of interrupts. More
> importantly though, how would one increase the number of QPs between
> an initiator and target? I did have my ib_srpt threads up, would that
> be comparable?

ib_srpt processes completions in its event callback handler. With more QPs there are more completions pending per interrupt instead of one completion event per interrupt.

You can have multiple QPs between initiator vs. target by using different initiator_ext values, i.e.
echo id_ext=xxx,ioc_guid=yyy,....initiator_ext=1 > /sys/class/infiniband_srp/.../add_target
echo id_ext=xxx,ioc_guid=yyy,....initiator_ext=2 > /sys/class/infiniband_srp/.../add_target
echo id_ext=xxx,ioc_guid=yyy,....initiator_ext=3 > /sys/class/infiniband_srp/.../add_target
...

For example you see /dev/sda, sdb through first connection/qp and you have sdc, sdd through second connection/qp

Then you can do I/Os to sda and sdd thru different QPs

-vu

From vlad at lists.openfabrics.org Sat Oct 4 03:13:35 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat, 4 Oct 2008 03:13:35 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20081004-0200 daily build status
Message-ID: <20081004101335.653EBE60C5D@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on
x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on ppc64 with linux-2.6.24
Log:
/home/vlad/tmp/ofa_1_4_kernel-20081004-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c: In function 'ehca_poll_eqs':
/home/vlad/tmp/ofa_1_4_kernel-20081004-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:942: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type
/home/vlad/tmp/ofa_1_4_kernel-20081004-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:946: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type
make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081004-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081004-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081004-0200_linux-2.6.24_ppc64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081004-0200_linux-2.6.24_ppc64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.24'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------

From vlad at lists.openfabrics.org Sun Oct 5 03:25:03 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun, 5 Oct
2008 03:25:03 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20081005-0200 daily build status
Message-ID: <20081005102503.8B345E60D74@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on ppc64 with linux-2.6.24
Log:
/home/vlad/tmp/ofa_1_4_kernel-20081005-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c: In function 'ehca_poll_eqs':
/home/vlad/tmp/ofa_1_4_kernel-20081005-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:942: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type
/home/vlad/tmp/ofa_1_4_kernel-20081005-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:946: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type
make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081005-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081005-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081005-0200_linux-2.6.24_ppc64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081005-0200_linux-2.6.24_ppc64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.24'
make: *** [kernel] Error 2
----------------------------------------------------------------------------------

From amirv at mellanox.co.il Sun Oct 5 06:34:42 2008
From: amirv at mellanox.co.il (Amir Vadai)
Date: Sun, 5 Oct 2008 15:34:42 +0200
Subject: [ofa-general] [PATCH] SDP: fix initial recv buffer size
In-Reply-To: <>
References: <>
Message-ID: <1223213682-30014-1-git-send-email-amirv@mellanox.co.il>

Set initial recv buffer according to incoming hha header.
Fixed bugzilla 1086: SDP Linux and SDP windows don't work together

Signed-off-by: Amir Vadai
---
 drivers/infiniband/ulp/sdp/sdp.h       |    2 ++
 drivers/infiniband/ulp/sdp/sdp_bcopy.c |   15 +++++++++++++++
 drivers/infiniband/ulp/sdp/sdp_cma.c   |    7 ++-----
 3 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/ulp/sdp/sdp.h b/drivers/infiniband/ulp/sdp/sdp.h
index bcf2125..2f09073 100644
--- a/drivers/infiniband/ulp/sdp/sdp.h
+++ b/drivers/infiniband/ulp/sdp/sdp.h
@@ -196,6 +196,7 @@ struct bzcopy_state {
 	struct page **pages;
 };

+extern int rcvbuf_initial_size;
 extern struct proto sdp_proto;
 extern struct workqueue_struct *sdp_workqueue;

@@ -311,6 +312,7 @@ void sdp_add_sock(struct sdp_sock *ssk);
 void sdp_remove_sock(struct sdp_sock *ssk);
 void sdp_remove_large_sock(struct sdp_sock *ssk);
 int sdp_resize_buffers(struct sdp_sock *ssk, u32 new_size);
+int sdp_init_buffers(struct sdp_sock *ssk, u32 new_size);
 void sdp_post_keepalive(struct sdp_sock *ssk);
 void sdp_start_keepalive_timer(struct sock *sk);
 void sdp_bzcopy_write_space(struct sdp_sock *ssk);
diff --git a/drivers/infiniband/ulp/sdp/sdp_bcopy.c b/drivers/infiniband/ulp/sdp/sdp_bcopy.c
index 7553003..20f6a33 100644
--- a/drivers/infiniband/ulp/sdp/sdp_bcopy.c
+++ b/drivers/infiniband/ulp/sdp/sdp_bcopy.c
@@ -45,6 +45,10 @@ struct sdp_chrecvbuf {

 static int rcvbuf_scale = 0x10;

+int rcvbuf_initial_size = SDP_HEAD_SIZE;
+module_param_named(rcvbuf_initial_size, rcvbuf_initial_size, int, 0644);
+MODULE_PARM_DESC(rcvbuf_initial_size, "Receive buffer initial size in bytes.");
+
 module_param_named(rcvbuf_scale, rcvbuf_scale, int, 0644);
 MODULE_PARM_DESC(rcvbuf_scale, "Receive buffer size scale factor.");

@@ -578,6 +582,17 @@ void sdp_post_sends(struct sdp_sock *ssk, int nonagle)
 	}
 }

+int sdp_init_buffers(struct sdp_sock *ssk, u32 new_size)
+{
+	ssk->recv_frags = PAGE_ALIGN(new_size - SDP_HEAD_SIZE) / PAGE_SIZE;
+	if (ssk->recv_frags > SDP_MAX_SEND_SKB_FRAGS)
+		ssk->recv_frags =
SDP_MAX_SEND_SKB_FRAGS;
+
+	sdp_post_recvs(ssk);
+
+	return 0;
+}
+
 int sdp_resize_buffers(struct sdp_sock *ssk, u32 new_size)
 {
 	u32 curr_size = SDP_HEAD_SIZE + ssk->recv_frags * PAGE_SIZE;
diff --git a/drivers/infiniband/ulp/sdp/sdp_cma.c b/drivers/infiniband/ulp/sdp/sdp_cma.c
index 6126902..ba46098 100644
--- a/drivers/infiniband/ulp/sdp/sdp_cma.c
+++ b/drivers/infiniband/ulp/sdp/sdp_cma.c
@@ -176,10 +176,6 @@ int sdp_init_qp(struct sock *sk, struct rdma_cm_id *id)

 	init_waitqueue_head(&sdp_sk(sk)->wq);

-	sdp_sk(sk)->recv_frags = 0;
-	sdp_sk(sk)->rcvbuf_scale = 1;
-	sdp_post_recvs(sdp_sk(sk));
-
 	sdp_dbg(sk, "%s done\n", __func__);
 	return 0;

@@ -241,7 +237,7 @@ int sdp_connect_handler(struct sock *sk, struct rdma_cm_id *id,
 		sizeof(struct sdp_bsdh);
 	sdp_sk(child)->send_frags = PAGE_ALIGN(sdp_sk(child)->xmit_size_goal) / PAGE_SIZE;
-	sdp_resize_buffers(sdp_sk(child), ntohl(h->desremrcvsz));
+	sdp_init_buffers(sdp_sk(child), ntohl(h->desremrcvsz));

 	sdp_dbg(child, "%s bufs %d xmit_size_goal %d send trigger %d\n", __func__,
@@ -419,6 +415,7 @@ int sdp_cma_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
 		hh.bsdh.len = htonl(sizeof(struct sdp_bsdh) + SDP_HH_SIZE);
 		hh.max_adverts = 1;
 		hh.majv_minv = SDP_MAJV_MINV;
+		sdp_init_buffers(sdp_sk(sk), rcvbuf_initial_size);
 		hh.localrcvsz = hh.desremrcvsz = htonl(sdp_sk(sk)->recv_frags * PAGE_SIZE + SDP_HEAD_SIZE);
 		hh.max_adverts = 0x1;
--
1.5.3

From kliteyn at dev.mellanox.co.il Sun Oct 5 18:26:00 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 06 Oct 2008 03:26:00 +0200
Subject: [ofa-general] [PATCH 0/6] opensm: Unicast Routing Cache
Message-ID: <48E96928.8030200@dev.mellanox.co.il>

Hi Sasha,

The following series of 6 patches implements unicast routing cache in OpenSM.
This implementation (v2, previous version was sent before OFED 1.3) was rewritten from scratch:
 - no caching of existing connectivity
 - no caching of existing lid matrices
 - each switch has an LFT buffer that contains the result of the last routing engine execution (instead of one buffer in ucast_mgr)
 - links/ports/nodes changes are spotted during the discovery
 - only the links/ports/nodes that went down are cached
 - when a switch goes down, its lid matrices and LFT are cached

Cached routing can be used in any of the following cases:
 - there is no topology change
 - one or more CAs disappeared
 - one or more leaf switches disappeared

In these cases cached routing is written to the switches as is (unless the switch doesn't exist). If there is any other topology change, the existing cache is invalidated and the routing engine(s) run as usual.

The patches are:
 - patch 1/6: move lft_buf from ucast_mgr to osm_switch
 - patch 2/6: Add "-A" or "--ucast_cache" option to opensm
 - patch 3/6: adding osm_ucast_cache.{c,h} files (this is the cache implementation itself)
 - patch 4/6: adding new cache files to makefile
 - patch 5/6: integrating unicast cache into the discovery and ucast manager
 - patch 6/6: man entry for cached routing

-- Yevgeny

From kliteyn at dev.mellanox.co.il Sun Oct 5 18:26:42 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 06 Oct 2008 03:26:42 +0200
Subject: [ofa-general] [PATCH 1/6] opensm/Unicast Routing Cache: move lft_buf from ucast_mgr to osm_switch
Message-ID: <48E96952.9080503@dev.mellanox.co.il>

Instead of having a single lft_buf in ucast_mgr, each switch will hold its own lft_buf, which is the LFT that was calculated by the most recent routing engine execution.
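
The per-switch buffer idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the OpenSM code: `toy_switch`, `toy_switch_init`, `toy_switch_destroy`, `toy_switch_push_lft` and the `TOY_*` constants are hypothetical names, the table size is an assumed value, and the "hardware" is just a second in-memory buffer. It shows each switch owning its own LFT and only the 64-entry blocks that differ being written out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative values only; none of these names come from the OpenSM headers. */
#define TOY_LFT_SIZE  0xC000u	/* assumed: one entry per unicast LID */
#define TOY_LFT_BLOCK 64u	/* tables are compared and written in 64-entry blocks */
#define TOY_NO_PATH   0xFFu	/* assumed "no route" marker */

/* Hypothetical stand-in for a switch object: every switch owns its LFT. */
struct toy_switch {
	uint8_t *lft_buf;	/* result of the last routing calculation */
	uint8_t *programmed;	/* what was last written to the pretend hardware */
};

/* Allocate both tables and mark every LID as unroutable. Returns 0 on success. */
int toy_switch_init(struct toy_switch *sw)
{
	assert(sw != NULL);
	sw->lft_buf = malloc(TOY_LFT_SIZE);
	sw->programmed = malloc(TOY_LFT_SIZE);
	if (!sw->lft_buf || !sw->programmed) {
		free(sw->lft_buf);
		free(sw->programmed);
		sw->lft_buf = sw->programmed = NULL;
		return -1;
	}
	memset(sw->lft_buf, TOY_NO_PATH, TOY_LFT_SIZE);
	memset(sw->programmed, TOY_NO_PATH, TOY_LFT_SIZE);
	return 0;
}

void toy_switch_destroy(struct toy_switch *sw)
{
	free(sw->lft_buf);
	free(sw->programmed);
	sw->lft_buf = sw->programmed = NULL;
}

/* Write only the blocks that changed; return how many blocks were written. */
size_t toy_switch_push_lft(struct toy_switch *sw)
{
	size_t off, written = 0;

	for (off = 0; off < TOY_LFT_SIZE; off += TOY_LFT_BLOCK) {
		if (!memcmp(sw->programmed + off, sw->lft_buf + off, TOY_LFT_BLOCK))
			continue;	/* block unchanged, nothing to send */
		memcpy(sw->programmed + off, sw->lft_buf + off, TOY_LFT_BLOCK);
		written++;
	}
	return written;
}
```

Because the calculated table lives in the switch object rather than in one shared manager buffer, the result for one switch is not overwritten when the next switch is processed, which is what lets a later pass reuse it.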
Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_switch.h | 7 ++++++- opensm/include/opensm/osm_ucast_mgr.h | 6 +----- opensm/opensm/osm_switch.c | 12 +++++++++++- opensm/opensm/osm_ucast_file.c | 5 +++-- opensm/opensm/osm_ucast_ftree.c | 2 +- opensm/opensm/osm_ucast_lash.c | 8 ++++---- opensm/opensm/osm_ucast_mgr.c | 18 +++++------------- 7 files changed, 31 insertions(+), 27 deletions(-) diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h index 1c0f6e9..3d9a72d 100644 --- a/opensm/include/opensm/osm_switch.h +++ b/opensm/include/opensm/osm_switch.h @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -102,6 +102,7 @@ typedef struct osm_switch { uint8_t **hops; osm_port_profile_t *p_prof; osm_fwd_tbl_t fwd_tbl; + uint8_t *lft_buf; osm_mcast_tbl_t mcast_tbl; uint32_t discovery_count; unsigned need_update; @@ -137,6 +138,10 @@ typedef struct osm_switch { * fwd_tbl * This switch's forwarding table. * +* lft_buf +* This switch's linear forwarding table, as was +* calculated by the last routing engine execution. +* * mcast_tbl * Multicast forwarding table for this switch. * diff --git a/opensm/include/opensm/osm_ucast_mgr.h b/opensm/include/opensm/osm_ucast_mgr.h index 12be97a..27e89e9 100644 --- a/opensm/include/opensm/osm_ucast_mgr.h +++ b/opensm/include/opensm/osm_ucast_mgr.h @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* * This software is available to you under a choice of one of two @@ -97,7 +97,6 @@ typedef struct osm_ucast_mgr { cl_qlist_t port_order_list; boolean_t is_dor; boolean_t some_hop_count_set; - uint8_t *lft_buf; } osm_ucast_mgr_t; /* * FIELDS @@ -129,9 +128,6 @@ typedef struct osm_ucast_mgr { * tables calculation iteration cycle, set to TRUE to indicate * that some hop count changes were done. * -* lft_buf -* LFT buffer - used during LFT calculation/setup. -* * SEE ALSO * Unicast Manager object *********/ diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c index 77ef61e..9bf76e0 100644 --- a/opensm/opensm/osm_switch.c +++ b/opensm/opensm/osm_switch.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -101,6 +101,13 @@ osm_switch_init(IN osm_switch_t * const p_sw, if (status != IB_SUCCESS) goto Exit; + p_sw->lft_buf = malloc(IB_LID_UCAST_END_HO + 1); + if (!p_sw->lft_buf) { + status = IB_INSUFFICIENT_MEMORY; + goto Exit; + } + memset(p_sw->lft_buf, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); + p_sw->p_prof = malloc(sizeof(*p_sw->p_prof) * num_ports); if (p_sw->p_prof == NULL) { status = IB_INSUFFICIENT_MEMORY; @@ -132,6 +139,8 @@ void osm_switch_delete(IN OUT osm_switch_t ** const pp_sw) osm_mcast_tbl_destroy(&p_sw->mcast_tbl); free(p_sw->p_prof); osm_fwd_tbl_destroy(&p_sw->fwd_tbl); + if (p_sw->lft_buf) + free(p_sw->lft_buf); if (p_sw->hops) { for (i = 0; i < p_sw->num_hops; i++) if (p_sw->hops[i]) @@ -537,6 +546,7 @@ osm_switch_prepare_path_rebuild(IN osm_switch_t * p_sw, IN uint16_t max_lids) osm_port_prof_construct(&p_sw->p_prof[i]); osm_switch_clear_hops(p_sw); + memset(p_sw->lft_buf, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); if 
(!p_sw->hops) { hops = malloc((max_lids + 1) * sizeof(hops[0])); diff --git a/opensm/opensm/osm_ucast_file.c b/opensm/opensm/osm_ucast_file.c index cbd65c1..a6edf5d 100644 --- a/opensm/opensm/osm_ucast_file.c +++ b/opensm/opensm/osm_ucast_file.c @@ -1,5 +1,6 @@ /* * Copyright (c) 2006,2007 Voltaire, Inc. All rights reserved. + * Copyright (c) 2008 Mellanox Technologies LTD. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -91,7 +92,7 @@ static void add_path(osm_opensm_t * p_osm, new_lid); } - p_osm->sm.ucast_mgr.lft_buf[new_lid] = port_num; + p_sw->lft_buf[new_lid] = port_num; if (!(p_osm->subn.opt.port_profile_switch_nodes && port_guid && osm_get_switch_by_guid(&p_osm->subn, port_guid))) osm_switch_count_path(p_sw, port_num); @@ -195,7 +196,7 @@ static int do_ucast_file_load(void *context) cl_ntoh64(sw_guid)); continue; } - memset(p_osm->sm.ucast_mgr.lft_buf, OSM_NO_PATH, + memset(p_sw->lft_buf, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); } else if (p_sw && !strncmp(p, "0x", 2)) { p += 2; diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c index 15168b7..35a6a1c 100644 --- a/opensm/opensm/osm_ucast_ftree.c +++ b/opensm/opensm/osm_ucast_ftree.c @@ -1945,7 +1945,7 @@ static void __osm_ftree_set_sw_fwd_table(IN cl_map_item_t * const p_map_item, p_sw->p_osm_sw->max_lid_ho = p_ftree->lft_max_lid_ho; - memcpy(p_ftree->p_osm->sm.ucast_mgr.lft_buf, p_sw->lft_buf, lft_len); + memcpy(p_sw->p_osm_sw->lft_buf, p_sw->lft_buf, lft_len); osm_ucast_mgr_set_fwd_table(&p_ftree->p_osm->sm.ucast_mgr, p_sw->p_osm_sw); } diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index ce3982f..560a210 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. 
+ * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2007 Simula Research Laboratory. All rights reserved. * Copyright (c) 2007 Silicon Graphics Inc. All rights reserved. @@ -1068,7 +1068,7 @@ static void populate_fwd_tbls(lash_t * p_lash) current_guid = p_sw->p_node->node_info.port_guid; sw = p_sw->priv; - memset(p_osm->sm.ucast_mgr.lft_buf, 0xff, + memset(p_sw->lft_buf, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); for (lid = 1; lid <= max_lid_ho; lid++) { @@ -1080,7 +1080,7 @@ static void populate_fwd_tbls(lash_t * p_lash) if (p_dst_sw == p_sw) { uint8_t egress_port = port->p_node->sw ? 0 : port->p_physp->p_remote_physp->port_num; - p_osm->sm.ucast_mgr.lft_buf[lid] = egress_port; + p_sw->lft_buf[lid] = egress_port; OSM_LOG(p_log, OSM_LOG_VERBOSE, "LASH fwd MY SRC SRC GUID 0x%016" PRIx64 " src lash id (%d), src lid no (%u) src lash port (%d) " @@ -1100,7 +1100,7 @@ static void populate_fwd_tbls(lash_t * p_lash) virtual_physical_port_table [lash_egress_port]; - p_osm->sm.ucast_mgr.lft_buf[lid] = + p_sw->lft_buf[lid] = physical_egress_port; OSM_LOG(p_log, OSM_LOG_VERBOSE, "LASH fwd SRC GUID 0x%016" PRIx64 diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index bde0c29..12a8b58 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* * This software is available to you under a choice of one of two @@ -73,10 +73,6 @@ void osm_ucast_mgr_destroy(IN osm_ucast_mgr_t * const p_mgr) CL_ASSERT(p_mgr); OSM_LOG_ENTER(p_mgr->p_log); - - if (p_mgr->lft_buf) - free(p_mgr->lft_buf); - OSM_LOG_EXIT(p_mgr->p_log); } @@ -96,10 +92,6 @@ osm_ucast_mgr_init(IN osm_ucast_mgr_t * const p_mgr, IN osm_sm_t * sm) p_mgr->p_subn = sm->p_subn; p_mgr->p_lock = sm->p_lock; - p_mgr->lft_buf = malloc(IB_LID_UCAST_END_HO + 1); - if (!p_mgr->lft_buf) - return IB_INSUFFICIENT_MEMORY; - OSM_LOG_EXIT(p_mgr->p_log); return (status); } @@ -297,7 +289,7 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr, We have selected the port for this LID. Write it to the forwarding tables. */ - p_mgr->lft_buf[lid_ho] = port; + p_sw->lft_buf[lid_ho] = port; if (!is_ignored_by_port_prof) { struct osm_remote_node *rem_node_used; osm_switch_count_path(p_sw, port); @@ -397,14 +389,14 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr, osm_switch_get_fwd_tbl_block(p_sw, block_id_ho, block); block_id_ho++) { if (!p_sw->need_update && - !memcmp(block, p_mgr->lft_buf + block_id_ho * 64, 64)) + !memcmp(block, p_sw->lft_buf + block_id_ho * 64, 64)) continue; OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, "Writing FT block %u\n", block_id_ho); status = osm_req_set(p_mgr->sm, p_path, - p_mgr->lft_buf + block_id_ho * 64, + p_sw->lft_buf + block_id_ho * 64, sizeof(block), IB_MAD_ATTR_LIN_FWD_TBL, cl_hton32(block_id_ho), @@ -481,7 +473,7 @@ __osm_ucast_mgr_process_tbl(IN cl_map_item_t * const p_map_item, cl_ntoh64(osm_node_get_node_guid(p_sw->p_node))); /* Initialize LIDs in buffer to invalid port number. 
*/ - memset(p_mgr->lft_buf, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); + memset(p_sw->lft_buf, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); if (p_mgr->p_subn->opt.lmc) alloc_ports_priv(p_mgr); -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Sun Oct 5 18:27:00 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 06 Oct 2008 03:27:00 +0200 Subject: [ofa-general] [PATCH 2/6] opensm/Unicast Routing Cache: add -A / --ucast_cache option Message-ID: <48E96964.1040409@dev.mellanox.co.il> Add "-A" or "--ucast_cache" option to opensm. Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_subnet.h | 6 +++++- opensm/opensm/main.c | 20 ++++++++++++++++++-- opensm/opensm/osm_subnet.c | 11 ++++++++++- 3 files changed, 33 insertions(+), 4 deletions(-) diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index 0c7f3b9..1ee6362 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. * @@ -183,6 +183,7 @@ typedef struct osm_subn_opt { boolean_t port_profile_switch_nodes; boolean_t sweep_on_trap; char *routing_engine_names; + boolean_t use_ucast_cache; boolean_t connect_roots; char *lid_matrix_dump_file; char *lfts_file; @@ -361,6 +362,9 @@ typedef struct osm_subn_opt { * up/down routing engine (even if this violates "pure" deadlock * free up/down algorithm) * +* use_ucast_cache +* When TRUE enables unicast routing cache. 
+* * lid_matrix_dump_file * Name of the lid matrix dump file from where switch * lid matrices (min hops tables) will be loaded diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index 2f53157..3f4f9dd 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -189,6 +189,16 @@ static void show_usage(void) " and in this way be IBA compliant. In many cases,\n" " this can violate \"pure\" deadlock free algorithm, so\n" " use it carefully.\n\n"); + printf("-A\n" + "--ucast_cache\n" + " This option enables unicast routing cache to prevent\n" + " routing recalculation (which is a heavy task in a\n" + " large cluster) when there was no topology change\n" + " detected during the heavy sweep, or when the topology\n" + " change does not require new routing calculation,\n" + " e.g. 
in case of host reboot.\n" + " This option becomes very handy when the cluster size\n" + " is thousands of nodes.\n\n"); printf("-M\n" "--lid_matrix_file \n" " This option specifies the name of the lid matrix dump file\n" @@ -546,7 +556,7 @@ int main(int argc, char *argv[]) uint32_t val; unsigned config_file_done = 0; const char *const short_option = - "F:c:i:f:ed:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:NBIQvVhoryxp:n:q:k:C:"; + "F:c:i:f:ed:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:ANBIQvVhoryxp:n:q:k:C:"; /* In the array below, the 2nd parameter specifies the number @@ -583,6 +593,7 @@ int main(int argc, char *argv[]) {"priority", 1, NULL, 'p'}, {"smkey", 1, NULL, 'k'}, {"routing_engine", 1, NULL, 'R'}, + {"ucast_cache", 0, NULL, 'A'}, {"connect_roots", 0, NULL, 'z'}, {"lid_matrix_file", 1, NULL, 'M'}, {"lfts_file", 1, NULL, 'U'}, @@ -862,6 +873,11 @@ int main(int argc, char *argv[]) printf(" Connect roots option is on\n"); break; + case 'A': + opt.use_ucast_cache = TRUE; + printf(" Unicast routing cache option is on\n"); + break; + case 'M': opt.lid_matrix_dump_file = optarg; printf(" Lid matrix dump file is \'%s\'\n", optarg); diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index a39ce75..63c111c 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. 
* @@ -442,6 +442,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->port_prof_ignore_file = NULL; p_opt->port_profile_switch_nodes = FALSE; p_opt->sweep_on_trap = TRUE; + p_opt->use_ucast_cache = FALSE; p_opt->routing_engine_names = NULL; p_opt->connect_roots = FALSE; p_opt->lid_matrix_dump_file = NULL; @@ -1269,6 +1270,9 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts) opts_unpack_boolean("connect_roots", p_key, p_val, &p_opts->connect_roots); + opts_unpack_boolean("use_ucast_cache", + p_key, p_val, &p_opts->use_ucast_cache); + opts_unpack_charp("log_file", p_key, p_val, &p_opts->log_file); opts_unpack_uint32("log_max_size", @@ -1534,6 +1538,11 @@ int osm_subn_write_conf_file(char *file_name, IN osm_subn_opt_t *const p_opts) p_opts->connect_roots ? "TRUE" : "FALSE"); fprintf(opts_file, + "# Use unicast routing cache (use FALSE if unsure)\n" + "use_ucast_cache %s\n\n", + p_opts->use_ucast_cache ? "TRUE" : "FALSE"); + + fprintf(opts_file, "# Lid matrix dump file name\n" "lid_matrix_dump_file %s\n\n", p_opts->lid_matrix_dump_file ? p_opts->lid_matrix_dump_file : null_str); -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Sun Oct 5 18:28:06 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 06 Oct 2008 03:28:06 +0200 Subject: [ofa-general] [PATCH 3/6] opensm/Unicast Routing Cache: add osm_ucast_cache.{c, h} files Message-ID: <48E969A6.1000607@dev.mellanox.co.il> Implementation of the osm unicast routing cache. 
Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_ucast_cache.h | 439 ++++++++++++ opensm/opensm/osm_ucast_cache.c | 1176 +++++++++++++++++++++++++++++++ 2 files changed, 1615 insertions(+), 0 deletions(-) create mode 100644 opensm/include/opensm/osm_ucast_cache.h create mode 100644 opensm/opensm/osm_ucast_cache.c diff --git a/opensm/include/opensm/osm_ucast_cache.h b/opensm/include/opensm/osm_ucast_cache.h new file mode 100644 index 0000000..2dc1c4e --- /dev/null +++ b/opensm/include/opensm/osm_ucast_cache.h @@ -0,0 +1,439 @@ +/* + * Copyright (c) 2008 Mellanox Technologies LTD. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +/* + * Abstract: + * Header file that describes Unicast Cache functions. + * + * Environment: + * Linux User Mode + * + * $Revision: 1.4 $ + */ + +#ifndef _OSM_UCAST_CACHE_H_ +#define _OSM_UCAST_CACHE_H_ + +#include +#include +#include + +#ifdef __cplusplus +# define BEGIN_C_DECLS extern "C" { +# define END_C_DECLS } +#else /* !__cplusplus */ +# define BEGIN_C_DECLS +# define END_C_DECLS +#endif /* __cplusplus */ + +BEGIN_C_DECLS + +struct osm_ucast_mgr; + +/****h* OpenSM/Unicast Manager/Unicast Cache +* NAME +* Unicast Cache +* +* DESCRIPTION +* The Unicast Cache object encapsulates the information +* needed to cache and write unicast routing of the subnet. +* +* The Unicast Cache object is NOT thread safe. +* +* This object should be treated as opaque and should be +* manipulated only through the provided functions. +* +* AUTHOR +* Yevgeny Kliteynik, Mellanox +* +*********/ + +/****s* OpenSM: Unicast Cache/osm_ucast_cache_t +* NAME +* osm_ucast_cache_t +* +* DESCRIPTION +* Unicast Cache structure. +* +* This object should be treated as opaque and should +* be manipulated only through the provided functions. +* +* SYNOPSIS +*/ +typedef struct _osm_ucast_cache { + cl_qmap_t sw_tbl; + boolean_t valid; + struct osm_ucast_mgr * p_ucast_mgr; +} osm_ucast_cache_t; +/* +* FIELDS +* sw_tbl +* Cached switches table. +* +* valid +* TRUE if the cache is valid. +* +* p_ucast_mgr +* Pointer to the Unicast Manager for this subnet. +* +* SEE ALSO +* Unicast Manager object +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_construct +* NAME +* osm_ucast_cache_construct +* +* DESCRIPTION +* This function constructs a Unicast Cache object. +* +* SYNOPSIS +*/ +osm_ucast_cache_t * +osm_ucast_cache_construct(struct osm_ucast_mgr * const p_mgr); +/* +* PARAMETERS +* p_mgr +* [in] Pointer to a Unicast Manager object. +* +* RETURN VALUE +* This function returns the created Ucast Cache object on success, +* or NULL on any error. 
+* +* NOTES +* Allows calling osm_ucast_cache_destroy +* +* Calling osm_ucast_cache_construct is a prerequisite to +* calling any other method. +* +* SEE ALSO +* Unicast Cache object, osm_ucast_cache_destroy +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_destroy +* NAME +* osm_ucast_cache_destroy +* +* DESCRIPTION +* The osm_ucast_cache_destroy function destroys the object, +* releasing all resources. +* +* SYNOPSIS +*/ +void +osm_ucast_cache_destroy(osm_ucast_cache_t * p_cache); +/* +* PARAMETERS +* p_cache +* [in] Pointer to the object to destroy. +* +* RETURN VALUE +* This function does not return any value. +* +* NOTES +* Performs any necessary cleanup of the specified +* Unicast Cache object. +* Further operations should not be attempted on the +* destroyed object. +* This function should only be called after a call to +* osm_ucast_cache_construct. +* +* SEE ALSO +* Unicast Cache object, osm_ucast_cache_construct +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_invalidate +* NAME +* osm_ucast_cache_invalidate +* +* DESCRIPTION +* The osm_ucast_cache_invalidate function purges the +* unicast cache and marks the cache as invalid. +* +* SYNOPSIS +*/ +void +osm_ucast_cache_invalidate(osm_ucast_cache_t * p_cache); +/* +* PARAMETERS +* p_cache +* [in] Pointer to the object to invalidate. +* +* RETURN VALUE +* This function does not return any value. +* +* NOTES +* +* SEE ALSO +* Unicast Cache object +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_validate +* NAME +* osm_ucast_cache_validate +* +* DESCRIPTION +* The osm_ucast_cache_validate function checks +* whether or not the cached routing can be applied +* to the current subnet switches. +* +* SYNOPSIS +*/ +void +osm_ucast_cache_validate(osm_ucast_cache_t * p_cache); +/* +* PARAMETERS +* p_cache +* [in] Pointer to the object to check. +* +* RETURN VALUE +* This function does not return any value. 
+* +* NOTES +* This function checks the current subnet and the +* cached links, and decides whether or not there +* is a need to re-run unicast routing engine. +* If the cached routing can't be applied to the +* current subnet switches as is, cache is invalidated. +* +* SEE ALSO +* Unicast Cache object +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_mark_valid +* NAME +* osm_ucast_cache_mark_valid +* +* DESCRIPTION +* The osm_ucast_cache_mark_valid function marks +* the cache as valid. +* +* SYNOPSIS +*/ +void +osm_ucast_cache_mark_valid(osm_ucast_cache_t * p_cache); +/* +* PARAMETERS +* p_cache +* [in] Pointer to the object to mark. +* +* RETURN VALUE +* This function does not return any value. +* +* NOTES +* +* SEE ALSO +* Unicast Cache object +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_is_valid +* NAME +* osm_ucast_cache_is_valid +* +* DESCRIPTION +* Check whether the unicast cache is valid. +* +* SYNOPSIS +*/ +boolean_t +osm_ucast_cache_is_valid(osm_ucast_cache_t * p_cache); +/* +* PARAMETERS +* p_cache +* [in] Pointer to the object to check. +* +* RETURN VALUE +* TRUE if the cache is valid, FALSE otherwise. +* +* NOTES +* +* SEE ALSO +* Unicast Cache object +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_check_new_link +* NAME +* osm_ucast_cache_check_new_link +* +* DESCRIPTION +* The osm_ucast_cache_check_new_link checks whether +* the newly discovered link still allows us to use +* cached unicast routing. +* +* SYNOPSIS +*/ +void +osm_ucast_cache_check_new_link(osm_ucast_cache_t * p_cache, + osm_node_t * p_node_1, + uint8_t port_num_1, + osm_node_t * p_node_2, + uint8_t port_num_2); +/* +* PARAMETERS +* p_cache +* [in] Pointer to the cache object. +* +* p_node_1 +* [in] Pointer to the first node of the link. +* +* port_num_1 +* [in] Port number on the first node of the link. +* +* p_node_2 +* [in] Pointer to the second node of the link. +* +* port_num_2 +* [in] Port number on the second node of the link. 
+* +* RETURN VALUE +* This function does not return any value. +* +* NOTES +* The function checks whether the link was previously +* cached/dropped or is a completely new link. +* If it decides that the new link makes cached routing +* invalid, the cache is purged and marked as invalid. +* +* SEE ALSO +* Unicast Cache object +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_add_link +* NAME +* osm_ucast_cache_add_link +* +* DESCRIPTION +* The osm_ucast_cache_add_link function adds a link to the cache. +* +* SYNOPSIS +*/ +void +osm_ucast_cache_add_link(osm_ucast_cache_t * p_cache, + osm_node_t * p_node_1, + uint8_t port_num_1, + osm_node_t * p_node_2, + uint8_t port_num_2); +/* +* PARAMETERS +* p_cache +* [in] Pointer to the cache object. +* +* p_node_1 +* [in] Pointer to the first node of the link. +* +* port_num_1 +* [in] Port number on the first node of the link. +* +* p_node_2 +* [in] Pointer to the second node of the link. +* +* port_num_2 +* [in] Port number on the second node of the link. +* +* RETURN VALUE +* This function does not return any value. +* +* NOTES +* Since the cache operates with ports and not links, +* the function adds two port entries (both sides of the +* link) to the cache. +* If it decides that the dropped link makes cached routing +* invalid, the cache is purged and marked as invalid. +* +* SEE ALSO +* Unicast Cache object +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_add_node +* NAME +* osm_ucast_cache_add_node +* +* DESCRIPTION +* The osm_ucast_cache_add_node function adds a node and all +* its links to the cache. +* +* SYNOPSIS +*/ +void +osm_ucast_cache_add_node(osm_ucast_cache_t * p_cache, + osm_node_t * p_node); +/* +* PARAMETERS +* p_cache +* [in] Pointer to the cache object. +* +* p_node +* [in] Pointer to the node object that should be cached. +* +* RETURN VALUE +* This function does not return any value. 
+* +* NOTES +* If the function decides that the dropped node makes cached +* routing invalid, the cache is purged and marked as invalid. +* +* SEE ALSO +* Unicast Cache object +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_apply +* NAME +* osm_ucast_cache_apply +* +* DESCRIPTION +* The osm_ucast_cache_apply function writes the +* cached unicast routing on the subnet switches. +* +* SYNOPSIS +*/ +void +osm_ucast_cache_apply(osm_ucast_cache_t * p_cache); +/* +* PARAMETERS +* p_cache +* [in] Pointer to the cache object to be used. +* +* RETURN VALUE +* This function does not return any value. +* +* NOTES +* Iterates through all the subnet switches and writes +* the LFTs that were calculated during the last routing +* engine execution to the switches. +* +* SEE ALSO +* Unicast Cache object +*********/ + +END_C_DECLS +#endif /* _OSM_UCAST_CACHE_H_ */ diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c new file mode 100644 index 0000000..2c2154a --- /dev/null +++ b/opensm/opensm/osm_ucast_cache.c @@ -0,0 +1,1176 @@ +/* + * Copyright (c) 2008 Mellanox Technologies LTD. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +/* + * Abstract: + * Implementation of OpenSM Cached Unicast Routing + * + * Environment: + * Linux User Mode + * + */ + +#if HAVE_CONFIG_H +# include +#endif + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +typedef struct _cache_port { + boolean_t is_leaf; + uint16_t remote_lid_ho; +} cache_port_t; + +typedef struct _cache_switch { + cl_map_item_t map_item; + boolean_t dropped; + uint16_t max_lid_ho; + uint8_t num_ports; + cache_port_t * ports; + uint16_t num_hops; + uint8_t ** hops; + uint8_t * lft; +} cache_switch_t; + +/********************************************************************** + **********************************************************************/ + +static uint16_t +__cache_sw_get_base_lid_ho(cache_switch_t * p_sw) +{ + return p_sw->ports[0].remote_lid_ho; +} + +/********************************************************************** + **********************************************************************/ + +static boolean_t +__cache_sw_is_leaf(cache_switch_t * p_sw) +{ + return p_sw->ports[0].is_leaf; +} + +/********************************************************************** + **********************************************************************/ + +static void +__cache_sw_set_leaf(cache_switch_t * p_sw) +{ + p_sw->ports[0].is_leaf = TRUE; +} + +/********************************************************************** + 
**********************************************************************/ + +static cache_switch_t * +__cache_sw_new(uint16_t lid_ho) +{ + cache_switch_t * p_cache_sw = + (cache_switch_t *)malloc(sizeof(cache_switch_t)); + if (!p_cache_sw) + return NULL; + + memset(p_cache_sw, 0, sizeof(cache_switch_t)); + + p_cache_sw->ports = (cache_port_t *)malloc(sizeof(cache_port_t)); + if (!p_cache_sw->ports) { + free(p_cache_sw); + return NULL; + } + + /* port[0] fields represent this switch details - lid and type */ + p_cache_sw->ports[0].remote_lid_ho = lid_ho; + p_cache_sw->ports[0].is_leaf = FALSE; + + return p_cache_sw; +} + +/********************************************************************** + **********************************************************************/ + +static void +__cache_sw_destroy(cache_switch_t * p_sw) +{ + if (!p_sw) + return; + + if (p_sw->lft) + free(p_sw->lft); + if (p_sw->hops) + free(p_sw->hops); + if (p_sw->ports) + free(p_sw->ports); + free(p_sw); +} + +/********************************************************************** + **********************************************************************/ + +static cache_switch_t * +__cache_get_sw(osm_ucast_cache_t * p_cache, uint16_t lid_ho) +{ + cache_switch_t * p_cache_sw = + (cache_switch_t *)cl_qmap_get(&p_cache->sw_tbl, lid_ho); + if (p_cache_sw == (cache_switch_t *)cl_qmap_end(&p_cache->sw_tbl)) + p_cache_sw = NULL; + + return p_cache_sw; +} + +/********************************************************************** + **********************************************************************/ + +static cache_switch_t * +__cache_get_or_add_sw(osm_ucast_cache_t * p_cache, uint16_t lid_ho) +{ + cache_switch_t * p_cache_sw = __cache_get_sw(p_cache, lid_ho); + if (!p_cache_sw) { + p_cache_sw = __cache_sw_new(lid_ho); + if (p_cache_sw) + cl_qmap_insert(&p_cache->sw_tbl, lid_ho, + &p_cache_sw->map_item); + } + return p_cache_sw; +} + 
+/********************************************************************** + **********************************************************************/ + +static void +__cache_add_port(osm_ucast_cache_t * p_cache, + uint16_t lid_ho, + uint8_t port_num, + uint16_t remote_lid_ho, + boolean_t is_ca) +{ + cache_switch_t * p_cache_sw; + + OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); + + if (!lid_ho || !remote_lid_ho || !port_num) + goto Exit; + + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE, + "Caching switch port: lid %u [port %u] -> lid %u (%s)\n", + lid_ho, port_num, remote_lid_ho, + (is_ca)? "CA/RTR" : "SW"); + + p_cache_sw = __cache_get_or_add_sw(p_cache, lid_ho); + if (!p_cache_sw) { + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_ERROR, + "ERR AD01: Out of memory - cache is invalid\n"); + osm_ucast_cache_invalidate(p_cache); + goto Exit; + } + + if (port_num >= p_cache_sw->num_ports) { + cache_port_t * ports = (cache_port_t *) + malloc(sizeof(cache_port_t)*(port_num+1)); + memset(ports, 0, sizeof(cache_port_t)*(port_num+1)); + + if (p_cache_sw->ports) { + memcpy(ports, p_cache_sw->ports, + sizeof(cache_port_t)*(p_cache_sw->num_ports ? p_cache_sw->num_ports : 1)); + free(p_cache_sw->ports); + } + + p_cache_sw->ports = ports; + p_cache_sw->num_ports = port_num + 1; + } + + if (is_ca) + __cache_sw_set_leaf(p_cache_sw); + + if (p_cache_sw->ports[port_num].remote_lid_ho == 0) { + /* cache this link only if it hasn't been already cached */ + p_cache_sw->ports[port_num].remote_lid_ho = remote_lid_ho; + p_cache_sw->ports[port_num].is_leaf = is_ca; + } +Exit: + OSM_LOG_EXIT(p_cache->p_ucast_mgr->p_log); +} + +/********************************************************************** + **********************************************************************/ + +static void +__cache_cleanup_switches(osm_ucast_cache_t * p_cache) +{ + cache_switch_t * p_sw; + cache_switch_t * p_next_sw; + unsigned port_num; + boolean_t found_port; + + CL_ASSERT(p_cache); + if (!p_cache->valid) + return; + + p_next_sw = 
(cache_switch_t *) cl_qmap_head(&p_cache->sw_tbl); + while (p_next_sw != (cache_switch_t *) cl_qmap_end(&p_cache->sw_tbl)) { + p_sw = p_next_sw; + p_next_sw = (cache_switch_t *) cl_qmap_next(&p_sw->map_item); + + found_port = FALSE; + for (port_num = 1; port_num < p_sw->num_ports; port_num++) + if (p_sw->ports[port_num].remote_lid_ho) + found_port = TRUE; + + if (!found_port) { + cl_qmap_remove_item(&p_cache->sw_tbl, &p_sw->map_item); + __cache_sw_destroy(p_sw); + } + } +} + +/********************************************************************** + **********************************************************************/ + +static void +__cache_check_link_change(osm_ucast_cache_t * p_cache, + osm_physp_t * p_physp_1, + osm_physp_t * p_physp_2) +{ + OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); + CL_ASSERT(p_physp_1 && p_physp_2); + + if (!p_cache->valid) + goto Exit; + + if (!p_physp_1->p_remote_physp && !p_physp_2->p_remote_physp) + /* both ports were down - new link */ + goto Exit; + + /* unicast cache cannot tolerate any link location change */ + + if ((p_physp_1->p_remote_physp && + p_physp_1->p_remote_physp->p_remote_physp) || + (p_physp_2->p_remote_physp && + p_physp_2->p_remote_physp->p_remote_physp)) { + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, + "Link location change discovered - cache is invalid\n"); + osm_ucast_cache_invalidate(p_cache); + goto Exit; + } +Exit: + OSM_LOG_EXIT(p_cache->p_ucast_mgr->p_log); +} + +/********************************************************************** + **********************************************************************/ + +static void +__cache_remove_port(osm_ucast_cache_t * p_cache, + uint16_t lid_ho, + uint8_t port_num, + uint16_t remote_lid_ho, + boolean_t is_ca) +{ + cache_switch_t * p_cache_sw; + + OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); + + if (!p_cache->valid) + goto Exit; + + p_cache_sw = __cache_get_sw(p_cache, lid_ho); + if (!p_cache_sw) { + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, + "Found 
uncached switch/link (lid %u, port %u) - " + "cache is invalid\n", lid_ho, port_num); + osm_ucast_cache_invalidate(p_cache); + goto Exit; + } + + if (port_num >= p_cache_sw->num_ports || + !p_cache_sw->ports[port_num].remote_lid_ho) { + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, + "Found uncached switch link (lid %u, port %u) - " + "cache is invalid\n", lid_ho, port_num); + osm_ucast_cache_invalidate(p_cache); + goto Exit; + } + + if (p_cache_sw->ports[port_num].remote_lid_ho != remote_lid_ho) { + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, + "Remote lid change on switch lid %u, port %u " + "(was %u, now %u) - cache is invalid\n", + lid_ho, port_num, + p_cache_sw->ports[port_num].remote_lid_ho, + remote_lid_ho); + osm_ucast_cache_invalidate(p_cache); + goto Exit; + } + + if ((p_cache_sw->ports[port_num].is_leaf && !is_ca) || + (!p_cache_sw->ports[port_num].is_leaf && is_ca)) { + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, + "Remote node type change on switch lid %u, port %u - " + "cache is invalid\n", + lid_ho, port_num); + osm_ucast_cache_invalidate(p_cache); + goto Exit; + } + + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE, + "New link from lid %u, port %u to lid %u - " + "found in cache\n", + lid_ho, port_num, remote_lid_ho); + + /* the new link was cached - clean it from the cache */ + + p_cache_sw->ports[port_num].remote_lid_ho = 0; + p_cache_sw->ports[port_num].is_leaf = FALSE; +Exit: + OSM_LOG_EXIT(p_cache->p_ucast_mgr->p_log); +} /* __cache_remove_port() */ + +/********************************************************************** + **********************************************************************/ + +static void +__cache_restore_ucast_info(osm_ucast_cache_t * p_cache, + cache_switch_t * p_cache_sw, + osm_switch_t * p_sw) +{ + if (!p_cache->valid) + return; + + /* when setting unicast info, the cached port + should have all the required info */ + CL_ASSERT(p_cache_sw->max_lid_ho && p_cache_sw->lft && + p_cache_sw->num_hops 
&& p_cache_sw->hops); + + p_sw->max_lid_ho = p_cache_sw->max_lid_ho; + + if (p_sw->lft_buf) + free(p_sw->lft_buf); + p_sw->lft_buf = p_cache_sw->lft; + p_cache_sw->lft = NULL; + + p_sw->num_hops = p_cache_sw->num_hops; + p_cache_sw->num_hops = 0; + if (p_sw->hops) + free(p_sw->hops); + p_sw->hops = p_cache_sw->hops; + p_cache_sw->hops = NULL; +} + +/********************************************************************** + **********************************************************************/ + +static void +__ucast_cache_dump(osm_ucast_cache_t * p_cache) +{ + cache_switch_t * p_sw; + unsigned i; + + OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); + + if (!osm_log_is_active(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE)) + goto Exit; + + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE, + "Dumping missing nodes/links as logged by unicast cache:\n"); + for (p_sw = (cache_switch_t *) cl_qmap_head(&p_cache->sw_tbl); + p_sw != (cache_switch_t *) cl_qmap_end(&p_cache->sw_tbl); + p_sw = (cache_switch_t *) cl_qmap_next(&p_sw->map_item)) { + + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE, + "\t Switch lid %u %s%s\n", + __cache_sw_get_base_lid_ho(p_sw), + (__cache_sw_is_leaf(p_sw))? "[leaf switch] " : "", + (p_sw->dropped)? "[whole switch missing]" : ""); + + for (i = 1; i < p_sw->num_ports; i++) + if (p_sw->ports[i].remote_lid_ho > 0) + OSM_LOG(p_cache->p_ucast_mgr->p_log, + OSM_LOG_VERBOSE, + "\t - port %u -> lid %u %s\n", + i, p_sw->ports[i].remote_lid_ho, + (p_sw->ports[i].is_leaf) ? 
+ "[remote node is leaf]" : ""); + } +Exit: + OSM_LOG_EXIT(p_cache->p_ucast_mgr->p_log); +} + +/********************************************************************** + **********************************************************************/ + +osm_ucast_cache_t * +osm_ucast_cache_construct(osm_ucast_mgr_t * const p_mgr) +{ + osm_ucast_cache_t * p_cache = (osm_ucast_cache_t *) + malloc(sizeof(osm_ucast_cache_t)); + if (!p_cache) + return NULL; + + memset(p_cache, 0, sizeof(osm_ucast_cache_t)); + cl_qmap_init(&p_cache->sw_tbl); + p_cache->p_ucast_mgr = p_mgr; + return p_cache; +} + +/********************************************************************** + **********************************************************************/ + +void +osm_ucast_cache_destroy(osm_ucast_cache_t * p_cache) +{ + if (!p_cache) + return; + osm_ucast_cache_invalidate(p_cache); + free(p_cache); +} + +/********************************************************************** + **********************************************************************/ + +void +osm_ucast_cache_mark_valid(osm_ucast_cache_t * p_cache) +{ + CL_ASSERT(p_cache && p_cache->p_ucast_mgr); + p_cache->valid = TRUE; +} + +/********************************************************************** + **********************************************************************/ + +boolean_t +osm_ucast_cache_is_valid(osm_ucast_cache_t * p_cache) +{ + CL_ASSERT(p_cache && p_cache->p_ucast_mgr); + return p_cache->valid; +} + +/********************************************************************** + **********************************************************************/ + +void +osm_ucast_cache_invalidate(osm_ucast_cache_t * p_cache) +{ + cache_switch_t * p_sw; + cache_switch_t * p_next_sw; + + CL_ASSERT(p_cache); + + OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE, + "Invalidating unicast cache\n"); + + if (!p_cache->valid) + goto Exit; + + p_cache->valid = FALSE; + + p_next_sw = 
(cache_switch_t *) cl_qmap_head(&p_cache->sw_tbl); + while (p_next_sw != (cache_switch_t *) cl_qmap_end(&p_cache->sw_tbl)) { + p_sw = p_next_sw; + p_next_sw = (cache_switch_t *) cl_qmap_next(&p_sw->map_item); + __cache_sw_destroy(p_sw); + } + cl_qmap_remove_all(&p_cache->sw_tbl); +Exit: + OSM_LOG_EXIT(p_cache->p_ucast_mgr->p_log); +} + +/********************************************************************** + **********************************************************************/ + +void +osm_ucast_cache_validate(osm_ucast_cache_t * p_cache) +{ + cache_switch_t * p_cache_sw; + cache_switch_t * p_remote_cache_sw; + unsigned port_num; + unsigned max_ports; + uint8_t remote_node_type; + uint16_t lid_ho; + uint16_t remote_lid_ho; + osm_switch_t * p_sw; + osm_switch_t * p_remote_sw; + osm_node_t * p_node; + osm_physp_t * p_physp; + osm_physp_t * p_remote_physp; + osm_port_t * p_remote_port; + cl_qmap_t * p_node_guid_tbl; + + OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); + if (!p_cache->valid) + goto Exit; + + /* + * Scan all the physical switch ports in the subnet. + * If the port need_update flag is on, check whether + * it's just some node/port reset or a cached topology + * change. Otherwise the cache is invalid. 
+ */ + p_node_guid_tbl = &p_cache->p_ucast_mgr->p_subn->node_guid_tbl; + for (p_node = (osm_node_t *) cl_qmap_head(p_node_guid_tbl); + p_node != (osm_node_t *) cl_qmap_end(p_node_guid_tbl); + p_node = (osm_node_t *) cl_qmap_next(&p_node->map_item)) { + + if (osm_node_get_type(p_node) != IB_NODE_TYPE_SWITCH) + continue; + + lid_ho = cl_ntoh16(osm_node_get_base_lid(p_node,0)); + p_cache_sw = __cache_get_sw(p_cache, lid_ho); + + p_sw = p_node->sw; + max_ports = osm_node_get_num_physp(p_node); + + /* skip port 0 */ + for (port_num = 1; port_num < max_ports; port_num++) { + + p_physp = osm_node_get_physp_ptr(p_node, port_num); + + if (!p_physp || !p_physp->p_remote_physp || + !osm_physp_link_exists(p_physp, p_physp->p_remote_physp)) + /* no valid link */ + continue; + + /* + * While scanning all the physical ports in the subnet, + * mark corresponding leaf switches in the cache. + */ + if (p_cache_sw && + !p_cache_sw->dropped && + !__cache_sw_is_leaf(p_cache_sw) && + p_physp->p_remote_physp->p_node && + osm_node_get_type( + p_physp->p_remote_physp->p_node) != + IB_NODE_TYPE_SWITCH) + __cache_sw_set_leaf(p_cache_sw); + + if (!p_physp->need_update) + continue; + + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE, + "Checking switch lid %u, port %u\n", + lid_ho, port_num); + + p_remote_physp = osm_physp_get_remote(p_physp); + remote_node_type = osm_node_get_type(p_remote_physp->p_node); + + if (remote_node_type == IB_NODE_TYPE_SWITCH) + remote_lid_ho = cl_ntoh16(osm_node_get_base_lid( + p_remote_physp->p_node, 0)); + else + remote_lid_ho = cl_ntoh16(osm_node_get_base_lid( + p_remote_physp->p_node, + osm_physp_get_port_num(p_remote_physp))); + + if (!p_cache_sw || + port_num >= p_cache_sw->num_ports || + !p_cache_sw->ports[port_num].remote_lid_ho) { + /* + * There is some uncached change on the port. 
+ * In general, the reasons might be as follows: + * - switch reset + * - port reset (or port down/up) + * - quick connection location change + * - new link (or new switch) + * + * First two reasons allow cache usage, while + * the last two reasons should invalidate cache. + * + * In case of quick connection location change, + * cache would have been invalidated by + * osm_ucast_cache_check_new_link() function. + * + * In case of new link between two known nodes, + * cache also would have been invalidated by + * osm_ucast_cache_check_new_link() function. + * + * Another reason is cached link between two + * known switches went back. In this case the + * osm_ucast_cache_check_new_link() function would + * clear both sides of the link from the cache + * during the discovery process, so effectively + * this would be equivalent to port reset. + * + * So three possible reasons remain: + * - switch reset + * - port reset (or port down/up) + * - link of a new switch + * + * To validate cache, we need to check only the + * third reason - link of a new node/switch: + * - If this is the local switch that is new, + * then it should have (p_sw->need_update == 2). + * - If the remote node is switch and it's new, + * then it also should have + * (p_sw->need_update == 2). + * - If the remote node is CA/RTR and it's new, + * then its port should have is_new flag on. 
+ */ + if (p_sw->need_update == 2) { + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, + "New switch found (lid %u) - " + "cache is invalid\n", + lid_ho); + osm_ucast_cache_invalidate(p_cache); + goto Exit; + } + + if (remote_node_type == IB_NODE_TYPE_SWITCH) { + + p_remote_sw = p_remote_physp->p_node->sw; + if (p_remote_sw->need_update == 2) { + /* this could also be case of + switch coming back with an + additional link that it + didn't have before */ + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, + "New switch/link found (lid %u) - " + "cache is invalid\n", + remote_lid_ho); + osm_ucast_cache_invalidate(p_cache); + goto Exit; + } + } + else { + /* + * Remote node is CA/RTR. + * Get p_port of the remote node and + * check its p_port->is_new flag. + */ + p_remote_port = osm_get_port_by_guid( + p_cache->p_ucast_mgr->p_subn, + osm_physp_get_port_guid(p_remote_physp)); + if (p_remote_port->is_new) { + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, + "New CA/RTR found (lid %u) - " + "cache is invalid\n", + remote_lid_ho); + osm_ucast_cache_invalidate(p_cache); + goto Exit; + } + } + } + else { + /* + * The change on the port is cached. + * In general, the reasons might be as follows: + * - link between two known nodes went back + * - one or more nodes went back, causing all + * the links to reappear + * + * If it was link that went back, then this case + * would have been taken care of during the + * discovery by osm_ucast_cache_check_new_link(), + * so it's some node that went back. 
+				 */
+				if ((p_cache_sw->ports[port_num].is_leaf &&
+				     remote_node_type == IB_NODE_TYPE_SWITCH) ||
+				    (!p_cache_sw->ports[port_num].is_leaf &&
+				     remote_node_type != IB_NODE_TYPE_SWITCH)) {
+					OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO,
+						"Remote node type change on switch lid %u, port %u - "
+						"cache is invalid\n",
+						lid_ho, port_num);
+					osm_ucast_cache_invalidate(p_cache);
+					goto Exit;
+				}
+
+				if (p_cache_sw->ports[port_num].remote_lid_ho !=
+				    remote_lid_ho) {
+					OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO,
+						"Remote lid change on switch lid %u, port %u "
+						"(was %u, now %u) - cache is invalid\n",
+						lid_ho, port_num,
+						p_cache_sw->ports[port_num].remote_lid_ho,
+						remote_lid_ho);
+					osm_ucast_cache_invalidate(p_cache);
+					goto Exit;
+				}
+
+				/*
+				 * We don't care who is the node that has
+				 * reappeared in the subnet (local or remote).
+				 * What's important is that the cached link matches
+				 * the real fabric link.
+				 * Just clean it from cache.
+				 */
+
+				p_cache_sw->ports[port_num].remote_lid_ho = 0;
+				p_cache_sw->ports[port_num].is_leaf = FALSE;
+				if (p_cache_sw->dropped) {
+					__cache_restore_ucast_info(
+						p_cache, p_cache_sw, p_sw);
+					p_cache_sw->dropped = FALSE;
+				}
+
+				OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE,
+					"Restored link from cache: lid %u, port %u to lid %u\n",
+					lid_ho, port_num, remote_lid_ho);
+			}
+		}
+	}
+
+	/* Remove all the cached switches that
+	   have all their ports restored */
+	__cache_cleanup_switches(p_cache);
+
+	/*
+	 * Done scanning all the physical switch ports in the subnet.
+	 * Now we need to check the other side:
+	 * Scan all the cached switches and their ports:
+	 *  - If the cached switch is missing in the subnet
+	 *    (dropped flag is on), check that it's a leaf switch.
+	 *    If it's not a leaf, the cache is invalid, because
+	 *    cache can tolerate only leaf switch removal.
+	 *  - If the cached switch exists in fabric, check all
+	 *    its cached ports. These cached ports represent
+	 *    missing links in the fabric.
+ * The missing links that can be tolerated are: + * + link to missing CA/RTR + * + link to missing leaf switch + */ + for (p_cache_sw = (cache_switch_t *) cl_qmap_head(&p_cache->sw_tbl); + p_cache_sw != (cache_switch_t *) cl_qmap_end(&p_cache->sw_tbl); + p_cache_sw = (cache_switch_t *) cl_qmap_next(&p_cache_sw->map_item)) { + + if (p_cache_sw->dropped) { + if (!__cache_sw_is_leaf(p_cache_sw)){ + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, + "Missing non-leaf switch (lid %u) - " + "cache is invalid\n", + __cache_sw_get_base_lid_ho(p_cache_sw)); + osm_ucast_cache_invalidate(p_cache); + goto Exit; + } + + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE, + "Missing leaf switch (lid %u) - " + "continuing validation\n", + __cache_sw_get_base_lid_ho(p_cache_sw)); + continue; + } + + for (port_num = 1; port_num < p_cache_sw->num_ports; port_num++) { + if (!p_cache_sw->ports[port_num].remote_lid_ho) + continue; + + if (p_cache_sw->ports[port_num].is_leaf){ + CL_ASSERT(__cache_sw_is_leaf(p_cache_sw)); + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE, + "Switch lid %u, port %u: missing link to CA/RTR - " + "continuing validation\n", + __cache_sw_get_base_lid_ho(p_cache_sw), port_num); + continue; + } + + p_remote_cache_sw = __cache_get_sw(p_cache, + p_cache_sw->ports[port_num].remote_lid_ho); + + if (!p_remote_cache_sw || !p_remote_cache_sw->dropped) { + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, + "Switch lid %u, port %u: missing link to existing switch - " + "cache is invalid\n", + __cache_sw_get_base_lid_ho(p_cache_sw), port_num); + osm_ucast_cache_invalidate(p_cache); + goto Exit; + } + + if (!__cache_sw_is_leaf(p_remote_cache_sw)) { + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, + "Switch lid %u, port %u: missing link to non-leaf switch - " + "cache is invalid\n", + __cache_sw_get_base_lid_ho(p_cache_sw), port_num); + osm_ucast_cache_invalidate(p_cache); + goto Exit; + } + + /* + * At this point we know that the missing link is to + * a 
leaf switch. However, one case deserves a special + * treatment. If there was a link between two leaf + * switches, then missing leaf switch might break + * routing. It is possible that there are routes + * that use leaf switches to get from switch to switch + * and not just to get to the CAs behind the leaf switch. + */ + if (__cache_sw_is_leaf(p_cache_sw) && + __cache_sw_is_leaf(p_remote_cache_sw)) { + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, + "Switch lid %u, port %u: missing leaf-2-leaf link - " + "cache is invalid\n", + __cache_sw_get_base_lid_ho(p_cache_sw), port_num); + osm_ucast_cache_invalidate(p_cache); + goto Exit; + } + + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE, + "Switch lid %u, port %u: missing remote leaf switch - " + "continuing validation\n", + __cache_sw_get_base_lid_ho(p_cache_sw), port_num); + } + } + + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE, + "Unicast cache is valid\n"); + __ucast_cache_dump(p_cache); +Exit: + OSM_LOG_EXIT(p_cache->p_ucast_mgr->p_log); +} /* osm_ucast_cache_validate() */ + +/********************************************************************** + **********************************************************************/ + +void +osm_ucast_cache_check_new_link(osm_ucast_cache_t * p_cache, + osm_node_t * p_node_1, + uint8_t port_num_1, + osm_node_t * p_node_2, + uint8_t port_num_2) +{ + uint16_t lid_ho_1; + uint16_t lid_ho_2; + + OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); + + if (!p_cache->valid) + goto Exit; + + __cache_check_link_change(p_cache, + osm_node_get_physp_ptr(p_node_1, port_num_1), + osm_node_get_physp_ptr(p_node_2, port_num_2)); + + if (!p_cache->valid) + goto Exit; + + if (osm_node_get_type(p_node_1) != IB_NODE_TYPE_SWITCH && + osm_node_get_type(p_node_2) != IB_NODE_TYPE_SWITCH) { + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, + "Found CA/RTR-2-CA/RTR link - cache is invalid\n"); + osm_ucast_cache_invalidate(p_cache); + goto Exit; + } + + /* for code simplicity, we want the 
first node to be switch */ + if (osm_node_get_type(p_node_1) != IB_NODE_TYPE_SWITCH) { + osm_node_t * tmp_node = p_node_1; + uint8_t tmp_port_num = port_num_1; + p_node_1 = p_node_2; + port_num_1 = port_num_2; + p_node_2 = tmp_node; + port_num_2 = tmp_port_num; + } + + lid_ho_1 = cl_ntoh16(osm_node_get_base_lid(p_node_1, 0)); + + if (osm_node_get_type(p_node_2) == IB_NODE_TYPE_SWITCH) + lid_ho_2 = cl_ntoh16( + osm_node_get_base_lid(p_node_2, 0)); + else + lid_ho_2 = cl_ntoh16( + osm_node_get_base_lid(p_node_2, port_num_2)); + + if (!lid_ho_1 || !lid_ho_2) { + /* + * No lid assigned, which means that one of the nodes is new. + * Need to wait for lid manager to process this node. + * The switches and their links will be checked later when + * the whole cache validity will be verified. + */ + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE, + "Link port %u <-> %u reveals new node - cache will " + "be validated later\n", + port_num_1, port_num_2); + goto Exit; + } + + __cache_remove_port(p_cache, lid_ho_1, port_num_1, lid_ho_2, + (osm_node_get_type(p_node_2) != IB_NODE_TYPE_SWITCH)); + + /* if node_2 is a switch, the link should be cleaned from its cache */ + + if (osm_node_get_type(p_node_2) == IB_NODE_TYPE_SWITCH) + __cache_remove_port(p_cache, lid_ho_2, + port_num_2, lid_ho_1, FALSE); + +Exit: + OSM_LOG_EXIT(p_cache->p_ucast_mgr->p_log); +} /* osm_ucast_cache_check_new_link() */ + +/********************************************************************** + **********************************************************************/ + +void +osm_ucast_cache_add_link(osm_ucast_cache_t * p_cache, + osm_node_t * p_node_1, + uint8_t port_num_1, + osm_node_t * p_node_2, + uint8_t port_num_2) +{ + uint16_t lid_ho_1; + uint16_t lid_ho_2; + + OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); + + if (!p_cache->valid) + goto Exit; + + if (osm_node_get_type(p_node_1) != IB_NODE_TYPE_SWITCH && + osm_node_get_type(p_node_2) != IB_NODE_TYPE_SWITCH) { + OSM_LOG(p_cache->p_ucast_mgr->p_log, 
OSM_LOG_INFO,
+			"Dropping CA-2-CA link - cache invalid\n");
+		osm_ucast_cache_invalidate(p_cache);
+		goto Exit;
+	}
+
+	if (((osm_node_get_type(p_node_1) == IB_NODE_TYPE_SWITCH) &&
+	     (!osm_node_get_physp_ptr(p_node_1, 0) ||
+	      !osm_physp_is_valid(osm_node_get_physp_ptr(p_node_1, 0)))) ||
+	    ((osm_node_get_type(p_node_2) == IB_NODE_TYPE_SWITCH) &&
+	     (!osm_node_get_physp_ptr(p_node_2, 0) ||
+	      !osm_physp_is_valid(osm_node_get_physp_ptr(p_node_2, 0))))) {
+		/* we're caching a link when one of the nodes
+		   has already been dropped and cached */
+		OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE,
+			"Port %u <-> port %u: port0 on one of the nodes "
+			"has already been dropped and cached\n",
+			port_num_1, port_num_2);
+		goto Exit;
+	}
+
+	/* One of the nodes is switch. Just for code
+	   simplicity, make sure that it's the first node. */
+
+	if (osm_node_get_type(p_node_1) != IB_NODE_TYPE_SWITCH) {
+		osm_node_t * tmp_node = p_node_1;
+		uint8_t tmp_port_num = port_num_1;
+		p_node_1 = p_node_2;
+		port_num_1 = port_num_2;
+		p_node_2 = tmp_node;
+		port_num_2 = tmp_port_num;
+	}
+
+	if (!p_node_1->sw) {
+		/* something is wrong - we'd better not use cache */
+		osm_ucast_cache_invalidate(p_cache);
+		goto Exit;
+	}
+
+	lid_ho_1 = cl_ntoh16(osm_node_get_base_lid(p_node_1,0));
+
+	if (osm_node_get_type(p_node_2) == IB_NODE_TYPE_SWITCH) {
+
+		if (!p_node_2->sw) {
+			/* something is wrong - we'd better not use cache */
+			osm_ucast_cache_invalidate(p_cache);
+			goto Exit;
+		}
+
+		lid_ho_2 = cl_ntoh16(osm_node_get_base_lid(p_node_2,0));
+
+		/* lost switch-2-switch link - cache both sides */
+		__cache_add_port(p_cache, lid_ho_1, port_num_1,
+				 lid_ho_2, FALSE);
+		__cache_add_port(p_cache, lid_ho_2, port_num_2,
+				 lid_ho_1, FALSE);
+	}
+	else {
+		lid_ho_2 = cl_ntoh16(
+			osm_node_get_base_lid(p_node_2, port_num_2));
+
+		/* lost link to CA/RTR - cache only switch side */
+		__cache_add_port(p_cache, lid_ho_1, port_num_1,
+				 lid_ho_2, TRUE);
+	}
+
+Exit:
+	OSM_LOG_EXIT(p_cache->p_ucast_mgr->p_log);
+} /* osm_ucast_cache_add_link() */ + +/********************************************************************** + **********************************************************************/ + +void +osm_ucast_cache_add_node(osm_ucast_cache_t * p_cache, + osm_node_t * p_node) +{ + uint16_t lid_ho; + uint8_t max_ports; + uint8_t port_num; + osm_physp_t * p_physp; + osm_node_t * p_remote_node; + cache_switch_t * p_cache_sw; + + OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); + + if (!p_cache->valid) + goto Exit; + + if (osm_node_get_type(p_node) == IB_NODE_TYPE_SWITCH) { + + lid_ho = cl_ntoh16(osm_node_get_base_lid(p_node,0)); + + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE, + "Caching dropped switch lid %u\n", lid_ho); + + if (!p_node->sw) { + /* something is wrong - forget about cache */ + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_ERROR, + "ERR AD02: no switch info for node lid %u -" + " clearing cache\n", lid_ho); + osm_ucast_cache_invalidate(p_cache); + goto Exit; + } + + /* unlink (add to cache) all the ports of this switch */ + max_ports = osm_node_get_num_physp(p_node); + for (port_num = 1; port_num < max_ports; port_num++) { + + p_physp = osm_node_get_physp_ptr(p_node, port_num); + if (!p_physp || !p_physp->p_node || + !p_physp->p_remote_physp || + !p_physp->p_remote_physp->p_node) + continue; + + osm_ucast_cache_add_link(p_cache, p_node, port_num, + p_physp->p_remote_physp->p_node, + p_physp->p_remote_physp->port_num); + } + + /* + * All the ports have been dropped (cached). + * If one of the ports was connected to CA/RTR, + * then the cached switch would be marked as leaf. + * If it isn't, then the dropped switch isn't a leaf, + * and cache can't handle it. 
+ */ + + p_cache_sw = __cache_get_sw(p_cache, lid_ho); + CL_ASSERT(p_cache_sw); + + if (!__cache_sw_is_leaf(p_cache_sw)) { + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, + "Dropped non-leaf switch (lid %u) - " + "cache is invalid\n", lid_ho); + osm_ucast_cache_invalidate(p_cache); + goto Exit; + } + + p_cache_sw->dropped = TRUE; + + if (!p_node->sw->num_hops || !p_node->sw->hops) { + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, + "No LID matrices for switch lid %u - " + "cache is invalid\n", lid_ho); + osm_ucast_cache_invalidate(p_cache); + goto Exit; + } + + /* lid matrices */ + + p_cache_sw->num_hops = p_node->sw->num_hops; + p_node->sw->num_hops = 0; + p_cache_sw->hops = p_node->sw->hops; + p_node->sw->hops = NULL; + + /* linear forwarding table */ + + p_cache_sw->lft = p_node->sw->lft_buf; + p_node->sw->lft_buf = NULL; + p_cache_sw->max_lid_ho = p_node->sw->max_lid_ho; + } + else { + /* dropping CA/RTR: add to cache all the ports of this switch */ + max_ports = osm_node_get_num_physp(p_node); + for (port_num = 0; port_num < max_ports; port_num++) { + + p_physp = osm_node_get_physp_ptr(p_node, port_num); + if (!p_physp || !p_physp->p_node || + !p_physp->p_remote_physp || + !p_physp->p_remote_physp->p_node) + continue; + + p_remote_node = p_physp->p_remote_physp->p_node; + if (osm_node_get_type(p_remote_node) != + IB_NODE_TYPE_SWITCH) { + /* CA/RTR to CA/RTR connection */ + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, + "Dropping CA/RTR to CA/RTR connection - " + "cache is invalid\n"); + osm_ucast_cache_invalidate(p_cache); + goto Exit; + } + + osm_ucast_cache_add_link(p_cache, p_remote_node, + p_physp->p_remote_physp->port_num, + p_node, port_num); + } + } +Exit: + OSM_LOG_EXIT(p_cache->p_ucast_mgr->p_log); +} /* osm_ucast_cache_add_node() */ + +/********************************************************************** + **********************************************************************/ + +void +osm_ucast_cache_apply(osm_ucast_cache_t * 
p_cache)
+{
+	osm_subn_t * p_subn = p_cache->p_ucast_mgr->p_subn;
+	osm_switch_t *p_sw;
+
+	OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log);
+	OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO,
+		"Applying unicast cache\n");
+	CL_ASSERT(p_cache && p_cache->p_ucast_mgr && p_cache->valid);
+
+	for (p_sw = (osm_switch_t *) cl_qmap_head(&p_subn->sw_guid_tbl);
+	     p_sw != (osm_switch_t *) cl_qmap_end(&p_subn->sw_guid_tbl);
+	     p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item))
+		osm_ucast_mgr_set_fwd_table(p_cache->p_ucast_mgr, p_sw);
+
+	OSM_LOG_EXIT(p_cache->p_ucast_mgr->p_log);
+}
+
+/**********************************************************************
+ **********************************************************************/
-- 
1.5.1.4

From kliteyn at dev.mellanox.co.il Sun Oct 5 18:29:12 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 06 Oct 2008 03:29:12 +0200
Subject: [ofa-general] [PATCH 4/6] opensm/Unicast Routing Cache: compile cache files
Message-ID: <48E969E8.7020108@dev.mellanox.co.il>

Adding cache files to makefile.

Signed-off-by: Yevgeny Kliteynik
---
 opensm/opensm/Makefile.am |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/Makefile.am b/opensm/opensm/Makefile.am
index e95a482..02c963b 100644
--- a/opensm/opensm/Makefile.am
+++ b/opensm/opensm/Makefile.am
@@ -54,7 +54,7 @@ opensm_SOURCES = main.c osm_console_io.c osm_console.c osm_db_files.c \
 		 osm_ucast_lash.c osm_ucast_file.c osm_ucast_ftree.c \
 		 osm_vl15intf.c osm_vl_arb_rcv.c \
 		 st.c osm_perfmgr.c osm_perfmgr_db.c \
-		 osm_event_plugin.c osm_dump.c \
+		 osm_event_plugin.c osm_dump.c osm_ucast_cache.c \
 		 osm_qos_parser_y.y osm_qos_parser_l.l osm_qos_policy.c
 
 AM_YFLAGS:= -d
@@ -115,6 +115,7 @@ opensminclude_HEADERS = \
 	$(srcdir)/../include/opensm/osm_subnet.h \
 	$(srcdir)/../include/opensm/osm_switch.h \
 	$(srcdir)/../include/opensm/osm_ucast_mgr.h \
+	$(srcdir)/../include/opensm/osm_ucast_cache.h \
 	$(srcdir)/../include/opensm/osm_vl15intf.h \
 	$(top_builddir)/include/opensm/osm_version.h
-- 
1.5.1.4

From kliteyn at dev.mellanox.co.il Sun Oct 5 18:29:34 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 06 Oct 2008 03:29:34 +0200
Subject: [ofa-general] [PATCH 5/6] opensm/Unicast Routing Cache: integrate cache into opensm
Message-ID: <48E969FE.7050507@dev.mellanox.co.il>

Integrating unicast cache into the discovery and ucast manager.

Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_ucast_mgr.h | 6 ++++ opensm/opensm/osm_drop_mgr.c | 13 ++++++++- opensm/opensm/osm_node_info_rcv.c | 9 +++++- opensm/opensm/osm_port_info_rcv.c | 9 +++++- opensm/opensm/osm_state_mgr.c | 10 ++++++- opensm/opensm/osm_ucast_mgr.c | 50 ++++++++++++++++++++++---------- 6 files changed, 77 insertions(+), 20 deletions(-) diff --git a/opensm/include/opensm/osm_ucast_mgr.h b/opensm/include/opensm/osm_ucast_mgr.h index 27e89e9..e4006bb 100644 --- a/opensm/include/opensm/osm_ucast_mgr.h +++ b/opensm/include/opensm/osm_ucast_mgr.h @@ -49,6 +49,7 @@ #include #include #include +#include #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { @@ -77,6 +78,7 @@ BEGIN_C_DECLS * *********/ struct osm_sm; +struct _osm_ucast_cache; /****s* OpenSM: Unicast Manager/osm_ucast_mgr_t * NAME * osm_ucast_mgr_t @@ -97,6 +99,7 @@ typedef struct osm_ucast_mgr { cl_qlist_t port_order_list; boolean_t is_dor; boolean_t some_hop_count_set; + struct _osm_ucast_cache *p_cache; } osm_ucast_mgr_t; /* * FIELDS @@ -128,6 +131,9 @@ typedef struct osm_ucast_mgr { * tables calculation iteration cycle, set to TRUE to indicate * that some hop count changes were done. * +* p_cache +* Pointer to the Unicast Cache object. +* * SEE ALSO * Unicast Manager object *********/ diff --git a/opensm/opensm/osm_drop_mgr.c b/opensm/opensm/osm_drop_mgr.c index e827c26..e8d3454 100644 --- a/opensm/opensm/osm_drop_mgr.c +++ b/opensm/opensm/osm_drop_mgr.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. 
* @@ -61,6 +61,7 @@ #include #include #include +#include /********************************************************************** **********************************************************************/ @@ -134,6 +135,13 @@ static void drop_mgr_clean_physp(osm_sm_t * sm, IN osm_physp_t * p_physp) (p_remote_physp->p_node)), p_remote_physp->port_num); + if (sm->p_subn->opt.use_ucast_cache) + osm_ucast_cache_add_link(sm->ucast_mgr.p_cache, + p_physp->p_node, + p_physp->port_num, + p_remote_physp->p_node, + p_remote_physp->port_num); + osm_physp_unlink(p_physp, p_remote_physp); } @@ -307,6 +315,9 @@ __osm_drop_mgr_process_node(osm_sm_t * sm, IN osm_node_t * p_node) "Unreachable node 0x%016" PRIx64 "\n", cl_ntoh64(osm_node_get_node_guid(p_node))); + if (sm->p_subn->opt.use_ucast_cache) + osm_ucast_cache_add_node(sm->ucast_mgr.p_cache, p_node); + /* Delete all the logical and physical port objects associated with this node. diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c index 86710d1..27b842f 100644 --- a/opensm/opensm/osm_node_info_rcv.c +++ b/opensm/opensm/osm_node_info_rcv.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* * This software is available to you under a choice of one of two @@ -59,6 +59,7 @@ #include #include #include +#include static void report_duplicated_guid(IN osm_sm_t * sm, @@ -240,6 +241,12 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm, cl_ntoh64(osm_node_get_node_guid(p_node)), port_num, cl_ntoh64(p_ni_context->node_guid), p_ni_context->port_num); + if (sm->p_subn->opt.use_ucast_cache) + osm_ucast_cache_check_new_link(sm->ucast_mgr.p_cache, + p_node, port_num, + p_neighbor_node, + p_ni_context->port_num); + osm_node_link(p_node, port_num, p_neighbor_node, p_ni_context->port_num); diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c index a820069..cac3c05 100644 --- a/opensm/opensm/osm_port_info_rcv.c +++ b/opensm/opensm/osm_port_info_rcv.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -60,6 +60,7 @@ #include #include #include +#include /********************************************************************** **********************************************************************/ @@ -244,6 +245,12 @@ __osm_pi_rcv_process_switch_port(IN osm_sm_t * sm, (p_remote_node)), remote_port_num); + if (sm->p_subn->opt.use_ucast_cache) + osm_ucast_cache_add_link(sm->ucast_mgr.p_cache, + p_node, port_num, + p_remote_node, + remote_port_num); + osm_node_unlink(p_node, (uint8_t) port_num, p_remote_node, (uint8_t) remote_port_num); diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index b4eb87b..88d8bf9 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. 
- * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -1075,6 +1075,10 @@ static void do_sweep(osm_sm_t * sm) /* Re-program the switches fully */ sm->p_subn->ignore_existing_lfts = TRUE; + /* we want to re-route, so cache should be invalidated */ + if (sm->p_subn->opt.use_ucast_cache) + osm_ucast_cache_invalidate(sm->ucast_mgr.p_cache); + osm_ucast_mgr_process(&sm->ucast_mgr); /* Reset flag */ @@ -1229,6 +1233,10 @@ _repeat_discovery: /* * Proceed with unicast forwarding table configuration. */ + + if (sm->p_subn->opt.use_ucast_cache) + osm_ucast_cache_validate(sm->ucast_mgr.p_cache); + osm_ucast_mgr_process(&sm->ucast_mgr); if (wait_for_pending_transactions(&sm->p_subn->p_osm->stats)) return; diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index 12a8b58..97b4fb9 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -73,6 +73,8 @@ void osm_ucast_mgr_destroy(IN osm_ucast_mgr_t * const p_mgr) CL_ASSERT(p_mgr); OSM_LOG_ENTER(p_mgr->p_log); + if (p_mgr->p_cache) + osm_ucast_cache_destroy(p_mgr->p_cache); OSM_LOG_EXIT(p_mgr->p_log); } @@ -92,6 +94,12 @@ osm_ucast_mgr_init(IN osm_ucast_mgr_t * const p_mgr, IN osm_sm_t * sm) p_mgr->p_subn = sm->p_subn; p_mgr->p_lock = sm->p_lock; + if (sm->p_subn->opt.use_ucast_cache){ + p_mgr->p_cache = osm_ucast_cache_construct(p_mgr); + if (!p_mgr->p_cache) + status = IB_INSUFFICIENT_MEMORY; + } + OSM_LOG_EXIT(p_mgr->p_log); return (status); } @@ -818,27 +826,37 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * const p_mgr) /* If there are no switches in the subnet, we are done. 
*/ - if (cl_qmap_count(p_sw_guid_tbl) == 0 || - ucast_mgr_setup_all_switches(p_mgr->p_subn) < 0) + if (cl_qmap_count(p_sw_guid_tbl) == 0) goto Exit; p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_NONE; - while (p_routing_eng) { - if (!ucast_mgr_route(p_routing_eng, p_osm)) - break; - p_routing_eng = p_routing_eng->next; - } + if (p_mgr->p_subn->opt.use_ucast_cache && + osm_ucast_cache_is_valid(p_mgr->p_cache)) { + OSM_LOG(p_mgr->p_log, OSM_LOG_INFO, + "Configuring switch tables using cached routing\n"); + osm_ucast_cache_apply(p_mgr->p_cache); - if (p_osm->routing_engine_used == OSM_ROUTING_ENGINE_TYPE_NONE) { - /* If configured routing algorithm failed, use default MinHop */ - osm_ucast_mgr_build_lid_matrices(p_mgr); - ucast_mgr_build_lfts(p_mgr); - p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_MINHOP; - } + } else { + + if (ucast_mgr_setup_all_switches(p_mgr->p_subn) < 0) + goto Exit; - OSM_LOG(p_mgr->p_log, OSM_LOG_INFO, - "%s tables configured on all switches\n", - osm_routing_engine_type_str(p_osm->routing_engine_used)); + while (p_routing_eng) { + if (!ucast_mgr_route(p_routing_eng, p_osm)) + break; + p_routing_eng = p_routing_eng->next; + } + + if (p_osm->routing_engine_used == OSM_ROUTING_ENGINE_TYPE_NONE) { + /* If configured routing algorithm failed, use default MinHop */ + osm_ucast_mgr_build_lid_matrices(p_mgr); + ucast_mgr_build_lfts(p_mgr); + p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_MINHOP; + } + + if (p_mgr->p_subn->opt.use_ucast_cache) + osm_ucast_cache_mark_valid(p_mgr->p_cache); + } Exit: CL_PLOCK_RELEASE(p_mgr->p_lock); -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Sun Oct 5 18:53:52 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 06 Oct 2008 03:53:52 +0200 Subject: [ofa-general] [PATCH 6/6] opensm/Unicast Routing Cache: manpage entry Message-ID: <48E96FB0.30206@dev.mellanox.co.il> Signed-off-by: Yevgeny Kliteynik --- opensm/man/opensm.8.in | 14 +++++++++++++- 1 files changed, 13 insertions(+), 1 
deletions(-) diff --git a/opensm/man/opensm.8.in b/opensm/man/opensm.8.in index c1ea584..efadf8e 100644 --- a/opensm/man/opensm.8.in +++ b/opensm/man/opensm.8.in @@ -10,7 +10,7 @@ opensm \- InfiniBand subnet manager and administration (SM/SA) [\-g(uid) ] [\-l(mc) ] [\-p(riority) ] [\-smkey ] [\-r(eassign_lids)] [\-R | \-\-routing_engine ] -[\-z | \-\-connect_roots] +[\-A | \-\-ucast_cache] [\-z | \-\-connect_roots] [\-M | \-\-lid_matrix_file ] [\-U | \-\-lfts_file ] [\-S | \-\-sadb_file ] [\-a | \-\-root_guid_file ] @@ -122,6 +122,18 @@ separated by commas so that specific ordering of routing algorithms will be tried if earlier routing engines fail. Supported engines: minhop, updn, file, ftree, lash, dor .TP +\fB\-A\fR, \fB\-\-ucast_cache\fR +This option enables unicast routing cache and prevents routing +recalculation (which is a heavy task in a large cluster) when +there was no topology change detected during the heavy sweep, or +when the topology change does not require new routing calculation, +e.g. when one or more CAs/RTRs/leaf switches going down, or one or +more of these nodes coming back after being down. +A very common case that is handled by the unicast routing cache +is host reboot, which otherwise would cause two full routing +recalculations: one when the host goes down, and the other when +the host comes back online. +.TP \fB\-z\fR, \fB\-\-connect_roots\fR This option enforces a routing engine (currently up/down only) to make connectivity between root switches and in -- 1.5.1.4 From ogerlitz at voltaire.com Mon Oct 6 02:08:45 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Mon, 06 Oct 2008 11:08:45 +0200 Subject: [ofa-general] rdma_resolve_route() returning -EINVAL In-Reply-To: References: Message-ID: <48E9D59D.30104@voltaire.com> Roland Dreier wrote: > The issue is more of spec compliance than a likely real-life scenario... 
> and as for why no one else is worrying about it, I think it's because
> the only other user of rdma_connect() in the tree is iSER, and I guess
> no one worried too much there. SRP uses the IB CM directly, and waits
> for timewait exit before calling a connection closed.

Roland,

I guess there's some tradeoff here between the time connection recovery
would take when the ULPs do wait for the timewait event vs. the risk of
the target considering the REQ stale and rejecting it. Does SRP just
wait for the event, or is the new connection established in parallel,
which would also mean it always uses a different QP number?

Or.

From vst at vlnb.net Mon Oct 6 03:20:49 2008
From: vst at vlnb.net (Vladislav Bolkhovitin)
Date: Mon, 06 Oct 2008 14:20:49 +0400
Subject: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <48E695F9.80703@harr.org>
References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org>
	<48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org>
	<48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org>
Message-ID: <48E9E681.8090600@vlnb.net>

Cameron Harr wrote:
> I was able to get the latest scst code working with Vu's standalone
> ib_srpt and the kernel IB modules, and dropped my ib_srpt thread count
> to 2. However, I still get about the same IOP performance on the target
> although interrupts on the "busy" cpu have gone up to around 140K.
> Interesting, but now I'm at a bit of a loss as to where the bottleneck
> could be. I figured it was Interrupts, but if the CPU is handling more
> right now, perhaps the problem is elsewhere?

How many context switches per second do you have during your test on the
target?

There was once a thread on the scst-devel mailing list about an
observation that the SRP target driver produces 10 context switches per
command.
See http://sourceforge.net/mailarchive/message.php?msg_id=e2e108260802070110q1fa084a1j54945d06c16c94f2%40mail.gmail.com If it is so in your case as well, it would very well explain your issue. 10 CS/cmd is a definite overkill, it should be 1 or, at max, 2 CS/cmd. BTW, I suppose you don't use the debug SCST build, do you? Vlad > Cameron > > Cameron Harr wrote: >> Cameron Harr wrote: >>> Additionally, I found that I can load the newer scst code if I use >>> the kernel-supplied modules and the standalone srpt-1.0.0 package >>> that I think you provide Vu. I was about to try it along with >>> dropping a module param for ib_srpt (I was using a thread count of 32 >>> that had given me better performance on an earlier test). I'll report >>> back on this. >> Not much luck using the newer scst code and default kernel modules >> (Running CentOS 5.2). If I try using the default kernel modules on the >> initiator, I can't get them to see anything (the ofed SM pkg doesn't >> see any devices to run on). 
When using the regular OFED on the >> initiator, my target dies when I try to attach to the target on the >> initiator: >> --------------------------------- >> ib_srpt: Host login i_port_id=0x0:0x2c90300026053 >> t_port_id=0x2c90300026046:0x2c90300026046 it_iu_len=996 >> Oct 3 13:44:23 test05 kernel: i[4127]: scst: >> scst_mgmt_thread:5187:***CRITICAL ERROR*** session ffff8107f3222b88 is >> in scst_sess_shut_list, but in unknown shut phase 0 >> BUG at /usr/src/scst.tot/src/scst_targ.c:5188 >> ----------- [cut here ] --------- [please bite here ] --------- >> Kernel BUG at /usr/src/scst.tot/src/scst_targ.c:5188 >> invalid opcode: 0000 [1] SMP >> last sysfs file: /devices/pci0000:00/0000:00:00.0/class >> CPU 2 >> Modules linked in: ib_srpt(U) ib_cm ib_sa scst_vdisk(U) scst(U) >> fio_driver(PU) fio_port(PU) mlx4_ib ib_mad ib_core ipv6 xfrm_nalgo >> crypto_api autofs4 hidp rfcomm l2cap bluetooth sunrpc nls_utf8 hfsplus >> dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button >> battery asus_acpi acpi_memhotplug ac parport_pc lp parport i2c_i801 >> i5000_edac i2c_core edac_mc pcspkr shpchp mlx4_core e1000e ata_piix >> libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd >> Pid: 4127, comm: scsi_tgt_mgmt Tainted: P 2.6.18-92.1.13.el5 #1 >> RIP: 0010:[] [] >> :scst:scst_mgmt_thread+0x3ff/0x577 >> --------------------------------- >> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From vst at vlnb.net Mon Oct 6 03:21:45 2008 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Mon, 06 Oct 2008 14:21:45 +0400 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48E67ACC.1020903@harr.org> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> 
	<48E67ACC.1020903@harr.org>
Message-ID: <48E9E6B9.9020303@vlnb.net>

Cameron Harr wrote:
> Cameron Harr wrote:
>> Additionally, I found that I can load the newer scst code if I use the
>> kernel-supplied modules and the standalone srpt-1.0.0 package that I
>> think you provide Vu. I was about to try it along with dropping a
>> module param for ib_srpt (I was using a thread count of 32 that had
>> given me better performance on an earlier test). I'll report back on
>> this.
>
> Not much luck using the newer scst code and default kernel modules
> (Running CentOS 5.2). If I try using the default kernel modules on the
> initiator, I can't get them to see anything (the ofed SM pkg doesn't see
> any devices to run on). When using the regular OFED on the initiator, my
> target dies when I try to attach to the target on the initiator:
> ---------------------------------
> ib_srpt: Host login i_port_id=0x0:0x2c90300026053
> t_port_id=0x2c90300026046:0x2c90300026046 it_iu_len=996
> Oct 3 13:44:23 test05 kernel: i[4127]: scst:
> scst_mgmt_thread:5187:***CRITICAL ERROR*** session ffff8107f3222b88 is
> in scst_sess_shut_list, but in unknown shut phase 0
> BUG at /usr/src/scst.tot/src/scst_targ.c:5188
> ----------- [cut here ] --------- [please bite here ] ---------
> Kernel BUG at /usr/src/scst.tot/src/scst_targ.c:5188

This can happen if the target driver frees some IO or TM command twice,
e.g. by calling scst_tgt_cmd_done() two times for the same command.

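To illustrate the double-completion case described above, here is a
minimal user-space sketch of a guard that lets only the first completion
of a command go through. It is not SCST code: the struct and the
functions (demo_cmd, demo_cmd_init, demo_cmd_done) are hypothetical names
made up for this example, the real scst_tgt_cmd_done() has a different
signature, and C11 atomics stand in for whatever the kernel code would
actually use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical command object; NOT the real SCST command structure. */
struct demo_cmd {
	atomic_bool done;	/* true once the command has been completed */
	int completions;	/* how many times completion actually ran */
};

static void demo_cmd_init(struct demo_cmd *cmd)
{
	atomic_init(&cmd->done, false);
	cmd->completions = 0;
}

/*
 * Complete a command at most once.
 * Returns true if this call performed the completion,
 * false if the command had already been completed.
 */
static bool demo_cmd_done(struct demo_cmd *cmd)
{
	/* atomic_exchange returns the previous value of the flag */
	if (atomic_exchange(&cmd->done, true)) {
		fprintf(stderr, "cmd %p: duplicate completion ignored\n",
			(void *)cmd);
		return false;
	}

	cmd->completions++;
	return true;
}
```

With such a guard a second completion of the same command is reported
and dropped instead of being processed again. Whether a driver should
tolerate the duplicate or treat it as a fatal bug is a separate design
choice.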
> invalid opcode: 0000 [1] SMP > last sysfs file: /devices/pci0000:00/0000:00:00.0/class > CPU 2 > Modules linked in: ib_srpt(U) ib_cm ib_sa scst_vdisk(U) scst(U) > fio_driver(PU) fio_port(PU) mlx4_ib ib_mad ib_core ipv6 xfrm_nalgo > crypto_api autofs4 hidp rfcomm l2cap bluetooth sunrpc nls_utf8 hfsplus > dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button battery > asus_acpi acpi_memhotplug ac parport_pc lp parport i2c_i801 i5000_edac > i2c_core edac_mc pcspkr shpchp mlx4_core e1000e ata_piix libata sd_mod > scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd > Pid: 4127, comm: scsi_tgt_mgmt Tainted: P 2.6.18-92.1.13.el5 #1 > RIP: 0010:[] [] > :scst:scst_mgmt_thread+0x3ff/0x577 > --------------------------------- > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From vlad at lists.openfabrics.org Mon Oct 6 03:58:32 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 6 Oct 2008 03:58:32 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20081006-0200 daily build status Message-ID: <20081006105832.5E0E7E608AF@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with 
linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: Build failed on ppc64 with linux-2.6.24 Log: /home/vlad/tmp/ofa_1_4_kernel-20081006-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c: In function 'ehca_poll_eqs': /home/vlad/tmp/ofa_1_4_kernel-20081006-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:942: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type /home/vlad/tmp/ofa_1_4_kernel-20081006-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:946: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081006-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081006-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca] 
Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081006-0200_linux-2.6.24_ppc64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081006-0200_linux-2.6.24_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.24' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From tziporet at dev.mellanox.co.il Mon Oct 6 05:24:37 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 06 Oct 2008 14:24:37 +0200 Subject: [ofa-general] [PATCH] SDP: fix initial recv buffer size In-Reply-To: <1223213682-30014-1-git-send-email-amirv@mellanox.co.il> References: <> <1223213682-30014-1-git-send-email-amirv@mellanox.co.il> Message-ID: <48EA0385.6010106@mellanox.co.il> Amir Vadai wrote: > Set initial recv buffer according to incoming hha header. > > Fixed bugzilla 1086: SDP Linux and SDP windows don't work togeather > Have you asked Vlad to pull this? Tziporet From hal.rosenstock at gmail.com Mon Oct 6 05:26:00 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 6 Oct 2008 08:26:00 -0400 Subject: [ofa-general] ***SPAM*** ibdm network topology format In-Reply-To: References: <829ded920809290139vf2cc151w4cc8a6fafb49edfe@mail.gmail.com> <829ded920809292304k3ffc78c0m556efbdd7d35c528@mail.gmail.com> <20080930121252.GA7396@sashak.voltaire.com> <829ded920810010207r475d82abu269d47cd3baddb3f@mail.gmail.com> <20081001203813.GL7396@sashak.voltaire.com> <20081002022430.GQ7396@sashak.voltaire.com> <20081002170033.GI25831@sashak.voltaire.com> Message-ID: Sasha, On Thu, Oct 2, 2008 at 6:22 PM, Hal Rosenstock wrote: > Sasha, > > On Thu, Oct 2, 2008 at 1:00 PM, Sasha Khapyorsky wrote: >> Hi Hal, >> >> On 10:18 Thu 02 Oct , Hal Rosenstock wrote: >>> > >>> > 2. ibis doesn't register class 0x81 - SM direct routed, only SM lid >>> > routed (0x1). 
In comment in ibutils/ibis/src/ibsm.c line 118 is stated: >>> > >>> > /* no need to bind the Directed Route class as it will automatically >>> > be handled by the osm_vendor_bind if asked for LID route */ >>> > >>> > As far as I can see in osm_vendor_bind() it is not (but it is in >>> > opposite order - when class 0x81 is registered class 0x1 will be >>> > registered too). >>> >>> Yes that is what osm_vendor_ibumad.c:osm_vendor_bind does. >>> >>> So either ibdiagnet needs to register 0x81 r.t.1 or >>> osm_vendor_ibumad.c:osm_vendor_bind needs to be "symmetric" in terms >>> of registering the other SM class when only one is requested. This is >>> a minor change in the underlying semantics. [Popping up a level in >>> terms of this, (other than applications taking advantage of this >>> "feature",) I'm not sure why the vendor layer should assume that just >>> because one SM class is requested, the other should be too]. I just >>> looked and the latter appears to be consistent with the other vendor >>> layers. I think either solution will work. Your solution below also >>> looks like it would work but don't that should be done in a sim layer. >> >> I'm not like this "solution" too, but the fact that ibis works with real >> stack without registering 0x81 class is unclear for me. > > Me too. See below. > >>> > Somehow it works without ibsim - so I suspect user_mad handles it. >>> > >>> > (Hal, could you clarify?) >>> >>> The kernel (user_mad/mad) does not change the requested registrations >>> but I'm not sure I understand the question you are asking to be >>> clarified. Is that what you're asking ? >> >> ibis works somehow with real stack. It registers 0x1 class only and >> uses direct routing SMPs. Do you have any idea about why >> (osm_vendor_idumad and/or libibumad don't help)? > > libibumad umad_register does not do anything that would affect this > either. I can only conclude there must be something in ibutils that > fixes this if it does work with the real stack. 
It shouldn't be too > hard to track down where that registration for class 0x81 comes from. Are you sure this is the only registration and not DR class too ? That's the first thing to confirm or maybe you've already confirmed this and it wasn't clear to me in what you wrote. If so, I have a theory about what could be occuring. It may be the case that it is an effect of the kernel MAD layer in that a MAD agent can send any class and when using request/response it matches on transaction ID which contains the MAD agent. Unsolicited messages on that other class wouldn't get through though. I just ran a simple test of this and that appears to be the case. -- Hal > > -- Hal > >> Sasha >> > From tziporet at dev.mellanox.co.il Mon Oct 6 05:33:21 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 06 Oct 2008 14:33:21 +0200 Subject: [ofa-general] Status of NFS over RDMA and SRP? In-Reply-To: <48E4DDB1.8010303@array.ca> References: <48E4DDB1.8010303@array.ca> Message-ID: <48EA0591.8060003@mellanox.co.il> Steven Truelove wrote: > Hi, > > I am considering using our existing Infiniband interconnect to > provide high-speed storage access to our compute cluster. It looks > like the two ways to do this are NFS over RDMA and SRP. > SRP initiator is part of the Linux kernel and also part of OFED. SRP target is part of OFED (starting from OFED 1.3) and also submitted to the kernel as part of Generic SCSI target mid-level driver - SCST (http://scst.sourceforge.net) SRP is in GA stage Tziporet From Thomas.Talpey at netapp.com Mon Oct 6 05:35:46 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Mon, 06 Oct 2008 08:35:46 -0400 Subject: [ofa-general] rdma_resolve_route() returning -EINVAL In-Reply-To: <48E9D59D.30104@voltaire.com> References: <48E9D59D.30104@voltaire.com> Message-ID: At 05:08 AM 10/6/2008, Or Gerlitz wrote: >Roland Dreier wrote: >> The issue is more of spec compliance than a likely real-life >scenario... 
and as for why no one else is worrying about it, I think >it's because the only other user of rdma_connect() in the tree is >iSER, and I guess no one worried too much there. SRP uses the IB CM >directly, and waits for timewait exit before calling a connection closed. >Roland, > >I guess there's some tradeoff here between the time connection recovery >would take when the ULPs does wait for the timewait event vs the risk of >getting into the target considering the REQ as stale and rejecting. Does >SRP just wait for the event, or the new connection is established in >parallel, which also means it would use a different QP number always. That's what the NFS/RDMA client does - I always create a new cm_id and qp. So the TIMEWAIT upcall isn't very interesting, unless I change that. My problem was that I started getting the upcall, when I didn't before. Tom. From tziporet at dev.mellanox.co.il Mon Oct 6 05:49:56 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 06 Oct 2008 14:49:56 +0200 Subject: [ofa-general] Catastrophic error on an mthca driver In-Reply-To: <1222860080.31161.279.camel@mundo> References: <1222860080.31161.279.camel@mundo> Message-ID: <48EA0974.70406@mellanox.co.il> Ramiro Alba Queipo wrote: > Hi all, > > I recently had a problem with the server card of an infiniband cluster > which in turn made all the fabric down as the opensm daemon had run > into problems. 
Running dmesg you could see: > > -------------------------------------------------------------------- > [408188.411258] ib_mthca 0000:0c:00.0: Catastrophic error detected: > internal error > [408188.411266] ib_mthca 0000:0c:00.0: buf[00]: 000d0000 > [408188.411269] ib_mthca 0000:0c:00.0: buf[01]: 00000000 > [408188.411271] ib_mthca 0000:0c:00.0: buf[02]: 00000000 > [408188.411274] ib_mthca 0000:0c:00.0: buf[03]: 00000000 > [408188.411276] ib_mthca 0000:0c:00.0: buf[04]: 00000000 > [408188.411279] ib_mthca 0000:0c:00.0: buf[05]: 00127e9c > [408188.411281] ib_mthca 0000:0c:00.0: buf[06]: ffffffff > [408188.411283] ib_mthca 0000:0c:00.0: buf[07]: 00000000 > [408188.411286] ib_mthca 0000:0c:00.0: buf[08]: 00000000 > [408188.411288] ib_mthca 0000:0c:00.0: buf[09]: 00000000 > [408188.411290] ib_mthca 0000:0c:00.0: buf[0a]: 00000000 > [408188.411292] ib_mthca 0000:0c:00.0: buf[0b]: 00000000 > [408188.411295] ib_mthca 0000:0c:00.0: buf[0c]: 00000000 > [408188.411297] ib_mthca 0000:0c:00.0: buf[0d]: 00000000 > [408188.411299] ib_mthca 0000:0c:00.0: buf[0e]: 00000000 > [408188.411302] ib_mthca 0000:0c:00.0: buf[0f]: 00000000 > ------------------------------------------------------------ > Problems get solved once I restarted networking. I mean: > > > Is this a hardware problem? Is there a way to check for a hardware > problem? > It can be a HW problem. I forward this mail to our support people. You can also submit a request on our support web: http://www.mellanox.com/support/support_signup.php Tziporet Tziporet From alex.estrin at qlogic.com Mon Oct 6 06:34:51 2008 From: alex.estrin at qlogic.com (Alex Estrin) Date: Mon, 6 Oct 2008 08:34:51 -0500 Subject: [ofa-general] RE: IPoIB CM connectivity issue. In-Reply-To: References: <20080923083956.GA14288@mtls03> Message-ID: > > > In second case (when OFED is CREQ initiator) only one RC > QP was used > > to establish a connection and apparently bidirectional traffic was > > capable to go through that one QP. 
> > Yes, at least in the case where you have an SRQ-capable adapter, it > doesn't really matter which QP has incoming traffic. However, it was > much simpler in the IPoIB implementation to simply open a QP to send > traffic rather than searching through all passive connections for a > connection to the same peer. > > Is this behavior causing problems for you? I just didn't expect of using TWO QPs per one connection. It probably won't simplify my implementation :) > > > I assume you mean sending ARP replies. Yes, you are > correct. I never > > > noticed before but RFC 4755 does say: > > > > Additionally, all address resolution responses (ARP > or Neighbor > > > Discovery) MUST always be encapsulated in a UD mode packet. > > > Yes, you are right. Please discard my note regarding ARP reply. > > Not sure what you mean -- if Linux is sending ARP replies on > a connected > QP, then that is not allowed according to the RFC. I meant "Linux sends ARP reply over UD" behaviour is in compliance with the document. > However, looking at this quote again I see that the RFC's requirement > rather unfortunately includes neighbour discovery too. It's not *too* > bad to look at the ethertype in the IPoIB pseudo-header to > check for an > ARP packet, but sending all neighbour discovery messages > seems very ugly > -- even just sending all ICMP6 messages via UD wouldn't be > very nice to > implement, and it would eg break ping6 with large messages, > so we would > have to look deep deep into packets to see which were ND messages. > > I wonder what the rationale behind that part of the RFC was? > > - R. Yes, it would be good to know the reason behind this. So far handling ND messages over UD ONLY seem odd considering it's function to detect unreacheable neighbours. If one has alive RC QP to a neighbour most likely it is reacheable. Alex. 
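To make the cost Roland describes concrete, here is a rough sketch (not the Linux IPoIB code) of the per-packet check a sender would need in order to follow the quoted RFC 4755 text strictly. It assumes the 4-byte IPoIB encapsulation header (16-bit type, 16-bit reserved) and the standard ICMPv6 Neighbor Discovery message types 133-137. The function name must_use_ud and the helper ipv6_frame are made up for illustration, and IPv6 extension headers are deliberately not walked, which is exactly the "look deep into packets" work mentioned above.

```python
import struct

ETHERTYPE_ARP = 0x0806
ETHERTYPE_IPV6 = 0x86DD
IPPROTO_ICMPV6 = 58
# ICMPv6 Neighbor Discovery: RS, RA, NS, NA, Redirect
ND_TYPES = frozenset(range(133, 138))

IPOIB_HDR_LEN = 4   # 16-bit type + 16-bit reserved
IPV6_HDR_LEN = 40   # fixed IPv6 header


def must_use_ud(frame: bytes) -> bool:
    """Return True if the frame is ARP or an ND message (hypothetical helper)."""
    if len(frame) < IPOIB_HDR_LEN:
        raise ValueError("frame shorter than IPoIB header")
    (ethertype,) = struct.unpack_from("!H", frame, 0)
    if ethertype == ETHERTYPE_ARP:
        return True          # the cheap case: one look at the type field
    if ethertype != ETHERTYPE_IPV6:
        return False
    ip6 = frame[IPOIB_HDR_LEN:]
    if len(ip6) < IPV6_HDR_LEN + 1:
        return False
    if ip6[6] != IPPROTO_ICMPV6:
        # Assumption: no extension headers between IPv6 and ICMPv6.
        return False
    return ip6[IPV6_HDR_LEN] in ND_TYPES   # first ICMPv6 byte is the type


def ipv6_frame(next_header: int, first_payload_byte: int) -> bytes:
    """Build a minimal IPoIB + IPv6 frame for experimenting (illustrative only)."""
    ip6 = bytearray(IPV6_HDR_LEN)
    ip6[0] = 0x60            # version 6
    ip6[6] = next_header
    ipoib = struct.pack("!HH", ETHERTYPE_IPV6, 0)
    return ipoib + bytes(ip6) + bytes([first_payload_byte])
```

With this check a large ping6 (ICMPv6 type 128) would still be allowed on the connected QP, while a Neighbor Solicitation (type 135) would be steered to UD; whether that split is worth the per-packet inspection is the open question in the thread.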
From hal.rosenstock at gmail.com Mon Oct 6 07:08:12 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 6 Oct 2008 10:08:12 -0400 Subject: ***SPAM*** Re: [ofa-general] [PATCH 0/6] opensm: Unicast Routing Cache In-Reply-To: <48E96928.8030200@dev.mellanox.co.il> References: <48E96928.8030200@dev.mellanox.co.il> Message-ID: Hi Yevgeny, On Sun, Oct 5, 2008 at 9:26 PM, Yevgeny Kliteynik wrote: > Hi Sasha, > > The following series of 6 patches implements unicast routing cache > in OpenSM. > > This implementation (v2, previous version was sent before OFED 1.3) > was rewritten from scratch: > - no caching of existing connectivity > - no caching of existing lid matrices > - each switch has an LFT buffer that contains the result of > the last routing engine execution (instead of one buffer > in ucast_mgr) > - links/ports/nodes changes are spotted during the discovery > - only the links/ports/nodes that went down are cached > - when switch goes down, caching its lid matrices and LFT > > In one of the following cases we can use cached routing > - there is no topology change > - one or more CAs disappeared > - one or more leaf switches disappeared > In these cases cached routing is written to the switches as is > (unless the switch doesn't exist). > If there is any other topology change, existing cache is invalidated > and the routing engine(s) run as usual. Glad to see this! A few comments/questions: It seems that there is a LFT cache per switch. This seems to be a big memory penalty to me (in large subnets). So I have two questions related to this: Can this only be done this way when cached routing is being used ? Also, when cached routing is being used, is this only needed for leaf switches ? I'm wondering when there is a cached node match whether the available peer ports/neighbors are validated (or something equivalent) to know caching is valid ? 
It might also include whether a switch is still a leaf switch (which may be redundant as that should show up as a peer port/neighbor change). It looks like the structure is there for this but I didn't review the code in detail. Are you sure all the memory allocation failures are handled properly within the routing cache code ? What I mean is that NULL is returned and does this always result in a caching not used/routing recalculated ? Also, in that case, should some log message be indicated rather than hiding this ? Nit: doc/current-routing.txt should also be updated for this feature. -- Hal > The patches are: > - patch 1/6: move lft_buf from ucast_mgr to osm_switch > - patch 2/6: Add "-A" or "--ucast_cache" option to opensm > - patch 3/6: adding osm_ucast_cache.{c,h} files (this is > the cache implementation itself) > - patch 4/6: adding new cache files to makefile > - patch 5/6: integrating unicast cache into the discovery > and ucast manager > - patch 6/6: man entry for cached routing > > -- Yevgeny > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From michael.heinz at qlogic.com Mon Oct 6 08:09:14 2008 From: michael.heinz at qlogic.com (Mike Heinz) Date: Mon, 6 Oct 2008 10:09:14 -0500 Subject: [ofa-general] Allowing end-users to query for fabric information In-Reply-To: References: Message-ID: Roland, I've been thinking about this some more and I have to say I'm still a bit confused. Are you saying that any root user on any node of the fabric can change the routing tables? Isn't the ability to access and alter subnet information controlled via the management key? 
-- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Mike Heinz Sent: Monday, September 22, 2008 3:19 PM To: Roland Dreier Cc: general at lists.openfabrics.org Subject: RE: [ofa-general] Allowing end-users to query for fabric information Thanks for the explanation. -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania -----Original Message----- From: Roland Dreier [mailto:rdreier at cisco.com] Sent: Monday, September 22, 2008 3:18 PM To: Mike Heinz Cc: general at lists.openfabrics.org Subject: Re: [ofa-general] Allowing end-users to query for fabric information > What was the reason for making this design choice? While I could > certainly provide boot scripts to change the permissions to > /dev/infiniband/umad*, I'd rather understand why the decision was made > to restrict access. because /dev/infiniband/umadX allows full unfiltered access to send/receive any MADs. Including changing routing tables, bringing ports down, etc. Not stuff that unprivileged users should be able to do. It would make sense to have a higher-level interface that only allows safe queries without side effects, but that's quite a bit more work than just changing permissions on device nodes. - R. 
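As a rough illustration of the kind of "safe queries only" layer Roland mentions, the sketch below filters MADs by management class and method before they would be handed to a umad device. It is not an existing OFED interface: the names READ_ONLY and is_read_only_query are hypothetical, the choice of which class/method pairs count as safe is an assumption, and it only relies on the common MAD header layout (byte 1 is the management class, byte 3 is the method).

```python
# Management classes and methods from the common MAD header.
MGMT_CLASS_SUBN_LID = 0x01       # SM class, LID routed
MGMT_CLASS_SUBN_ADM = 0x03       # subnet administration (SA)
MGMT_CLASS_PERF = 0x04           # performance management
MGMT_CLASS_SUBN_DIRECTED = 0x81  # SM class, direct routed

METHOD_GET = 0x01
METHOD_SET = 0x02
METHOD_GET_TABLE = 0x12          # SubnAdmGetTable

# Assumption: only these class/method pairs are treated as side-effect free.
READ_ONLY = {
    MGMT_CLASS_SUBN_ADM: {METHOD_GET, METHOD_GET_TABLE},
    MGMT_CLASS_PERF: {METHOD_GET},
}


def is_read_only_query(mad: bytes) -> bool:
    """Return True if the MAD is on the (hypothetical) read-only allowlist."""
    if len(mad) < 4:
        return False
    mgmt_class = mad[1]
    method = mad[3]
    return method in READ_ONLY.get(mgmt_class, ())


def make_mad(mgmt_class: int, method: int) -> bytes:
    """Build a 256-byte MAD with only the header fields this sketch inspects."""
    mad = bytearray(256)
    mad[0] = 1               # base version
    mad[1] = mgmt_class
    mad[3] = method
    return bytes(mad)
```

A privileged helper that forwards only MADs passing such a check would let unprivileged tools issue SA queries while SM-class Sets (routing tables, port state) never leave the node, independent of how the M_Key protection level is configured on the fabric.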
_______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From hal.rosenstock at gmail.com Mon Oct 6 08:16:29 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 6 Oct 2008 11:16:29 -0400 Subject: ***SPAM*** Re: [ofa-general] Allowing end-users to query for fabric information In-Reply-To: References: Message-ID: Mike, On Mon, Oct 6, 2008 at 11:09 AM, Mike Heinz wrote: > Roland, > > I've been thinking about this some more and I have to say I'm still a > bit confused. Are you saying that any root user on any node of the > fabric can change the routing tables? Isn't the ability to access and > alter subnet information controlled via the management key? There are two levels to this. First you must be able to send the MAD and once that can happen the receiving SMA performs the usual MKey checks which depend on the protection level assuming it is an SM class MAD like the one to change the routing tables. -- Hal > > > -- > Michael Heinz > Principal Engineer, Qlogic Corporation > King of Prussia, Pennsylvania > > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Mike Heinz > Sent: Monday, September 22, 2008 3:19 PM > To: Roland Dreier > Cc: general at lists.openfabrics.org > Subject: RE: [ofa-general] Allowing end-users to query for fabric > information > > Thanks for the explanation. 
> > > -- > Michael Heinz > Principal Engineer, Qlogic Corporation > King of Prussia, Pennsylvania > > -----Original Message----- > From: Roland Dreier [mailto:rdreier at cisco.com] > Sent: Monday, September 22, 2008 3:18 PM > To: Mike Heinz > Cc: general at lists.openfabrics.org > Subject: Re: [ofa-general] Allowing end-users to query for fabric > information > > > What was the reason for making this design choice? While I could > > certainly provide boot scripts to change the permissions to > > /dev/infiniband/umad*, I'd rather understand why the decision was made >> to restrict access. > > because /dev/infiniband/umadX allows full unfiltered access to > send/receive any MADs. Including changing routing tables, bringing > ports down, etc. Not stuff that unprivileged users should be able to > do. > > It would make sense to have a higher-level interface that only allows > safe queries without side effects, but that's quite a bit more work than > just changing permissions on device nodes. > > - R. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From michael.heinz at qlogic.com Mon Oct 6 08:27:05 2008 From: michael.heinz at qlogic.com (Mike Heinz) Date: Mon, 6 Oct 2008 10:27:05 -0500 Subject: [ofa-general] Allowing end-users to query for fabric information In-Reply-To: References: Message-ID: Well, I guess that's my point - I'd like to be able to create tools for non-root users that would collect interesting information about the fabric. 
As far as I know, this should be a safe operation, because the SA should be protected by the m-key - but it seems that the policy in OFED is that this is not a safe operation and access must be tightly controlled. While it's a trivial task to patch OFED to give non-root users access to the /dev/infiniband/umad* devices, I certainly don't want to provide tools to my users that create security holes in the fabric. -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania -----Original Message----- From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] Sent: Monday, October 06, 2008 11:16 AM To: Mike Heinz Cc: Roland Dreier; general at lists.openfabrics.org Subject: Re: [ofa-general] Allowing end-users to query for fabric information Mike, On Mon, Oct 6, 2008 at 11:09 AM, Mike Heinz wrote: > Roland, > > I've been thinking about this some more and I have to say I'm still a > bit confused. Are you saying that any root user on any node of the > fabric can change the routing tables? Isn't the ability to access and > alter subnet information controlled via the management key? There are two levels to this. First you must be able to send the MAD and once that can happen the receiving SMA performs the usual MKey checks which depend on the protection level assuming it is an SM class MAD like the one to change the routing tables. -- Hal > > > -- > Michael Heinz > Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania > > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Mike Heinz > Sent: Monday, September 22, 2008 3:19 PM > To: Roland Dreier > Cc: general at lists.openfabrics.org > Subject: RE: [ofa-general] Allowing end-users to query for fabric > information > > Thanks for the explanation. 
> > > -- > Michael Heinz > Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania > > -----Original Message----- > From: Roland Dreier [mailto:rdreier at cisco.com] > Sent: Monday, September 22, 2008 3:18 PM > To: Mike Heinz > Cc: general at lists.openfabrics.org > Subject: Re: [ofa-general] Allowing end-users to query for fabric > information > > > What was the reason for making this design choice? While I could > > certainly provide boot scripts to change the permissions to > > /dev/infiniband/umad*, I'd rather understand why the decision was made >> to restrict access. > > because /dev/infiniband/umadX allows full unfiltered access to > send/receive any MADs. Including changing routing tables, bringing > ports down, etc. Not stuff that unprivileged users should be able to > do. > > It would make sense to have a higher-level interface that only allows > safe queries without side effects, but that's quite a bit more work > than just changing permissions on device nodes. > > - R. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From tziporet at mellanox.co.il Mon Oct 6 08:33:20 2008 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 6 Oct 2008 17:33:20 +0200 Subject: [ofa-general] OFED meeting agenda for today (Oct 6) Message-ID: <5D49E7A8952DC44FB38C38FA0D758EADABAF5D@mtlexch01.mtl.com> Agenda for OFED meeting today on OFED 1.4 status toward RC3: 1. Interop event status - Rupert 2. RC3 features: 1. NFS-RDMA to work on RHEL 5.1 - done 2. 
OSM: Cashed routing - patches sent - should be committed in a day or two 3. Cleanup compilation warning - Mellanox started - any progress by other companies? 3. OFED testing status - all 4. Critical bugs review: 1128 blo Othe stefan.roscher at de.ibm.com release IPoIB-CM QP resources in flushing CQE context 1113 cri RHEL dorons at voltaire.com rpm -e scsi-target-utils-0.1-2008715 fails 1198 cri SLES yosefe at voltaire.com hang during ipoib create_child/ifdown 1164 maj SLES eli at mellanox.co.il iperf over IPoIB fails for 100 tcp connections 1247 maj RHEL eli at mellanox.co.il ipoib_ud_test caused kernel oops on ofed_1_4 (sw083/084) 1221 maj SLES Jeffrey.C.Becker at nasa.gov SLES10 sp2: remote logins via ssh fail due to rpcbind and... 1248 maj SLES monis at voltaire.com Bonding - after reboot the host stucks while raising the ... 1099 maj All vlad at mellanox.co.il IPoIB IPv6 does not work on RH4 1153 maj Othe yosefe at voltaire.com OpenSM- Multicast group will not open when IB host is the... 5. OFA BOF at SC08 - Woody 6. Open discussion Tziporet From cameron at harr.org Mon Oct 6 08:31:14 2008 From: cameron at harr.org (Cameron Harr) Date: Mon, 06 Oct 2008 09:31:14 -0600 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48E9E681.8090600@vlnb.net> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> Message-ID: <48EA2F42.80008@harr.org> Vlad, Thanks for the suggestion. As I look via vmstat, my CSw/s rate is fairly constant around 280K when scst_threads=1 (per Vu's suggestion) and pops up to ~330-340K CSw/s when scst_threads is set to 8. I'm currently doing 512B writes, and this gives me about a 4:1 ratio of context switches to IOPs with 1 SCST thread (70K IOPs) and around 4.5:1 when there are 8 SCST threads (75K IOPs). 
You say those numbers could be overkill - do you know of a way to drop the number? I'm very interested in trying Vu's other suggestions (multiple initiators and multiple QPs, but my other initiator has been too busy all weekend to run on. Debug, tracing, and all that was turned off in the SCST Makefiles. -Cameron Vladislav Bolkhovitin wrote: > Cameron Harr wrote: >> I was able to get the latest scst code working with Vu's standalone >> ib_srpt and the kernel IB modules, and dropped my ib_srpt thread >> count to 2. However, I still get about the same IOP performance on >> the target although interrupts on the "busy" cpu have gone up to >> around 140K. Interesting, but now I'm at a bit of a loss as to where >> the bottleneck could be. I figured it was Interrupts, but if the CPU >> is handling more right now, perhaps the problem is elsewhere? > > How many context switches per second do you have during your test on > the target? > > Once in scst-devel mailing list there was a thread about observation > that SRP target driver produces 10 context switches per command. See > http://sourceforge.net/mailarchive/message.php?msg_id=e2e108260802070110q1fa084a1j54945d06c16c94f2%40mail.gmail.com > > > If it is so in your case as well, it would very well explain your > issue. 10 CS/cmd is a definite overkill, it should be 1 or, at max, 2 > CS/cmd. > > BTW, I suppose you don't use the debug SCST build, do you? > > Vlad > >> Cameron >> >> Cameron Harr wrote: >>> Cameron Harr wrote: >>>> Additionally, I found that I can load the newer scst code if I use >>>> the kernel-supplied modules and the standalone srpt-1.0.0 package >>>> that I think you provide Vu. I was about to try it along with >>>> dropping a module param for ib_srpt (I was using a thread count of >>>> 32 that had given me better performance on an earlier test). I'll >>>> report back on this. >>> Not much luck using the newer scst code and default kernel modules >>> (Running CentOS 5.2). 
If I try using the default kernel modules on >>> the initiator, I can't get them to see anything (the ofed SM pkg >>> doesn't see any devices to run on). When using the regular OFED on >>> the initiator, my target dies when I try to attach to the target on >>> the initiator: >>> --------------------------------- >>> ib_srpt: Host login i_port_id=0x0:0x2c90300026053 >>> t_port_id=0x2c90300026046:0x2c90300026046 it_iu_len=996 >>> Oct 3 13:44:23 test05 kernel: i[4127]: scst: >>> scst_mgmt_thread:5187:***CRITICAL ERROR*** session ffff8107f3222b88 >>> is in scst_sess_shut_list, but in unknown shut phase 0 >>> BUG at /usr/src/scst.tot/src/scst_targ.c:5188 >>> ----------- [cut here ] --------- [please bite here ] --------- >>> Kernel BUG at /usr/src/scst.tot/src/scst_targ.c:5188 >>> invalid opcode: 0000 [1] SMP >>> last sysfs file: /devices/pci0000:00/0000:00:00.0/class >>> CPU 2 >>> Modules linked in: ib_srpt(U) ib_cm ib_sa scst_vdisk(U) scst(U) >>> fio_driver(PU) fio_port(PU) mlx4_ib ib_mad ib_core ipv6 xfrm_nalgo >>> crypto_api autofs4 hidp rfcomm l2cap bluetooth sunrpc nls_utf8 >>> hfsplus dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec >>> button battery asus_acpi acpi_memhotplug ac parport_pc lp parport >>> i2c_i801 i5000_edac i2c_core edac_mc pcspkr shpchp mlx4_core e1000e >>> ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd >>> Pid: 4127, comm: scsi_tgt_mgmt Tainted: P 2.6.18-92.1.13.el5 #1 >>> RIP: 0010:[] [] >>> :scst:scst_mgmt_thread+0x3ff/0x577 >>> --------------------------------- >>> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general >> > From jsquyres at cisco.com Mon Oct 6 08:48:31 2008 From: jsquyres at cisco.com (Jeff Squyres) Date: Mon, 6 Oct 2008 11:48:31 -0400 Subject: [ofa-general] OFED meeting agenda for today 
(Oct 6) In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EADABAF5D@mtlexch01.mtl.com> References: <5D49E7A8952DC44FB38C38FA0D758EADABAF5D@mtlexch01.mtl.com> Message-ID: Are we meeting at noon US Eastern (about 15 mins from now) or 12:30pm (about 45 mins from now)? On Oct 6, 2008, at 11:33 AM, Tziporet Koren wrote: > > > Agenda for OFED meeting today on OFED 1.4 status toward RC3: > > 1. Interop event status - Rupert > 2. RC3 features: > 1. NFS-RDMA to work on RHEL 5.1 - done > 2. OSM: Cashed routing - patches sent - should be committed in a > day or two > 3. Cleanup compilation warning - Mellanox started - any progress > by other companies? > 3. OFED testing status - all > 4. Critical bugs review: > 1128 blo Othe stefan.roscher at de.ibm.com release > IPoIB-CM QP resources in flushing CQE context > 1113 cri RHEL dorons at voltaire.com rpm -e > scsi-target-utils-0.1-2008715 fails > 1198 cri SLES yosefe at voltaire.com hang during > ipoib create_child/ifdown > 1164 maj SLES eli at mellanox.co.il iperf over IPoIB > fails for 100 tcp connections > 1247 maj RHEL eli at mellanox.co.il ipoib_ud_test > caused kernel oops on ofed_1_4 (sw083/084) > 1221 maj SLES Jeffrey.C.Becker at nasa.gov SLES10 sp2: > remote logins via ssh fail due to rpcbind and... > 1248 maj SLES monis at voltaire.com Bonding - after > reboot the host stucks while raising the ... > 1099 maj All vlad at mellanox.co.il IPoIB IPv6 does > not work on RH4 > 1153 maj Othe yosefe at voltaire.com OpenSM- > Multicast group will not open when IB host is the... > 5. OFA BOF at SC08 - Woody > 6. 
Open discussion > > Tziporet > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Jeff Squyres Cisco Systems From publications at kressworks.com Mon Oct 6 08:57:14 2008 From: publications at kressworks.com (publications) Date: Mon, 6 Oct 2008 11:57:14 -0400 Subject: [ofa-general] OFED Roll In-Reply-To: References: <48E4F93A.8040309@ibt.unam.mx> Message-ID: <9A1DE9E267DB43339B1589572062D0AF@inspiron9100> Am I correct that the Cisco OFED Roll installs Infiniband but not Infiniband over IP? Does it just use RDMA as a transport? The OFED download from Openfabrics installs IB over IP and I prefer not to use it since the latencies are double that of RDMA and the throughput is about one half of RDMA. So, another question. Suppose I have already installed OFED 1.3.1 with IB over IP. How do I configure my system (including the ib0.conf and other conf files) to use RDMA rather than IB over IP? Thanks for your help. Jim From eli at dev.mellanox.co.il Mon Oct 6 08:58:58 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Mon, 6 Oct 2008 17:58:58 +0200 Subject: [ofa-general] [PATCH/RFC] IB/mthca: Use pci_request_regions() In-Reply-To: References: Message-ID: <20081006155722.GA27011@mtls03> On Mon, Sep 29, 2008 at 09:41:37PM -0700, Roland Dreier wrote: > Back in prehistoric (pre-git!) days, the kernel's MSI-X support did > request_mem_region() on a device's MSI-X tables, which meant that a > driver that enabled MSI-X couldn't use pci_request_regions() (since > that would clash with the PCI layer's MSI-X request). > > However, that was removed (by me!) years ago, so mthca can just use > pci_request_regions() and pci_release_regions() instead of its own > much more complicated code that avoids requesting the MSI-X tables. > Looks like a nice diet to the code. 
Acked by: Eli Cohen From hal.rosenstock at gmail.com Mon Oct 6 09:00:17 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 6 Oct 2008 12:00:17 -0400 Subject: ***SPAM*** Re: [ofa-general] Allowing end-users to query for fabric information In-Reply-To: References: Message-ID: On Mon, Oct 6, 2008 at 11:27 AM, Mike Heinz wrote: > Well, > > I guess that's my point - I'd like to be able to create tools for > non-root users that would collect interesting information about the > fabric. As far as I know, this should be a safe operation, because the > SA should be protected by the m-key - but it seems that the policy in > OFED is that this is not a safe operation and access must be tightly > controlled. Do you mean SM or SA ? Subverting the SM is not a good idea. The SM is the central point for setting up SM attributes. Policy needs to be instilled through the SM. There are some SA attributes which are somewhat dangerous too as they are essentially writable as well from an end node. Furthermore, most fabrics do not utilize MKey protection so the second level is not there yet and only the most primitive form of this is available within some SMs. > While it's a trivial task to patch OFED to give non-root users access to > the /dev/infiniband/umad* devices, I certainly don't want to provide > tools to my users that create security holes in the fabric. IMO this would do that although I would phrase it slightly differently. 
-- Hal > -- > Michael Heinz > Principal Engineer, Qlogic Corporation > King of Prussia, Pennsylvania > > -----Original Message----- > From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > Sent: Monday, October 06, 2008 11:16 AM > To: Mike Heinz > Cc: Roland Dreier; general at lists.openfabrics.org > Subject: Re: [ofa-general] Allowing end-users to query for fabric > information > > Mike, > > On Mon, Oct 6, 2008 at 11:09 AM, Mike Heinz > wrote: >> Roland, >> >> I've been thinking about this some more and I have to say I'm still a >> bit confused. Are you saying that any root user on any node of the >> fabric can change the routing tables? Isn't the ability to access and >> alter subnet information controlled via the management key? > > There are two levels to this. First you must be able to send the MAD and > once that can happen the receiving SMA performs the usual MKey checks > which depend on the protection level assuming it is an SM class MAD like > the one to change the routing tables. > > -- Hal > >> >> >> -- >> Michael Heinz >> Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania >> >> -----Original Message----- >> From: general-bounces at lists.openfabrics.org >> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Mike Heinz >> Sent: Monday, September 22, 2008 3:19 PM >> To: Roland Dreier >> Cc: general at lists.openfabrics.org >> Subject: RE: [ofa-general] Allowing end-users to query for fabric >> information >> >> Thanks for the explanation. >> >> >> -- >> Michael Heinz >> Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania >> >> -----Original Message----- >> From: Roland Dreier [mailto:rdreier at cisco.com] >> Sent: Monday, September 22, 2008 3:18 PM >> To: Mike Heinz >> Cc: general at lists.openfabrics.org >> Subject: Re: [ofa-general] Allowing end-users to query for fabric >> information >> >> > What was the reason for making this design choice? 
While I could > > >> certainly provide boot scripts to change the permissions to > >> /dev/infiniband/umad*, I'd rather understand why the decision was made >>> to restrict access. >> >> because /dev/infiniband/umadX allows full unfiltered access to >> send/receive any MADs. Including changing routing tables, bringing >> ports down, etc. Not stuff that unprivileged users should be able to >> do. >> >> It would make sense to have a higher-level interface that only allows >> safe queries without side effects, but that's quite a bit more work >> than just changing permissions on device nodes. >> >> - R. >> > From vst at vlnb.net Mon Oct 6 09:04:22 2008 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Mon, 06 Oct 2008 20:04:22 +0400 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48EA2F42.80008@harr.org> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> Message-ID: <48EA3706.2080700@vlnb.net> Cameron Harr wrote: > Vlad, > Thanks for the suggestion. As I look via vmstat, my CSw/s rate is fairly > constant around 280K when scst_threads=1 (per Vu's suggestion) and pops > up to ~330-340K CSw/s when scst_threads is set to 8. 
I'm currently doing > 512B writes, and this gives me about a 4:1 ratio of context switches to > IOPs with 1 SCST thread (70K IOPs) and around 4.5:1 when there are 8 > SCST threads (75K IOPs). This is still too high. Considering that each CS costs about 1 microsecond, you can estimate how many IOPS it costs you. > You say those numbers could be overkill - do > you know of a way to drop the number? Sorry, I don't. I can only say that the problem of too many CSs is in the SRPT driver. With the qla2x00t driver and BLOCKIO backstorage you will have 1 CS/sec or less on average. > I'm very interested in trying Vu's > other suggestions (multiple initiators and multiple QPs, but my other > initiator has been too busy all weekend to run on. > > Debug, tracing, and all that was turned off in the SCST Makefiles. > > -Cameron > > Vladislav Bolkhovitin wrote: >> Cameron Harr wrote: >>> I was able to get the latest scst code working with Vu's standalone >>> ib_srpt and the kernel IB modules, and dropped my ib_srpt thread >>> count to 2. However, I still get about the same IOP performance on >>> the target although interrupts on the "busy" cpu have gone up to >>> around 140K. Interesting, but now I'm at a bit of a loss as to where >>> the bottleneck could be. I figured it was Interrupts, but if the CPU >>> is handling more right now, perhaps the problem is elsewhere? >> How many context switches per second do you have during your test on >> the target? >> >> Once in scst-devel mailing list there was a thread about observation >> that SRP target driver produces 10 context switches per command. See >> http://sourceforge.net/mailarchive/message.php?msg_id=e2e108260802070110q1fa084a1j54945d06c16c94f2%40mail.gmail.com >> >> >> If it is so in your case as well, it would very well explain your >> issue. 10 CS/cmd is a definite overkill, it should be 1 or, at max, 2 >> CS/cmd. >> >> BTW, I suppose you don't use the debug SCST build, do you? 
>> >> Vlad From tom at opengridcomputing.com Mon Oct 6 09:12:46 2008 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 6 Oct 2008 11:12:46 -0500 Subject: [ofa-general] [PATCH 00/03] RDMA Transport Support for 9P Message-ID: <1223309569-12572-1-git-send-email-tom@opengridcomputing.com> Roland: This patchset implements an RDMA transport provider for the v9fs (Plan 9 filesystem). Could you take a look at it and let us know what you think? Thanks, Tom Here is the original posting... Eric: This patch series implements an RDMA Transport provider for 9P and is relative to your for-next branch. The RDMA support is built on the OpenFabrics API and uses SEND and RECV to exchange data. This patch series has been tested with dbench and iozone. Signed-off-by: Tom Tucker Signed-off-by: Latchesar Ionkov [PATCH 01/03] 9prdma: RDMA Transport Support for 9P net/9p/trans_rdma.c | 996 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 996 insertions(+), 0 deletions(-) [PATCH 02/03] 9prdma: Makefile change for the RDMA transport net/9p/Makefile | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-) [PATCH 03/03] 9prdma: Kconfig changes for the RDMA transport net/9p/Kconfig | 6 ++++++ 1 files changed, 6 insertions(+), 0 deletions(-) From tom at opengridcomputing.com Mon Oct 6 09:12:48 2008 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 6 Oct 2008 11:12:48 -0500 Subject: [ofa-general] [PATCH 02/03] 9prdma: Makefile change for the RDMA transport In-Reply-To: <1223309569-12572-2-git-send-email-tom@opengridcomputing.com> References: <1223309569-12572-1-git-send-email-tom@opengridcomputing.com> <1223309569-12572-2-git-send-email-tom@opengridcomputing.com> Message-ID: <1223309569-12572-3-git-send-email-tom@opengridcomputing.com> This adds a make rule for the 9pnet_rdma module that implements the RDMA transport. 
Signed-off-by: Tom Tucker Signed-off-by: Latchesar Ionkov --- net/9p/Makefile | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/net/9p/Makefile b/net/9p/Makefile index 5192194..bc909ab 100644 --- a/net/9p/Makefile +++ b/net/9p/Makefile @@ -1,5 +1,6 @@ obj-$(CONFIG_NET_9P) := 9pnet.o obj-$(CONFIG_NET_9P_VIRTIO) += 9pnet_virtio.o +obj-$(CONFIG_NET_9P_RDMA) += 9pnet_rdma.o 9pnet-objs := \ mod.o \ @@ -12,3 +13,6 @@ obj-$(CONFIG_NET_9P_VIRTIO) += 9pnet_virtio.o 9pnet_virtio-objs := \ trans_virtio.o \ + +9pnet_rdma-objs := \ + trans_rdma.o \ From tom at opengridcomputing.com Mon Oct 6 09:12:49 2008 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 6 Oct 2008 11:12:49 -0500 Subject: [ofa-general] [PATCH 03/03] 9prdma: Kconfig changes for the RDMA transport In-Reply-To: <1223309569-12572-3-git-send-email-tom@opengridcomputing.com> References: <1223309569-12572-1-git-send-email-tom@opengridcomputing.com> <1223309569-12572-2-git-send-email-tom@opengridcomputing.com> <1223309569-12572-3-git-send-email-tom@opengridcomputing.com> Message-ID: <1223309569-12572-4-git-send-email-tom@opengridcomputing.com> This patch adds a config option for the 9P RDMA transport. Signed-off-by: Tom Tucker Signed-off-by: Latchesar Ionkov --- net/9p/Kconfig | 6 ++++++ 1 files changed, 6 insertions(+), 0 deletions(-) diff --git a/net/9p/Kconfig b/net/9p/Kconfig index ff34c5a..c42c0c4 100644 --- a/net/9p/Kconfig +++ b/net/9p/Kconfig @@ -20,6 +20,12 @@ config NET_9P_VIRTIO This builds support for a transports between guest partitions and a host partition. +config NET_9P_RDMA + depends on NET_9P && INFINIBAND && EXPERIMENTAL + tristate "9P RDMA Transport (Experimental)" + help + This builds support for a RDMA transport. 
+ config NET_9P_DEBUG bool "Debug information" depends on NET_9P From tom at opengridcomputing.com Mon Oct 6 09:12:47 2008 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 6 Oct 2008 11:12:47 -0500 Subject: [ofa-general] [PATCH 01/03] 9prdma: RDMA Transport Support for 9P In-Reply-To: <1223309569-12572-1-git-send-email-tom@opengridcomputing.com> References: <1223309569-12572-1-git-send-email-tom@opengridcomputing.com> Message-ID: <1223309569-12572-2-git-send-email-tom@opengridcomputing.com> This file implements the RDMA transport provider for 9P. It allows mounts to be performed over iWARP and IB capable network interfaces and uses the OpenFabrics API to perform I/O. Signed-off-by: Tom Tucker Signed-off-by: Latchesar Ionkov --- net/9p/trans_rdma.c | 1025 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 1025 insertions(+), 0 deletions(-) diff --git a/net/9p/trans_rdma.c b/net/9p/trans_rdma.c new file mode 100644 index 0000000..f919768 --- /dev/null +++ b/net/9p/trans_rdma.c @@ -0,0 +1,1025 @@ +/* + * linux/fs/9p/trans_rdma.c + * + * RDMA transport layer based on the trans_fd.c implementation. + * + * Copyright (C) 2008 by Tom Tucker + * Copyright (C) 2006 by Russ Cox + * Copyright (C) 2004-2005 by Latchesar Ionkov + * Copyright (C) 2004-2008 by Eric Van Hensbergen + * Copyright (C) 1997-2002 by Ron Minnich + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to: + * Free Software Foundation + * 51 Franklin Street, Fifth Floor + * Boston, MA 02111-1301 USA + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define P9_PORT 5640 +#define P9_RDMA_SQ_DEPTH 32 +#define P9_RDMA_RQ_DEPTH 32 +#define P9_RDMA_SEND_SGE 4 +#define P9_RDMA_RECV_SGE 4 +#define P9_RDMA_IRD 0 +#define P9_RDMA_ORD 0 +#define P9_RDMA_TIMEOUT 30000 /* 30 seconds */ +#define P9_RDMA_MAXSIZE (4*4096) /* Min SGE is 4, so we can + * safely advertise a maxsize + * of 64k */ + +#define P9_RDMA_MAX_SGE (P9_RDMA_MAXSIZE >> PAGE_SHIFT) +/** + * struct p9_trans_rdma - RDMA transport instance + * + * @state: tracks the transport state machine for connection setup and tear down + * @cm_id: The RDMA CM ID + * @pd: Protection Domain pointer + * @qp: Queue Pair pointer + * @cq: Completion Queue pointer + * @lkey: The local access only memory region key + * @next_tag: The next tag for tracking rpc + * @timeout: Number of uSecs to wait for connection management events + * @sq_depth: The depth of the Send Queue + * @sq_count: Number of WR on the Send Queue + * @rq_depth: The depth of the Receive Queue. NB: I _think_ that 9P is + * purely req/rpl (i.e. no unaffiliated replies, but I'm not sure, so + * I'm allowing this to be tweaked separately. 
+ * @addr: The remote peer's address + * @req_lock: Protects the active request list + * @req_list: List of sent RPC awaiting replies + * @send_wait: Wait list when the SQ fills up + * @cm_done: Completion event for connection management tracking + */ +struct p9_trans_rdma { + enum { + P9_RDMA_INIT, + P9_RDMA_ADDR_RESOLVED, + P9_RDMA_ROUTE_RESOLVED, + P9_RDMA_CONNECTED, + P9_RDMA_FLUSHING, + P9_RDMA_CLOSING, + P9_RDMA_CLOSED, + } state; + struct rdma_cm_id *cm_id; + struct ib_pd *pd; + struct ib_qp *qp; + struct ib_cq *cq; + struct ib_mr *dma_mr; + u32 lkey; + atomic_t next_tag; + long timeout; + int sq_depth; + atomic_t sq_count; + int rq_depth; + struct sockaddr_in addr; + + spinlock_t req_lock; + struct list_head req_list; + + wait_queue_head_t send_wait; + struct completion cm_done; + struct p9_idpool *tagpool; +}; + +/** + * p9_rdma_context - Keeps track of in-process WR + * + * @wc_op: Mellanox's broken HW doesn't provide the original WR op + * when the CQE completes in error. This forces apps to keep track of + * the op themselves. Yes, it's a Pet Peeve of mine ;-) + * @busa: Bus address to unmap when the WR completes + * @req: Keeps track of requests (send) + * @rcall: Keeps track of replies (receive) + */ +struct p9_rdma_req; +struct p9_rdma_context { + enum ib_wc_opcode wc_op; + dma_addr_t busa; + union { + struct p9_rdma_req *req; + struct p9_fcall *rcall; + }; +}; + +#define P9_RDMA_REQ_FLUSHING 0 +#define P9_RDMA_REQ_COMPLETE 1 +/** + * struct p9_rdma_req - tracks the request and reply fcall structures. 
+ * + * @req_lock: protects req_list + * @tcall: request &p9_fcall structure + * @rcall: response &p9_fcall structure + * @err: error state + * @cb: callback for when response is received + * @cba: argument to pass to callback + * @flush: flag to indicate RPC has been flushed + * @req_list: list link for higher level objects to chain requests + * + */ +struct p9_rdma_req { + struct p9_fcall *tcall; + struct p9_fcall *rcall; + struct completion done; + struct list_head list; + u16 tag; + int err; + atomic_t ref; + unsigned long flags; + spinlock_t lock; +}; + +static struct p9_rdma_req *rdma_req_get(void) +{ + struct p9_rdma_req *req; + req = kzalloc(sizeof *req, GFP_KERNEL); + if (req) { + atomic_set(&req->ref, 1); + init_completion(&req->done); + INIT_LIST_HEAD(&req->list); + spin_lock_init(&req->lock); + req->flags = 0; + } + return req; +} + +static void rdma_req_put(struct p9_rdma_req *req) +{ + if (req && atomic_dec_and_test(&req->ref)) + kfree(req); +} + +/** + * p9_rdma_opts - Collection of mount options + * + * @sq_depth: The requested depth of the SQ. This really doesn't need + * to be any deeper than the number of threads used in the client + * @rq_depth: The depth of the RQ. 
Should be greater than or equal to SQ depth + * @timeout: Time to wait in msecs for CM events + */ +struct p9_rdma_opts { + short port; + int sq_depth; + int rq_depth; + long timeout; +}; + +/* + * Option Parsing (code inspired by NFS code) + */ + +enum { + /* Options that take integer arguments */ + Opt_port, Opt_rq_depth, Opt_sq_depth, Opt_timeout, Opt_err, +}; + +static match_table_t tokens = { + {Opt_port, "port=%u"}, + {Opt_sq_depth, "sq=%u"}, + {Opt_rq_depth, "rq=%u"}, + {Opt_timeout, "timeout=%u"}, + {Opt_err, NULL}, +}; + +static int +rdma_rpc(struct p9_trans *t, struct p9_fcall *tc, struct p9_fcall **rc); + +/** + * parse_options - parse mount options into session structure + * @options: options string passed from mount + * @opts: transport-specific structure to parse options into + * + * Returns 0 upon success, -ERRNO upon failure + */ +static int parse_opts(char *params, struct p9_rdma_opts *opts) +{ + char *p; + substring_t args[MAX_OPT_ARGS]; + int option; + char *options; + int ret; + + opts->port = P9_PORT; + opts->sq_depth = P9_RDMA_SQ_DEPTH; + opts->rq_depth = P9_RDMA_RQ_DEPTH; + opts->timeout = P9_RDMA_TIMEOUT; + + if (!params) + return 0; + + options = kstrdup(params, GFP_KERNEL); + if (!options) { + P9_DPRINTK(P9_DEBUG_ERROR, + "failed to allocate copy of option string\n"); + return -ENOMEM; + } + + while ((p = strsep(&options, ",")) != NULL) { + int token; + int r; + if (!*p) + continue; + token = match_token(p, tokens, args); + r = match_int(&args[0], &option); + if (r < 0) { + P9_DPRINTK(P9_DEBUG_ERROR, + "integer field, but no integer?\n"); + ret = r; + continue; + } + switch (token) { + case Opt_port: + opts->port = option; + break; + case Opt_sq_depth: + opts->sq_depth = option; + break; + case Opt_rq_depth: + opts->rq_depth = option; + break; + case Opt_timeout: + opts->timeout = option; + break; + default: + continue; + } + } + /* RQ must be at least as large as the SQ */ + opts->rq_depth = max(opts->rq_depth, opts->sq_depth); + 
kfree(options); + return 0; +} + +/* + * Queues the request to the list of active requests on the transport + */ +static void enqueue_request(struct p9_trans_rdma *rdma, + struct p9_rdma_context *c) +{ + struct p9_rdma_req *req = c->req; + unsigned long flags; + atomic_inc(&req->ref); + spin_lock_irqsave(&rdma->req_lock, flags); + list_add_tail(&req->list, &rdma->req_list); + spin_unlock_irqrestore(&rdma->req_lock, flags); +} + +static void dequeue_request(struct p9_trans_rdma *rdma, struct p9_rdma_req *req) +{ + unsigned long flags; + spin_lock_irqsave(&rdma->req_lock, flags); + list_del(&req->list); + spin_unlock_irqrestore(&rdma->req_lock, flags); + rdma_req_put(req); +} + +/* + * Searches the list of requests on the transport and returns the request + * with the matching tag + */ +static struct p9_rdma_req * +find_req(struct p9_trans_rdma *rdma, u16 tag) +{ + unsigned long flags; + struct p9_rdma_req *req; + int found = 0; + + spin_lock_irqsave(&rdma->req_lock, flags); + list_for_each_entry(req, &rdma->req_list, list) { + if (req->tag == tag) { + found = 1; + atomic_inc(&req->ref); + break; + } + } + spin_unlock_irqrestore(&rdma->req_lock, flags); + + return found ? 
req : 0; +} + +static int +p9_cm_event_handler(struct rdma_cm_id *id, + struct rdma_cm_event *event) +{ + struct p9_trans *t = id->context; + struct p9_trans_rdma *rdma = t->priv; + switch (event->event) { + case RDMA_CM_EVENT_ADDR_RESOLVED: + BUG_ON(rdma->state != P9_RDMA_INIT); + rdma->state = P9_RDMA_ADDR_RESOLVED; + break; + + case RDMA_CM_EVENT_ROUTE_RESOLVED: + BUG_ON(rdma->state != P9_RDMA_ADDR_RESOLVED); + rdma->state = P9_RDMA_ROUTE_RESOLVED; + break; + + case RDMA_CM_EVENT_ESTABLISHED: + BUG_ON(rdma->state != P9_RDMA_ROUTE_RESOLVED); + rdma->state = P9_RDMA_CONNECTED; + break; + + case RDMA_CM_EVENT_DISCONNECTED: + if (rdma) + rdma->state = P9_RDMA_CLOSED; + if (t) + t->status = Disconnected; + break; + + case RDMA_CM_EVENT_TIMEWAIT_EXIT: + break; + + case RDMA_CM_EVENT_ADDR_CHANGE: + case RDMA_CM_EVENT_ROUTE_ERROR: + case RDMA_CM_EVENT_DEVICE_REMOVAL: + case RDMA_CM_EVENT_MULTICAST_JOIN: + case RDMA_CM_EVENT_MULTICAST_ERROR: + case RDMA_CM_EVENT_REJECTED: + case RDMA_CM_EVENT_CONNECT_REQUEST: + case RDMA_CM_EVENT_CONNECT_RESPONSE: + case RDMA_CM_EVENT_CONNECT_ERROR: + case RDMA_CM_EVENT_ADDR_ERROR: + case RDMA_CM_EVENT_UNREACHABLE: + t->status = Disconnected; + rdma_disconnect(rdma->cm_id); + break; + default: + BUG(); + } + complete(&rdma->cm_done); + return 0; +} + +static void process_request(struct p9_trans *trans, struct p9_rdma_req *req) +{ + int ecode; + struct p9_str *ename; + + if (req->rcall->id == P9_RERROR) { + ecode = req->rcall->params.rerror.errno; + ename = &req->rcall->params.rerror.error; + + P9_DPRINTK(P9_DEBUG_MUX, "Rerror %.*s\n", ename->len, + ename->str); + + if (trans->extended) + req->err = -ecode; + + if (!req->err) { + req->err = p9_errstr2errno(ename->str, ename->len); + + /* string match failed */ + if (!req->err) { + PRINT_FCALL_ERROR("unknown error", req->rcall); + req->err = -ESERVERFAULT; + } + } + } else if (req->tcall && req->rcall->id != req->tcall->id + 1) { + P9_DPRINTK(P9_DEBUG_ERROR, + "fcall mismatch: expected %d, 
got %d\n", + req->tcall->id + 1, req->rcall->id); + if (!req->err) + req->err = -EIO; + } +} + +static void +handle_recv(struct p9_trans *trans, struct p9_trans_rdma *rdma, + struct p9_rdma_context *c, enum ib_wc_status status, u32 byte_len) +{ + struct p9_rdma_req *req; + unsigned long flags; + int err; + + req = NULL; + err = -EIO; + ib_dma_unmap_single(rdma->cm_id->device, c->busa, trans->msize, + DMA_FROM_DEVICE); + + if (status != IB_WC_SUCCESS) + goto err_out; + + err = p9_deserialize_fcall(c->rcall->sdata, byte_len, + c->rcall, trans->extended); + if (err < 0) + goto err_out; + +#ifdef CONFIG_NET_9P_DEBUG + if ((p9_debug_level&P9_DEBUG_FCALL) == P9_DEBUG_FCALL) { + char buf[150]; + + p9_printfcall(buf, sizeof(buf), c->rcall, + trans->extended); + printk(KERN_NOTICE ">>> %p %s\n", trans, buf); + } +#endif + req = find_req(rdma, c->rcall->tag); + if (req) { + spin_lock_irqsave(&req->lock, flags); + if (!test_bit(P9_RDMA_REQ_FLUSHING, &req->flags)) { + set_bit(P9_RDMA_REQ_COMPLETE, &req->flags); + req->rcall = c->rcall; + process_request(trans, req); + complete(&req->done); + } + spin_unlock_irqrestore(&req->lock, flags); + rdma_req_put(req); + } + return; + + err_out: + P9_DPRINTK(P9_DEBUG_ERROR, "req %p err %d status %d\n", + req, err, status); + rdma->state = P9_RDMA_FLUSHING; + trans->status = Disconnected; + return; +} + +static void +handle_send(struct p9_trans *trans, struct p9_trans_rdma *rdma, + struct p9_rdma_context *c, enum ib_wc_status status, u32 byte_len) +{ + ib_dma_unmap_single(rdma->cm_id->device, + c->busa, c->req->tcall->size, + DMA_TO_DEVICE); +} + +static void qp_event_handler(struct ib_event *event, void *context) +{ + P9_DPRINTK(P9_DEBUG_ERROR, "QP event %d context %p\n", + event->event, context); +} + +static void cq_comp_handler(struct ib_cq *cq, void *cq_context) +{ + struct p9_trans *trans = cq_context; + struct p9_trans_rdma *rdma = trans->priv; + int ret; + struct ib_wc wc; + + ib_req_notify_cq(rdma->cq, IB_CQ_NEXT_COMP); + while 
((ret = ib_poll_cq(cq, 1, &wc)) > 0) { + struct p9_rdma_context *c = (void *)wc.wr_id; + + switch (c->wc_op) { + case IB_WC_RECV: + handle_recv(trans, rdma, c, wc.status, wc.byte_len); + break; + + case IB_WC_SEND: + handle_send(trans, rdma, c, wc.status, wc.byte_len); + atomic_dec(&rdma->sq_count); + wake_up(&rdma->send_wait); + rdma_req_put(c->req); + break; + + default: + printk(KERN_ERR "9prdma: unexpected completion type, " + "c->wc_op=%d, wc.opcode=%d, status=%d\n", + c->wc_op, wc.opcode, wc.status); + break; + } + kfree(c); + } +} + +static void cq_event_handler(struct ib_event *e, void *v) +{ + P9_DPRINTK(P9_DEBUG_ERROR, "CQ event %d context %p\n", + e->event, v); +} + +static void rdma_destroy_trans(struct p9_trans *t) +{ + struct p9_trans_rdma *rdma = t->priv; + + if (!rdma) + return; + + if (rdma->dma_mr && !IS_ERR(rdma->dma_mr)) + ib_dereg_mr(rdma->dma_mr); + + if (rdma->qp && !IS_ERR(rdma->qp)) + ib_destroy_qp(rdma->qp); + + if (rdma->pd && !IS_ERR(rdma->pd)) + ib_dealloc_pd(rdma->pd); + + if (rdma->cq && !IS_ERR(rdma->cq)) + ib_destroy_cq(rdma->cq); + + if (rdma->cm_id && !IS_ERR(rdma->cm_id)) + rdma_destroy_id(rdma->cm_id); + + if (rdma->tagpool) + p9_idpool_destroy(rdma->tagpool); + + kfree(rdma); +} + +static int +post_recv(struct p9_trans *trans, struct p9_rdma_context *c) +{ + struct p9_trans_rdma *rdma = trans->priv; + struct ib_recv_wr wr, *bad_wr; + struct ib_sge sge; + int ret; + + c->busa = ib_dma_map_single(rdma->cm_id->device, + c->rcall->sdata, trans->msize, + DMA_FROM_DEVICE); + if (ib_dma_mapping_error(rdma->cm_id->device, c->busa)) + goto error; + + sge.addr = c->busa; + sge.length = trans->msize; + sge.lkey = rdma->lkey; + + wr.next = NULL; + c->wc_op = IB_WC_RECV; + wr.wr_id = (u64)c; + wr.sg_list = &sge; + wr.num_sge = 1; + ret = ib_post_recv(rdma->qp, &wr, &bad_wr); + return ret; + + error: + P9_DPRINTK(P9_DEBUG_ERROR, "EIO\n"); + return -EIO; +} + +static struct p9_fcall *alloc_fcall(struct p9_trans *trans) +{ + struct p9_fcall 
*fc; + fc = kmalloc(sizeof(struct p9_fcall) + trans->msize, GFP_KERNEL); + if (fc) + fc->sdata = fc + 1; + return fc; +} + +static u16 p9_get_tag(struct p9_trans *trans) +{ + int tag; + struct p9_trans_rdma *rdma; + + rdma = trans->priv; + tag = p9_idpool_get(rdma->tagpool); + if (tag < 0) + return P9_NOTAG; + else + return (u16) tag; +} + +static void p9_put_tag(struct p9_trans *trans, u16 tag) +{ + struct p9_trans_rdma *rdma; + + rdma = trans->priv; + if (tag != P9_NOTAG && p9_idpool_check(tag, rdma->tagpool)) + p9_idpool_put(tag, rdma->tagpool); +} + +static int send_request(struct p9_trans *trans, struct p9_rdma_context *c) +{ + struct p9_trans_rdma *rdma = trans->priv; + struct ib_send_wr wr, *bad_wr; + struct ib_sge sge; + + c->busa = ib_dma_map_single(rdma->cm_id->device, + c->req->tcall->sdata, c->req->tcall->size, + DMA_TO_DEVICE); + if (ib_dma_mapping_error(rdma->cm_id->device, c->busa)) + goto error; + +#ifdef CONFIG_NET_9P_DEBUG + if ((p9_debug_level&P9_DEBUG_FCALL) == P9_DEBUG_FCALL) { + char buf[150]; + + p9_printfcall(buf, sizeof(buf), c->req->tcall, trans->extended); + printk(KERN_NOTICE "<<< %p %s\n", trans, buf); + } +#endif + sge.addr = c->busa; + sge.length = c->req->tcall->size; + sge.lkey = rdma->lkey; + + wr.next = NULL; + c->wc_op = IB_WC_SEND; + wr.wr_id = (u64)c; + wr.opcode = IB_WR_SEND; + wr.send_flags = IB_SEND_SIGNALED; + wr.sg_list = &sge; + wr.num_sge = 1; + + if (atomic_inc_return(&rdma->sq_count) >= rdma->sq_depth) + wait_event_interruptible + (rdma->send_wait, + (atomic_read(&rdma->sq_count) < rdma->sq_depth)); + + return ib_post_send(rdma->qp, &wr, &bad_wr); + + error: + P9_DPRINTK(P9_DEBUG_ERROR, "EIO\n"); + return -EIO; +} + +static void flush_request(struct p9_trans *trans, struct p9_rdma_req *req) +{ + struct p9_trans_rdma *rdma = trans->priv; + struct p9_fcall *tc; + struct p9_fcall *rc; + unsigned long flags; + int err; + + /* Check if we received a response despite being interrupted. 
*/ + spin_lock_irqsave(&req->lock, flags); + if (!test_bit(P9_RDMA_REQ_COMPLETE, &req->flags)) + set_bit(P9_RDMA_REQ_FLUSHING, &req->flags); + spin_unlock_irqrestore(&req->lock, flags); + if (!test_bit(P9_RDMA_REQ_FLUSHING, &req->flags)) + goto out; + + tc = p9_create_tflush(req->tag); + if (!tc) { + rdma->state = P9_RDMA_FLUSHING; + goto out; + } + + clear_thread_flag(TIF_SIGPENDING); + err = rdma_rpc(trans, tc, &rc); + kfree(tc); +out: + return; +} + +/** + * rdma_rpc- sends 9P request and waits until a response is available. + * The function can be interrupted. + * @t: transport data + * @tc: request to be sent + * @rc: pointer where a pointer to the response is stored + * + */ +static int +rdma_rpc(struct p9_trans *t, struct p9_fcall *tc, struct p9_fcall **rc) +{ + int err = -ENOMEM, sigpending; + u16 tag; + struct p9_rdma_req *req; + struct p9_trans_rdma *rdma; + struct p9_rdma_context *req_context = NULL; + struct p9_rdma_context *rpl_context = NULL; + unsigned long flags; + + if (t && t->status != Disconnected) + rdma = t->priv; + else + return -EREMOTEIO; + + /* Initialize the request */ + req = rdma_req_get(); + if (!req) + goto err_close_0; + req->tcall = tc; + if (req->tcall->id == P9_TVERSION) + tag = P9_NOTAG; + else + tag = p9_get_tag(t); + + req->tag = tag; + p9_set_tag(req->tcall, tag); + + /* Allocate an fcall for the reply */ + rpl_context = kmalloc(sizeof *rpl_context, GFP_KERNEL); + if (!rpl_context) + goto err_close_1; + + rpl_context->rcall = alloc_fcall(t); + if (!rpl_context->rcall) { + kfree(rpl_context); + goto err_close_1; + } + + /* + * Post a receive buffer for this request. 
We don't know + * which request this reply buffer will end up servicing, but + * we allocate it here to ensure that there is a receive + * buffer available for every request + */ + err = post_recv(t, rpl_context); + if (err) { + kfree(rpl_context->rcall); + kfree(rpl_context); + goto err_close_1; + } + + /* Post the request */ + req_context = kmalloc(sizeof *req_context, GFP_KERNEL); + if (!req_context) + goto err_close_1; + req_context->req = req; + enqueue_request(rdma, req_context); + atomic_inc(&req->ref); + err = send_request(t, req_context); + if (err) { + dequeue_request(rdma, req); + rdma_req_put(req); + kfree(req_context); + goto err_close_1; + } + + sigpending = 0; + if (signal_pending(current)) { + sigpending = 1; + clear_thread_flag(TIF_SIGPENDING); + } + + /* Wait for the response */ + err = wait_for_completion_interruptible(&req->done); + + /* Take request off the active queue */ + dequeue_request(rdma, req); + + /* If not, we need to flush request at the server. */ + if (err == -ERESTARTSYS && rdma->state == P9_RDMA_CONNECTED && + tc->id != P9_TFLUSH) { + flush_request(t, req); + sigpending = 1; + } + + if (sigpending) { + spin_lock_irqsave(&current->sighand->siglock, flags); + recalc_sigpending(); + spin_unlock_irqrestore(&current->sighand->siglock, flags); + } + + /* If we got disconnected while waiting, return an error */ + if (rdma->state != P9_RDMA_CONNECTED) { + P9_DPRINTK(P9_DEBUG_ERROR, "EIO\n"); + err = -EIO; + goto err_close_1; + } + + if (rc) + *rc = req->rcall; + else + kfree(req->rcall); + + /* Return the RPC request's error if there is one */ + if (req->err < 0) + err = req->err; + + rdma_req_put(req); + p9_put_tag(t, tag); + return err; + + err_close_1: + rdma_req_put(req); + p9_put_tag(t, tag); + + err_close_0: + spin_lock_irqsave(&rdma->req_lock, flags); + if (rdma->state < P9_RDMA_CLOSING) { + rdma->state = P9_RDMA_CLOSING; + spin_unlock_irqrestore(&rdma->req_lock, flags); + rdma_disconnect(rdma->cm_id); + } else +
spin_unlock_irqrestore(&rdma->req_lock, flags); + return err; +} + +static void rdma_close(struct p9_trans *trans) +{ + struct p9_trans_rdma *rdma = trans->priv; + if (!rdma) + return; + + trans->status = Disconnected; + rdma_disconnect(rdma->cm_id); + rdma_destroy_trans(trans); +} + +/** + * alloc_trans - Allocate and initialize the rdma transport structure + * @msize: MTU + * @dotu: Extension attribute + * @opts: Mount options structure + */ +static struct p9_trans * +alloc_trans(int msize, unsigned char dotu, struct p9_rdma_opts *opts) +{ + struct p9_trans *trans; + struct p9_trans_rdma *rdma; + + trans = kmalloc(sizeof(struct p9_trans), GFP_KERNEL); + if (!trans) + return NULL; + + trans->msize = msize; + trans->extended = dotu; + trans->rpc = rdma_rpc; + trans->close = rdma_close; + + rdma = trans->priv = kzalloc(sizeof(struct p9_trans_rdma), GFP_KERNEL); + if (!rdma) { + kfree(trans); + return NULL; + } + + rdma->sq_depth = opts->sq_depth; + rdma->rq_depth = opts->rq_depth; + rdma->timeout = opts->timeout; + spin_lock_init(&rdma->req_lock); + INIT_LIST_HEAD(&rdma->req_list); + init_waitqueue_head(&rdma->send_wait); + init_completion(&rdma->cm_done); + atomic_set(&rdma->sq_count, 0); + atomic_set(&rdma->next_tag, 1); + + trans->priv = rdma; + + return trans; +} + +/** + * rdma_create_trans - Transport method for creating a transport instance + * @addr: IP address string + * @args: Mount options string + * @msize: Max message size + * @dotu: Protocol extension flag + */ +static struct p9_trans * +rdma_create_trans(const char *addr, char *args, int msize, unsigned char dotu) +{ + int err; + struct p9_trans *trans; + struct p9_rdma_opts opts; + struct p9_trans_rdma *rdma; + struct rdma_conn_param conn_param; + struct ib_qp_init_attr qp_attr; + struct ib_device_attr devattr; + + /* Parse the transport specific mount options */ + err = parse_opts(args, &opts); + if (err < 0) + return ERR_PTR(err); + + /* Create and initialize the RDMA transport structure */ + trans
= alloc_trans(msize, dotu, &opts); + if (!trans) + return ERR_PTR(-ENOMEM); + + /* Create the RDMA CM ID */ + rdma = trans->priv; + rdma->cm_id = rdma_create_id(p9_cm_event_handler, trans, RDMA_PS_TCP); + if (IS_ERR(rdma->cm_id)) + goto error; + + /* Resolve the server's address */ + rdma->addr.sin_family = AF_INET; + rdma->addr.sin_addr.s_addr = in_aton(addr); + rdma->addr.sin_port = htons(opts.port); + err = rdma_resolve_addr(rdma->cm_id, NULL, + (struct sockaddr *)&rdma->addr, + rdma->timeout); + if (err) + goto error; + wait_for_completion_interruptible(&rdma->cm_done); + if (rdma->state != P9_RDMA_ADDR_RESOLVED) + goto error; + + /* Resolve the route to the server */ + err = rdma_resolve_route(rdma->cm_id, rdma->timeout); + if (err) + goto error; + wait_for_completion_interruptible(&rdma->cm_done); + if (rdma->state != P9_RDMA_ROUTE_RESOLVED) + goto error; + + /* Query the device attributes */ + err = ib_query_device(rdma->cm_id->device, &devattr); + if (err) + goto error; + + /* Create the Completion Queue */ + rdma->cq = ib_create_cq(rdma->cm_id->device, cq_comp_handler, + cq_event_handler, trans, + opts.sq_depth + opts.rq_depth + 1, 0); + if (IS_ERR(rdma->cq)) + goto error; + ib_req_notify_cq(rdma->cq, IB_CQ_NEXT_COMP); + + /* Create the Protection Domain */ + rdma->pd = ib_alloc_pd(rdma->cm_id->device); + if (IS_ERR(rdma->pd)) + goto error; + + /* Cache the DMA lkey in the transport */ + rdma->dma_mr = NULL; + if (0 == (devattr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)) { + rdma->dma_mr = ib_get_dma_mr(rdma->pd, IB_ACCESS_LOCAL_WRITE); + if (IS_ERR(rdma->dma_mr)) + goto error; + rdma->lkey = rdma->dma_mr->lkey; + } else + rdma->lkey = rdma->cm_id->device->local_dma_lkey; + + /* Create the Queue Pair */ + memset(&qp_attr, 0, sizeof qp_attr); + qp_attr.event_handler = qp_event_handler; + qp_attr.qp_context = trans; + qp_attr.cap.max_send_wr = opts.sq_depth; + qp_attr.cap.max_recv_wr = opts.rq_depth; + qp_attr.cap.max_send_sge = P9_RDMA_SEND_SGE; + 
qp_attr.cap.max_recv_sge = P9_RDMA_RECV_SGE; + qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR; + qp_attr.qp_type = IB_QPT_RC; + qp_attr.send_cq = rdma->cq; + qp_attr.recv_cq = rdma->cq; + err = rdma_create_qp(rdma->cm_id, rdma->pd, &qp_attr); + if (err) + goto error; + rdma->qp = rdma->cm_id->qp; + + /* Request a connection */ + memset(&conn_param, 0, sizeof(conn_param)); + conn_param.private_data = NULL; + conn_param.private_data_len = 0; + conn_param.responder_resources = P9_RDMA_IRD; + conn_param.initiator_depth = P9_RDMA_ORD; + err = rdma_connect(rdma->cm_id, &conn_param); + if (err) + goto error; + wait_for_completion_interruptible(&rdma->cm_done); + if (rdma->state != P9_RDMA_CONNECTED) + goto error; + + rdma->tagpool = p9_idpool_create(); + if (IS_ERR(rdma->tagpool)) { + rdma->tagpool = NULL; + goto error; + } + + return trans; + +error: + rdma_destroy_trans(trans); + return ERR_PTR(-ENOTCONN); +} + +static struct p9_trans_module p9_rdma_trans = { + .name = "rdma", + .maxsize = P9_RDMA_MAXSIZE, + .def = 0, + .create = rdma_create_trans, +}; + +/** + * p9_trans_rdma_init - Register the 9P RDMA transport driver + */ +static int __init p9_trans_rdma_init(void) +{ + v9fs_register_trans(&p9_rdma_trans); + return 0; +} + +static void __exit p9_trans_rdma_exit(void) +{ + v9fs_unregister_trans(&p9_rdma_trans); +} + +module_init(p9_trans_rdma_init); +module_exit(p9_trans_rdma_exit); + +MODULE_AUTHOR("Tom Tucker "); +MODULE_DESCRIPTION("RDMA Transport for 9P"); +MODULE_LICENSE("Dual BSD/GPL"); From rdreier at cisco.com Mon Oct 6 09:33:08 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 06 Oct 2008 09:33:08 -0700 Subject: [ofa-general] [PATCH 00/03] RDMA Transport Support for 9P In-Reply-To: <1223309569-12572-1-git-send-email-tom@opengridcomputing.com> (Tom Tucker's message of "Mon, 6 Oct 2008 11:12:46 -0500") References: <1223309569-12572-1-git-send-email-tom@opengridcomputing.com> Message-ID: > This patchset implements an RDMA transport provider for the > v9fs 
(Plan 9 filesystem). Could you take a look at it and let us > know what you think? I sent comments on the initial posting I saw on lkml ... did they not make it to you? > [PATCH 01/03] 9prdma: RDMA Transport Support for 9P > [PATCH 02/03] 9prdma: Makefile change for the RDMA transport > [PATCH 03/03] 9prdma: Kconfig changes for the RDMA transport one meta-comment I didn't send last time: the patches are small enough that I would just send it all in one patch, since it makes sense to apply it that way anyway. - R. From akepner at sgi.com Mon Oct 6 09:32:54 2008 From: akepner at sgi.com (akepner at sgi.com) Date: Mon, 6 Oct 2008 09:32:54 -0700 Subject: [ofa-general] Re: Continue of "defer skb_orphan() until irqs enabled" In-Reply-To: References: <48DA643E.9040605@Voltaire.COM> <20080924162034.GE15133@sgi.com> <20080924171135.GF15133@sgi.com> <20080924191623.GJ15133@sgi.com> <20080925114414.GA25044@mtls03> Message-ID: <20081006163254.GA972@sgi.com> Sorry for the delay in getting back to you on this, but we've done our testing and haven't found any problems. As I mentioned, we're using OFED 1.3.1, so the patch had to be tweaked a bit. The patch we used follows. --- a/drivers/infiniband/ulp/ipoib/ipoib.h 2008-09-09 15:53:24.856316458 -0700 +++ e/drivers/infiniband/ulp/ipoib/ipoib.h 2008-09-29 10:19:00.833519991 -0700 @@ -345,10 +345,9 @@ struct ipoib_ethtool_st { }; /* - * Device private locking: tx_lock protects members used in TX fast - * path (and we use LLTX so upper layers don't do extra locking). - * lock protects everything else. lock nests inside of tx_lock (ie - * tx_lock must be acquired first if needed). + * Device private locking: network stack tx_lock protects members used + * in TX fast path, lock protects everything else. lock nests inside + * of tx_lock (ie tx_lock must be acquired first if needed). 
*/ struct ipoib_dev_priv { spinlock_t lock; @@ -397,7 +396,6 @@ struct ipoib_dev_priv { struct ipoib_vmap rx_vmap_ring; struct ipoib_sg_rx_buf *rx_ring; - spinlock_t tx_lock; struct ipoib_vmap tx_vmap_ring; struct ipoib_tx_buf *tx_ring; unsigned tx_head; --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2008-09-09 15:53:24.856316458 -0700 +++ e/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2008-09-26 13:38:00.066208156 -0700 @@ -776,7 +776,8 @@ void ipoib_cm_handle_tx_wc(struct net_de dev_kfree_skb_any(tx_req->skb); - spin_lock_irqsave(&priv->tx_lock, flags); + netif_tx_lock(dev); + ++tx->tx_tail; if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) && netif_queue_stopped(dev) && @@ -791,7 +792,7 @@ void ipoib_cm_handle_tx_wc(struct net_de "(status=%d, wrid=%d vend_err %x)\n", wc->status, wr_id, wc->vendor_err); - spin_lock(&priv->lock); + spin_lock_irqsave(&priv->lock, flags); neigh = tx->neigh; if (neigh) { @@ -811,10 +812,10 @@ void ipoib_cm_handle_tx_wc(struct net_de clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags); - spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->lock, flags); } - spin_unlock_irqrestore(&priv->tx_lock, flags); + netif_tx_unlock(dev); } int ipoib_cm_dev_open(struct net_device *dev) @@ -1134,7 +1135,6 @@ static void ipoib_cm_tx_destroy(struct i { struct ipoib_dev_priv *priv = netdev_priv(p->dev); struct ipoib_cm_tx_buf *tx_req; - unsigned long flags; unsigned long begin; ipoib_dbg(priv, "Destroy active connection 0x%x head 0x%x tail 0x%x\n", @@ -1165,12 +1165,12 @@ timeout: DMA_TO_DEVICE); dev_kfree_skb_any(tx_req->skb); ++p->tx_tail; - spin_lock_irqsave(&priv->tx_lock, flags); + netif_tx_lock_bh(p->dev); if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) && netif_queue_stopped(p->dev) && test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) netif_wake_queue(p->dev); - spin_unlock_irqrestore(&priv->tx_lock, flags); + netif_tx_unlock_bh(p->dev); } if (p->qp) @@ -1187,6 +1187,7 @@ static int ipoib_cm_tx_handler(struct ib struct ipoib_dev_priv 
*priv = netdev_priv(tx->dev); struct net_device *dev = priv->dev; struct ipoib_neigh *neigh; + unsigned long flags; int ret; switch (event->event) { @@ -1205,8 +1206,8 @@ static int ipoib_cm_tx_handler(struct ib case IB_CM_REJ_RECEIVED: case IB_CM_TIMEWAIT_EXIT: ipoib_dbg(priv, "CM error %d.\n", event->event); - spin_lock_irq(&priv->tx_lock); - spin_lock(&priv->lock); + netif_tx_lock_bh(dev); + spin_lock_irqsave(&priv->lock, flags); neigh = tx->neigh; if (neigh) { @@ -1224,8 +1225,8 @@ static int ipoib_cm_tx_handler(struct ib queue_work(ipoib_workqueue, &priv->cm.reap_task); } - spin_unlock(&priv->lock); - spin_unlock_irq(&priv->tx_lock); + spin_unlock_irqrestore(&priv->lock, flags); + netif_tx_unlock_bh(dev); break; default: break; @@ -1279,19 +1280,24 @@ static void ipoib_cm_tx_start(struct wor struct ib_sa_path_rec pathrec; u32 qpn; - spin_lock_irqsave(&priv->tx_lock, flags); - spin_lock(&priv->lock); + netif_tx_lock_bh(dev); + spin_lock_irqsave(&priv->lock, flags); + while (!list_empty(&priv->cm.start_list)) { p = list_entry(priv->cm.start_list.next, typeof(*p), list); list_del_init(&p->list); neigh = p->neigh; qpn = IPOIB_QPN(neigh->neighbour->ha); memcpy(&pathrec, &p->path->pathrec, sizeof pathrec); - spin_unlock(&priv->lock); - spin_unlock_irqrestore(&priv->tx_lock, flags); + + spin_unlock_irqrestore(&priv->lock, flags); + netif_tx_unlock_bh(dev); + ret = ipoib_cm_tx_init(p, qpn, &pathrec); - spin_lock_irqsave(&priv->tx_lock, flags); - spin_lock(&priv->lock); + + netif_tx_lock_bh(dev); + spin_lock_irqsave(&priv->lock, flags); + if (ret) { neigh = p->neigh; if (neigh) { @@ -1305,44 +1311,52 @@ static void ipoib_cm_tx_start(struct wor kfree(p); } } - spin_unlock(&priv->lock); - spin_unlock_irqrestore(&priv->tx_lock, flags); + + spin_unlock_irqrestore(&priv->lock, flags); + netif_tx_unlock_bh(dev); } static void ipoib_cm_tx_reap(struct work_struct *work) { struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv, cm.reap_task); + struct net_device 
*dev = priv->dev; struct ipoib_cm_tx *p; + unsigned long flags; + + netif_tx_lock_bh(dev); + spin_lock_irqsave(&priv->lock, flags); - spin_lock_irq(&priv->tx_lock); - spin_lock(&priv->lock); while (!list_empty(&priv->cm.reap_list)) { p = list_entry(priv->cm.reap_list.next, typeof(*p), list); list_del(&p->list); - spin_unlock(&priv->lock); - spin_unlock_irq(&priv->tx_lock); + spin_unlock_irqrestore(&priv->lock, flags); + netif_tx_unlock_bh(dev); ipoib_cm_tx_destroy(p); - spin_lock_irq(&priv->tx_lock); - spin_lock(&priv->lock); + netif_tx_lock_bh(dev); + spin_lock_irqsave(&priv->lock, flags); } - spin_unlock(&priv->lock); - spin_unlock_irq(&priv->tx_lock); + + spin_unlock_irqrestore(&priv->lock, flags); + netif_tx_unlock_bh(dev); } static void ipoib_cm_skb_reap(struct work_struct *work) { struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv, cm.skb_task); + struct net_device *dev = priv->dev; struct sk_buff *skb; - + unsigned long flags; unsigned mtu = priv->mcast_mtu; - spin_lock_irq(&priv->tx_lock); - spin_lock(&priv->lock); + netif_tx_lock_bh(dev); + spin_lock_irqsave(&priv->lock, flags); + while ((skb = skb_dequeue(&priv->cm.skb_queue))) { - spin_unlock(&priv->lock); - spin_unlock_irq(&priv->tx_lock); + spin_unlock_irqrestore(&priv->lock, flags); + netif_tx_unlock_bh(dev); + if (skb->protocol == htons(ETH_P_IP)) icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu)); #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) @@ -1350,11 +1364,13 @@ static void ipoib_cm_skb_reap(struct wor icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu, priv->dev); #endif dev_kfree_skb_any(skb); - spin_lock_irq(&priv->tx_lock); - spin_lock(&priv->lock); + + netif_tx_lock_bh(dev); + spin_lock_irqsave(&priv->lock, flags); } - spin_unlock(&priv->lock); - spin_unlock_irq(&priv->tx_lock); + + spin_unlock_irqrestore(&priv->lock, flags); + netif_tx_unlock_bh(dev); } void ipoib_cm_skb_too_long(struct net_device *dev, struct sk_buff *skb, --- 
a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2008-09-09 15:53:24.856316458 -0700 +++ e/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2008-09-29 13:21:15.017397128 -0700 @@ -457,10 +457,9 @@ static void ipoib_ib_tx_timer_func(unsig { struct net_device *dev = (struct net_device *)dev_ptr; struct ipoib_dev_priv *priv = netdev_priv(dev); - unsigned long flags; unsigned int wrid; - spin_lock_irqsave(&priv->tx_lock, flags); + netif_tx_lock(dev); if (((int)priv->tx_tail - (int)priv->tx_head < 0) && time_after(jiffies, dev->trans_start + 10) && priv->tx_outstanding < ipoib_sendq_size && @@ -479,17 +478,16 @@ static void ipoib_ib_tx_timer_func(unsig } } poll_tx(priv); - spin_unlock_irqrestore(&priv->tx_lock, flags); + netif_tx_unlock(dev); mod_timer(&priv->poll_timer, jiffies + HZ / 2); } static void flush_tx_queue(struct ipoib_dev_priv *priv) { - unsigned long flags; unsigned int wrid; - spin_lock_irqsave(&priv->tx_lock, flags); + netif_tx_lock_bh(priv->dev); wrid = priv->tx_head & (ipoib_sendq_size - 1); priv->tx_ring[wrid].skb = NULL; if (!post_zlen_send_wr(priv, wrid)) { @@ -499,7 +497,7 @@ static void flush_tx_queue(struct ipoib_ ipoib_warn(priv, "post_zlen failed\n"); poll_tx(priv); - spin_unlock_irqrestore(&priv->tx_lock, flags); + netif_tx_unlock_bh(priv->dev); } static inline int post_send(struct ipoib_dev_priv *priv, @@ -645,17 +643,20 @@ static void __ipoib_reap_ah(struct net_d struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_ah *ah, *tah; LIST_HEAD(remove_list); + unsigned long flags; + + netif_tx_lock_bh(dev); + spin_lock_irqsave(&priv->lock, flags); - spin_lock_irq(&priv->tx_lock); - spin_lock(&priv->lock); list_for_each_entry_safe(ah, tah, &priv->dead_ahs, list) if ((int) priv->tx_tail - (int) ah->last_send >= 0) { list_del(&ah->list); ib_destroy_ah(ah->ah); kfree(ah); } - spin_unlock(&priv->lock); - spin_unlock_irq(&priv->tx_lock); + + spin_unlock_irqrestore(&priv->lock, flags); + netif_tx_unlock_bh(dev); } void ipoib_reap_ah(struct work_struct *work) @@ 
-789,6 +790,14 @@ void ipoib_drain_cq(struct net_device *d { struct ipoib_dev_priv *priv = netdev_priv(dev); int i, n; + + /* + * We call completion handling routines that expect to be + * called from the BH-disabled NAPI poll context, so disable + * BHs here too. + */ + local_bh_disable(); + do { n = ib_poll_cq(priv->rcq, IPOIB_NUM_WC, priv->ibwc); for (i = 0; i < n; ++i) { @@ -813,6 +822,8 @@ void ipoib_drain_cq(struct net_device *d } } } while (n == IPOIB_NUM_WC); + + local_bh_enable(); } int ipoib_ib_dev_stop(struct net_device *dev, int flush) @@ -821,7 +832,6 @@ int ipoib_ib_dev_stop(struct net_device struct ib_qp_attr qp_attr; unsigned long begin; int i; - unsigned long flags; int timer_works; timer_works = test_and_clear_bit(IPOIB_FLAG_TIME_ON, &priv->flags); @@ -872,9 +882,9 @@ int ipoib_ib_dev_stop(struct net_device } if ((int) priv->tx_tail - (int) priv->tx_head < 0) { - spin_lock_irqsave(&priv->tx_lock, flags); + netif_tx_lock_bh(dev); poll_tx(priv); - spin_unlock_irqrestore(&priv->tx_lock, flags); + netif_tx_unlock_bh(dev); } ipoib_drain_cq(dev); --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c 2008-09-09 15:53:24.856316458 -0700 +++ e/drivers/infiniband/ulp/ipoib/ipoib_main.c 2008-09-29 10:27:00.046616538 -0700 @@ -357,9 +357,10 @@ void ipoib_flush_paths(struct net_device struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_path *path, *tp; LIST_HEAD(remove_list); + unsigned long flags; - spin_lock_irq(&priv->tx_lock); - spin_lock(&priv->lock); + netif_tx_lock_bh(dev); + spin_lock_irqsave(&priv->lock, flags); list_splice(&priv->path_list, &remove_list); INIT_LIST_HEAD(&priv->path_list); @@ -370,15 +371,16 @@ void ipoib_flush_paths(struct net_device list_for_each_entry_safe(path, tp, &remove_list, list) { if (path->query) ib_sa_cancel_query(path->query_id, path->query); - spin_unlock(&priv->lock); - spin_unlock_irq(&priv->tx_lock); + spin_unlock_irqrestore(&priv->lock, flags); + netif_tx_unlock_bh(dev); wait_for_completion(&path->done); 
path_free(dev, path); - spin_lock_irq(&priv->tx_lock); - spin_lock(&priv->lock); + netif_tx_lock_bh(dev); + spin_lock_irqsave(&priv->lock, flags); } - spin_unlock(&priv->lock); - spin_unlock_irq(&priv->tx_lock); + + spin_unlock_irqrestore(&priv->lock, flags); + netif_tx_unlock_bh(dev); } static void path_rec_completion(int status, @@ -547,6 +549,7 @@ static void neigh_add_path(struct sk_buf struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_path *path; struct ipoib_neigh *neigh; + unsigned long flags; neigh = ipoib_neigh_alloc(skb->dst->neighbour, skb->dev); if (!neigh) { @@ -555,11 +558,7 @@ static void neigh_add_path(struct sk_buf return; } - /* - * We can only be called from ipoib_start_xmit, so we're - * inside tx_lock -- no need to save/restore flags. - */ - spin_lock(&priv->lock); + spin_lock_irqsave(&priv->lock, flags); path = __path_find(dev, skb->dst->neighbour->ha + 4); if (!path) { @@ -606,7 +605,7 @@ static void neigh_add_path(struct sk_buf __skb_queue_tail(&neigh->queue, skb); } - spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->lock, flags); return; err_list: @@ -618,7 +617,7 @@ err_drop: ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); - spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->lock, flags); } static void ipoib_path_lookup(struct sk_buff *skb, struct net_device *dev) @@ -642,12 +641,9 @@ static void unicast_arp_send(struct sk_b { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_path *path; + unsigned long flags; - /* - * We can only be called from ipoib_start_xmit, so we're - * inside tx_lock -- no need to save/restore flags. 
- */ - spin_lock(&priv->lock); + spin_lock_irqsave(&priv->lock, flags); path = __path_find(dev, phdr->hwaddr + 4); if (!path) { @@ -658,7 +654,7 @@ static void unicast_arp_send(struct sk_b __skb_queue_tail(&path->queue, skb); if (path_rec_start(dev, path)) { - spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->lock, flags); path_free(dev, path); return; } else @@ -668,7 +664,7 @@ static void unicast_arp_send(struct sk_b dev_kfree_skb_any(skb); } - spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->lock, flags); return; } @@ -687,7 +683,7 @@ static void unicast_arp_send(struct sk_b dev_kfree_skb_any(skb); } - spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->lock, flags); } static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev) @@ -696,23 +692,10 @@ static int ipoib_start_xmit(struct sk_bu struct ipoib_neigh *neigh; unsigned long flags; - if (unlikely(!spin_trylock_irqsave(&priv->tx_lock, flags))) - return NETDEV_TX_LOCKED; - - /* - * Check if our queue is stopped. Since we have the LLTX bit - * set, we can't rely on netif_stop_queue() preventing our - * xmit function from being called with a full queue. 
- */ - if (unlikely(netif_queue_stopped(dev))) { - spin_unlock_irqrestore(&priv->tx_lock, flags); - return NETDEV_TX_BUSY; - } - if (likely(skb->dst && skb->dst->neighbour)) { if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) { ipoib_path_lookup(skb, dev); - goto out; + return NETDEV_TX_OK; } neigh = *to_ipoib_neigh(skb->dst->neighbour); @@ -722,7 +705,7 @@ static int ipoib_start_xmit(struct sk_bu skb->dst->neighbour->ha + 4, sizeof(union ib_gid))) || (neigh->dev != dev))) { - spin_lock(&priv->lock); + spin_lock_irqsave(&priv->lock, flags); /* * It's safe to call ipoib_put_ah() inside * priv->lock here, because we know that @@ -733,26 +716,26 @@ static int ipoib_start_xmit(struct sk_bu ipoib_put_ah(neigh->ah); list_del(&neigh->list); ipoib_neigh_free(dev, neigh); - spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->lock, flags); ipoib_path_lookup(skb, dev); - goto out; + return NETDEV_TX_OK; } if (ipoib_cm_get(neigh)) { if (ipoib_cm_up(neigh)) { ipoib_cm_send(dev, skb, ipoib_cm_get(neigh)); - goto out; + return NETDEV_TX_OK; } } else if (neigh->ah) { ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(skb->dst->neighbour->ha)); - goto out; + return NETDEV_TX_OK; } if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) { - spin_lock(&priv->lock); + spin_lock_irqsave(&priv->lock, flags); __skb_queue_tail(&neigh->queue, skb); - spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->lock, flags); } else { ++priv->stats.tx_dropped; dev_kfree_skb_any(skb); @@ -781,16 +764,13 @@ static int ipoib_start_xmit(struct sk_bu IPOIB_GID_RAW_ARG(phdr->hwaddr + 4)); dev_kfree_skb_any(skb); ++priv->stats.tx_dropped; - goto out; + return NETDEV_TX_OK; } unicast_arp_send(skb, dev, phdr); } } -out: - spin_unlock_irqrestore(&priv->tx_lock, flags); - return NETDEV_TX_OK; } @@ -1088,7 +1068,7 @@ static void ipoib_setup(struct net_devic dev->addr_len = INFINIBAND_ALEN; dev->type = ARPHRD_INFINIBAND; dev->tx_queue_len = ipoib_sendq_size * 2; - dev->features = 
NETIF_F_VLAN_CHALLENGED | NETIF_F_LLTX; + dev->features = NETIF_F_VLAN_CHALLENGED; memcpy(dev->broadcast, ipv4_bcast_addr, INFINIBAND_ALEN); @@ -1097,7 +1077,6 @@ static void ipoib_setup(struct net_devic priv->dev = dev; spin_lock_init(&priv->lock); - spin_lock_init(&priv->tx_lock); mutex_init(&priv->mcast_mutex); mutex_init(&priv->vlan_mutex); --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2008-09-09 15:53:24.860316701 -0700 +++ e/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2008-09-29 13:34:30.137656855 -0700 @@ -71,14 +71,13 @@ static void ipoib_mcast_free(struct ipoi struct net_device *dev = mcast->dev; struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_neigh *neigh, *tmp; - unsigned long flags; int tx_dropped = 0; ipoib_dbg_mcast(netdev_priv(dev), "deleting multicast group " IPOIB_GID_FMT "\n", IPOIB_GID_ARG(mcast->mcmember.mgid)); - spin_lock_irqsave(&priv->lock, flags); + spin_lock_irq(&priv->lock); list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) { /* @@ -92,7 +91,7 @@ static void ipoib_mcast_free(struct ipoi ipoib_neigh_free(dev, neigh); } - spin_unlock_irqrestore(&priv->lock, flags); + spin_unlock_irq(&priv->lock); if (mcast->ah) ipoib_put_ah(mcast->ah); @@ -102,9 +101,9 @@ static void ipoib_mcast_free(struct ipoi dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); } - spin_lock_irqsave(&priv->tx_lock, flags); + netif_tx_lock_bh(dev); priv->stats.tx_dropped += tx_dropped; - spin_unlock_irqrestore(&priv->tx_lock, flags); + netif_tx_unlock_bh(dev); kfree(mcast); } @@ -259,10 +258,10 @@ static int ipoib_mcast_join_finish(struc } /* actually send any queued packets */ - spin_lock_irq(&priv->tx_lock); + netif_tx_lock_bh(dev); while (!skb_queue_empty(&mcast->pkt_queue)) { struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue); - spin_unlock_irq(&priv->tx_lock); + netif_tx_unlock_bh(dev); skb->dev = dev; @@ -273,9 +272,9 @@ static int ipoib_mcast_join_finish(struc if (dev_queue_xmit(skb)) ipoib_warn(priv, "dev_queue_xmit failed to 
requeue packet\n"); - spin_lock_irq(&priv->tx_lock); + netif_tx_lock_bh(dev); } - spin_unlock_irq(&priv->tx_lock); + netif_tx_unlock_bh(dev); return 0; } @@ -302,12 +301,12 @@ ipoib_mcast_sendonly_join_complete(int s IPOIB_GID_ARG(mcast->mcmember.mgid), status); /* Flush out any queued packets */ - spin_lock_irq(&priv->tx_lock); + netif_tx_lock_bh(dev); while (!skb_queue_empty(&mcast->pkt_queue)) { ++priv->stats.tx_dropped; dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue)); } - spin_unlock_irq(&priv->tx_lock); + netif_tx_unlock_bh(dev); /* Clear the busy flag so we try again */ status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, @@ -520,18 +519,16 @@ static int create_own_ah(struct ipoib_de void destroy_own_ah(struct ipoib_dev_priv *priv) { - unsigned long flags; - if (!priv->own_ah) { ipoib_dbg(priv, "own ah already destroyed\n"); return; } else ipoib_dbg(priv, "destroying own ah\n"); - spin_lock_irqsave(&priv->tx_lock, flags); + netif_tx_lock_bh(priv->dev); ib_destroy_ah(priv->own_ah); priv->own_ah = NULL; - spin_unlock_irqrestore(&priv->tx_lock, flags); + netif_tx_unlock_bh(priv->dev); } void ipoib_mcast_join_task(struct work_struct *work) @@ -687,12 +684,9 @@ void ipoib_mcast_send(struct net_device { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ipoib_mcast *mcast; + unsigned long flags; - /* - * We can only be called from ipoib_start_xmit, so we're - * inside tx_lock -- no need to save/restore flags. 
- */ - spin_lock(&priv->lock); + spin_lock_irqsave(&priv->lock, flags); if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags) || !priv->broadcast || @@ -763,7 +757,7 @@ out: } unlock: - spin_unlock(&priv->lock); + spin_unlock_irqrestore(&priv->lock, flags); } void ipoib_mcast_dev_flush(struct net_device *dev) From tom at opengridcomputing.com Mon Oct 6 11:12:25 2008 From: tom at opengridcomputing.com (Tom Tucker) Date: Mon, 06 Oct 2008 13:12:25 -0500 Subject: [ofa-general] [PATCH 00/03] RDMA Transport Support for 9P In-Reply-To: References: <1223309569-12572-1-git-send-email-tom@opengridcomputing.com> Message-ID: <48EA5509.9@opengridcomputing.com> Roland Dreier wrote: > > This patchset implements an RDMA transport provider for the > > v9fs (Plan 9 filesystem). Could you take a look at it and let us > > know what you think? > > I sent comments on the initial posting I saw on lkml ... did they not > make it to you? > No, I just missed it. Sorry. I just responded to your comments, > > [PATCH 01/03] 9prdma: RDMA Transport Support for 9P > > [PATCH 02/03] 9prdma: Makefile change for the RDMA transport > > [PATCH 03/03] 9prdma: Kconfig changes for the RDMA transport > > one meta-comment I didn't send last time: the patches are small enough > that I would just send it all in one patch, since it makes sense to > apply it that way anyway. > Ok, makes my life easy. > - R. 
From cameron at harr.org Mon Oct 6 12:34:16 2008 From: cameron at harr.org (Cameron Harr) Date: Mon, 06 Oct 2008 13:34:16 -0600 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48EA3706.2080700@vlnb.net> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EA3706.2080700@vlnb.net> Message-ID: <48EA6838.40706@harr.org> Vladislav Bolkhovitin wrote: > Cameron Harr wrote: >> Vlad, >> Thanks for the suggestion. As I look via vmstat, my CSw/s rate is >> fairly constant around 280K when scst_threads=1 (per Vu's suggestion) >> and pops up to ~330-340K CSw/s when scst_threads is set to 8. I'm >> currently doing 512B writes, and this gives me about a 4:1 ratio of >> context switches to IOPs with 1 SCST thread (70K IOPs) and around >> 4.5:1 when there are 8 SCST threads (75K IOPs). > > This is still too high. Considering that each CS is about 1 > microsecond you can estimate how many IOPS's it costs you. Dropping scst_threads down to 2, from 8, with 2 initiators, seems to make a fairly significant difference, propelling me to a little over 100K IOPs and putting the CS rate around 2:1, sometimes lower. 2 threads gave the best performance compared to 1, 4 and 8. From cameron at harr.org Mon Oct 6 13:11:50 2008 From: cameron at harr.org (Cameron Harr) Date: Mon, 06 Oct 2008 14:11:50 -0600 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48E6A372.8000702@mellanox.com> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E6A372.8000702@mellanox.com> Message-ID: <48EA7106.1050601@harr.org> Vu Pham wrote: > Cameron Harr wrote: >> Vu Pham wrote: >>> >>>> Alternatively, is there anything in the SCST layer I should tweak. 
I'm >>>> still running rev 245 of that code (kinda old, but works with OFED >>>> 1.3.1 >>>> w/o hacks). > > With blockio I get the best performance + stability with scst_threads=1 I got best performance with threads=2 or 3, and I've noticed that the srpt_thread is often at 99%, though if I increase/decrease the "thread=?" parameter for ib_srpt, it doesn't seem to make a difference. A second initiator doesn't seem to help much either; with a single initiator writing to two targets, I can now usually get between 95K and 105K IOPs. >>>>> >>>>> My target server (with DAS) contains 8 2.8 GHz CPU cores and can >>>>> sustain over 200K IOPs locally, but only around 73K IOPs over SRP. >>> >>> Is this number from one initiator or multiple? >> One initiator. At first I thought it might be a limitation of the >> SRP, and added a second initiator, but the aggregate performance of >> the two was about equal to that of a single initiator. > > Try again with scst_threads=1. I expect that you can get ~140K with > two initiators > Unfortunately, I'm nowhere close to that high, though I am significantly higher than before. Two initiators do seem to reduce the context switching rate, however, which is good. >>>>> Looking at /proc/interrupts, I see that the mlx_core (comp) device >>>>> is pushing about 135K Int/s on 1 of 2 CPUs. All CPUs are enabled >>>>> for that PCI-E slot, but it only ever uses 2 of the CPUs, and only >>>>> 1 at a time. None of the other CPUs has an interrupt rate more >>>>> than about 40-50K/s. >>> The number of interrupt can be cut down if there are more >>> completions to be processed by sw. ie. please test with multiple QPs >>> between one initiator vs. your target and multiple initiators vs. >>> your target Interrupts are still pretty high (around 160K/s now), but that seems not to be my bottleneck. Context switching seems to be about 2-2.5 for every IOP and sometimes less - not perfect, but not horrible either.
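For anyone wanting to redo the context-switch arithmetic from this thread, here is a rough sketch in C. The function names are made up for illustration, and the per-switch cost is only the ~1 microsecond estimate quoted earlier in the thread, not something measured here.

```c
/*
 * Back-of-the-envelope helpers for the context-switch numbers
 * discussed in this thread. Names are hypothetical; the cost per
 * switch is an assumed input, not a measured value.
 */

/* Context switches per I/O operation; returns 0 when iops is not positive. */
double cs_per_iop(double cs_per_sec, double iops)
{
    return iops > 0.0 ? cs_per_sec / iops : 0.0;
}

/*
 * CPU-seconds spent on context switches per wall-clock second,
 * given an assumed cost per switch in microseconds.
 */
double cs_cpu_fraction(double cs_per_sec, double cost_us)
{
    return cs_per_sec * cost_us / 1e6;
}
```

With the earlier figures of roughly 280K CS/s at 70K IOPs, cs_per_iop(280000, 70000) gives the 4:1 ratio mentioned before, and cs_cpu_fraction(280000, 1.0) puts the switching overhead at a bit over a quarter of one CPU if the 1 microsecond guess holds.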
> > ib_srpt process completions in event callback handler. With more QPs > there are more completions pending per interrupt instead of one > completion event per interrupt. > You can have multiple QPs between initiator vs. target by using > different initiator_id_ext ie. > echo id_ext=xxx,ioc_guid=yyy,....initiator_ext=1 > > /sys/class/infiniband_srp/.../add_target > echo id_ext=xxx,ioc_guid=yyy,....initiator_ext=2 > > /sys/class/infiniband_srp/.../add_target > echo id_ext=xxx,ioc_guid=yyy,....initiator_ext=3 > > /sys/class/infiniband_srp/.../add_target This doesn't seem to net much of an improvement, though I understand the reasoning behind it. My hunch is there's another bottleneck now to look for. Cameron From cameron at harr.org Mon Oct 6 15:00:47 2008 From: cameron at harr.org (Cameron Harr) Date: Mon, 06 Oct 2008 16:00:47 -0600 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48EA6838.40706@harr.org> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EA3706.2080700@vlnb.net> <48EA6838.40706@harr.org> Message-ID: <48EA8A8F.2000301@harr.org> Cameron Harr wrote: >> This is still too high. Considering that each CS is about 1 >> microsecond you can estimate how many IOPS's it costs you. > > Dropping scst_threads down to 2, from 8, with 2 initiators, seems to > make a fairly significant difference, propelling me to a little over > 100K IOPs and putting the CS rate around 2:1, sometimes lower. 2 > threads gave the best performance compared to 1, 4 and 8. Just as a status update, I've gotten my best performance with scst_threads=3 on 2 initiators, and using a separate QP for each drive an initiator is writing to. 
I'm getting pretty consistent 112-115K IOPs using two initiators, each writing with 2 processes to the same 2 physical targets, using 512B blocks. Adding the second initiator only bumps me up by about 20K IOPs, but as all the CPUs are pegged around 99%, I'll take that as a bottleneck. Also, as a note from Vlad's advice, the CS rate is now around 70K/s on 115K IOPs, so it's not too bad. Interrupts (where this thread started), are around 200K/s - a lot higher than I thought they'd go, but I'm not complaining. :) Thanks for the help. Cameron From kliteyn at dev.mellanox.co.il Mon Oct 6 16:09:46 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 07 Oct 2008 01:09:46 +0200 Subject: [ofa-general] [PATCH 0/6] opensm: Unicast Routing Cache In-Reply-To: References: <48E96928.8030200@dev.mellanox.co.il> Message-ID: <48EA9ABA.6010509@dev.mellanox.co.il> Hi Hal, Hal Rosenstock wrote: > Hi Yevgeny, > > On Sun, Oct 5, 2008 at 9:26 PM, Yevgeny Kliteynik > wrote: >> Hi Sasha, >> >> The following series of 6 patches implements unicast routing cache >> in OpenSM. >> >> This implementation (v2, previous version was sent before OFED 1.3) >> was rewritten from scratch: >> - no caching of existing connectivity >> - no caching of existing lid matrices >> - each switch has an LFT buffer that contains the result of >> the last routing engine execution (instead of one buffer >> in ucast_mgr) >> - links/ports/nodes changes are spotted during the discovery >> - only the links/ports/nodes that went down are cached >> - when switch goes down, caching its lid matrices and LFT >> >> In one of the following cases we can use cached routing >> - there is no topology change >> - one or more CAs disappeared >> - one or more leaf switches disappeared >> In these cases cached routing is written to the switches as is >> (unless the switch doesn't exist). >> If there is any other topology change, existing cache is invalidated >> and the routing engine(s) run as usual. 
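The decision rule in the quoted summary above can be sketched roughly as follows. This only illustrates the stated conditions and is not the code from the patch series: the `TopologyChange` and `CacheDecision` types and the `decideRouting` function are hypothetical names invented here, and the sketch assumes the changes have already been classified, whereas the patches spot them during discovery.

```cpp
#include <vector>

// Hypothetical classification of one change seen during a sweep.
enum class TopologyChange {
    CaDisappeared,
    LeafSwitchDisappeared,
    Other   // any change not listed in the summary
};

enum class CacheDecision { UseCachedRouting, InvalidateAndReroute };

// Cached routing stays usable when nothing changed, or when every
// change is a CA or a leaf switch going away; any other change
// invalidates the cache and the routing engine runs as usual.
CacheDecision decideRouting(const std::vector<TopologyChange>& changes) {
    for (TopologyChange c : changes) {
        if (c == TopologyChange::Other)
            return CacheDecision::InvalidateAndReroute;
    }
    return CacheDecision::UseCachedRouting;
}
```

An empty change list corresponds to the "no topology change" case and returns `UseCachedRouting`.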
> > Glad to see this! > > A few comments/questions: > > It seems that there is a LFT cache per switch. This seems to be a big > memory penalty to me (in large subnets). So I have two questions > related to this: > Can this only be done this way when cached routing is being used ? Actually, I was thinking about something else: Currently we have switch LFT implemented as osm_fwd_tbl_t. I can remove the unnecessary complexity of the osm_fwd_tbl_t by replacing it with a simple uint8_t array (same as LFT buffer). Then by simple comparison I will check whether the recently calculated LFT matches the switch's LFT, and if there is a match, then lft_buf can be freed. In this case only the switches that have LFT different from the recently calculated LFT will have both tables, which would be rare and temporary - on the next heavy sweep the LFTs would match, and lft_buf would be freed. Effectively, it won't have memory penalty. It can be done in a separate patch. > Also, when cached routing is being used, is this only needed for leaf switches ? No, it is needed for all the switches, because cache can also handle non-leaf switch fast reset. > I'm wondering when there is a cached node match whether the available > peer ports/neighbors are validated (or something equivalent) to know > caching is valid ? It might also include whether a switch is still a > leaf switch (which may be redundant as that should show up as a peer > port/neighbor change). It looks like the structure is there for this > but I didn't review the code in detail. If I understood your question correctly, then yes, such validation is done by osm_ucast_cache_validate() function. Can you describe in more details the case that you are asking about? > Are you sure all the memory allocation failures are handled properly > within the routing cache code ? What I mean is that NULL is returned > and does this always result in a caching not used/routing recalculated > ? 
Also, in that case, should some log message be indicated rather than > hiding this ? I will check it. > Nit: doc/current-routing.txt should also be updated for this feature. OK, separate patch. -- Yevgeny > -- Hal > >> The patches are: >> - patch 1/6: move lft_buf from ucast_mgr to osm_switch >> - patch 2/6: Add "-A" or "--ucast_cache" option to opensm >> - patch 3/6: adding osm_ucast_cache.{c,h} files (this is >> the cache implementation itself) >> - patch 4/6: adding new cache files to makefile >> - patch 5/6: integrating unicast cache into the discovery >> and ucast manager >> - patch 6/6: man entry for cached routing >> >> -- Yevgeny >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> > From vst at vlnb.net Tue Oct 7 02:16:39 2008 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Tue, 07 Oct 2008 13:16:39 +0400 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48EA8A8F.2000301@harr.org> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EA3706.2080700@vlnb.net> <48EA6838.40706@harr.org> <48EA8A8F.2000301@harr.org> Message-ID: <48EB28F7.7070301@vlnb.net> Cameron Harr wrote: > Cameron Harr wrote: >>> This is still too high. Considering that each CS is about 1 >>> microsecond you can estimate how many IOPS's it costs you. >> Dropping scst_threads down to 2, from 8, with 2 initiators, seems to >> make a fairly significant difference, propelling me to a little over >> 100K IOPs and putting the CS rate around 2:1, sometimes lower. 2 >> threads gave the best performance compared to 1, 4 and 8. 
> > Just as a status update, I've gotten my best performance with > scst_threads=3 on 2 initiators, and using a separate QP for each drive > an initiator is writing to. I'm getting pretty consistent 112-115K IOPs > using two initiators, each writing with 2 processes to the same 2 > physical targets, using 512B blocks. Adding the second initiator only > bumps me up by about 20K IOPs, but as all the CPUs are pegged around > 99%, I'll take that as a bottleneck. Also, as a note from Vlad's advice, > the CS rate is now around 70K/s on 115K IOPs, so it's not too bad. > Interrupts (where this thread started), are around 200K/s - a lot higher > than I thought they'd go, but I'm not complaining. :) Actually, what you did was tune your workload so that it fits nicely onto all the participating threads and CPU cores: each thread stays on its own CPU core, and the threads pass commands to one another gracefully during processing while staying busy almost all the time. I.e., you put your system into some kind of resonance. If you change your workload just a bit, or if the Linux scheduler changes in the next kernel version, your tuning will be destroyed. So I wouldn't overestimate your results. As I already wrote, the only real fix is to remove all the unneeded context switches between threads during command processing. This fix would work not only on carefully tuned artificial workloads, but on real-life ones too. Having 5-10 threads participate in processing a single command reminds me of the famous series of jokes about how many people of some kind are needed to change a burnt-out light bulb ;) > Thanks for the help.
> Cameron From kliteyn at dev.mellanox.co.il Tue Oct 7 02:31:38 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 07 Oct 2008 11:31:38 +0200 Subject: [ofa-general] [PATCH] ibutils: mkey handling basic modeling Message-ID: <48EB2C7A.5030200@dev.mellanox.co.il> [On behalf of Vladimir Zdornov] A basic implementation of mkey mechanism. All nodes share a common timer which decrements the counters on the registered nodes every 1 sec. Signed-off-by: Vladimir Zdornov --- ibmgtsim/src/sma.cpp | 151 ++++++++++++++++++++++++++++++++++++++++++++++++++ ibmgtsim/src/sma.h | 53 +++++++++++++++++ 2 files changed, 204 insertions(+), 0 deletions(-) diff --git a/ibmgtsim/src/sma.cpp b/ibmgtsim/src/sma.cpp index 446ec56..ea82d77 100644 --- a/ibmgtsim/src/sma.cpp +++ b/ibmgtsim/src/sma.cpp @@ -45,6 +45,11 @@ * *********/ +#include +#include +#include +#include + #include "msgmgr.h" #include "simmsg.h" #include "sim.h" @@ -89,6 +94,85 @@ ibms_dump_mad( } } +// CLASS SMATimer + +SMATimer::SMATimer(int t):time(t) +{ + pthread_mutex_init(&timerMutex, NULL); + tid = pthread_create(&th, NULL, SMATimer::timerRun, this); + if (tid) { + MSGREG(err0, 'E', "Couldn't create timer thread $ !", "processMad"); + MSGSND(err0, tid); + } + MSGREG(inf1, 'V', "SMA timer on!", "processMad"); + MSGSND(inf1); +} + +void SMATimer::terminate() +{ + pthread_cancel(th); +} + +void* SMATimer::timerRun(void* p) +{ + SMATimer* p_timer = (SMATimer*) p; + while(1) { + MSGREG(inf1, 'V', "Sleeping for $ secs !", "processMad"); + MSGSND(inf1,p_timer->time); + + sleep(p_timer->time); + pthread_mutex_lock(&p_timer->timerMutex); + unsigned i=0; + while (i<p_timer->L.size()) { + int res = p_timer->L[i].f(p_timer->L[i].data); + if (!res) +
p_timer->L.erase(p_timer->L.begin()+i); + else + i++; + } + pthread_mutex_unlock(&p_timer->timerMutex); + } +} + +void SMATimer::reg (reg_t r) +{ + pthread_mutex_lock(&timerMutex); + L.push_back(r); + pthread_mutex_unlock(&timerMutex); +} + +void SMATimer::unreg (void* data) +{ + pthread_mutex_lock(&timerMutex); + for (unsigned i=0;imut); + pT->counter--; + if (pT->counter > 0) { + pthread_mutex_unlock(&pT->mut); + return 1; + } + // Need to zero m_key + pT->pInfo->m_key = 0; + pT->timerOn = 0; + pthread_mutex_unlock(&pT->mut); + + return 0; +} + void IBMSSma::initSwitchInfo() { IBNode *pNodeData; @@ -392,6 +476,14 @@ IBMSSma::IBMSSma(IBMSNode *pSNode, list_uint16 mgtClasses) : initPortInfo(); //Init VL Arbitration ports vector size and table pSimNode->vlArbPortEntry.resize(pSimNode->nodeInfo.num_ports + 1); + //Init ports' timing vector + vPT.resize(pSimNode->nodeInfo.num_ports + 1); + for (unsigned i=0;inodeInfo.node_type == IB_NODE_TYPE_SWITCH) || i!=0) { + vPT[i].pInfo = &(pSimNode->nodePortsInfo[i]); + pthread_mutex_init(&vPT[i].mut, NULL); + vPT[i].timerOn = 0; + } initVlArbitTable(); initPKeyTables(); @@ -1482,6 +1574,7 @@ int IBMSSma::madValidation(ibms_mad_msg_t &madMsg) int IBMSSma::processMad(uint8_t inPort, ibms_mad_msg_t &madMsg) { + MSG_ENTER_FUNC; ibms_mad_msg_t respMadMsg; @@ -1519,6 +1612,64 @@ int IBMSSma::processMad(uint8_t inPort, ibms_mad_msg_t &madMsg) memcpy(pRespSmp, pReqSmp, sizeof(ib_smp_t)); } + // perform m_key check + unsigned mPort = inPort; + if (pSimNode->nodeInfo.node_type == IB_NODE_TYPE_SWITCH) + mPort = 0; + + ib_net64_t m_key1 = ((ib_smp_t*)(&madMsg.header))->m_key; + + pthread_mutex_lock(&vPT[mPort].mut); + ib_net64_t m_key2 = vPT[mPort].pInfo->m_key; + + MSGREG(inf21, 'I', "Mkeys current: $, received: $!", "processMad"); + MSGSND(inf21, cl_ntoh64(m_key2), cl_ntoh64(m_key1)); + + if (m_key2 && (m_key1 != m_key2) && madMsg.header.method == IB_MAD_METHOD_SET) { + MSGREG(inf22, 'I', "Mkeys mismatch!", "processMad"); + 
MSGSND(inf22); + + // Timer already running + if (vPT[mPort].timerOn) { + MSGREG(inf2, 'I', "Timer already on, counter: $!", "processMad"); + MSGSND(inf2,vPT[mPort].counter); + + MSG_EXIT_FUNC; + pthread_mutex_unlock(&vPT[mPort].mut); + return 0; + } + // Start the timer + else { + vPT[mPort].counter = cl_ntoh16(vPT[mPort].pInfo->m_key_lease_period); + vPT[mPort].timerOn = 1; + reg_t tmp; + tmp.f = cbMkey; + tmp.data = &vPT[mPort]; + + MSGREG(inf2, 'I', "Starting timer with counter $!", "processMad"); + MSGSND(inf2,vPT[mPort].counter); + + pthread_mutex_unlock(&vPT[mPort].mut); + mkeyTimer.reg(tmp); + + + MSG_EXIT_FUNC; + + return 0; + } + } + + if ((m_key1 == m_key2) && (vPT[mPort].timerOn)) { + MSGREG(inf2, 'I', "Stopping timer!", "processMad"); + MSGSND(inf2); + + vPT[mPort].timerOn = 0; + pthread_mutex_unlock(&vPT[mPort].mut); + mkeyTimer.unreg(&vPT[mPort]); + } + + pthread_mutex_unlock(&vPT[mPort].mut); + switch (attributeId) { case IB_MAD_ATTR_NODE_DESC: MSGREG(inf5, 'I', "Attribute being handled is $ !", "processMad"); diff --git a/ibmgtsim/src/sma.h b/ibmgtsim/src/sma.h index 4241ca0..c222175 100644 --- a/ibmgtsim/src/sma.h +++ b/ibmgtsim/src/sma.h @@ -47,6 +47,9 @@ #ifndef SMA_H #define SMA_H +#include +#include +#include #include #include "simmsg.h" #include "server.h" @@ -83,8 +86,58 @@ ibms_dump_mad( const ibms_mad_msg_t &madMsg, const uint8_t dir); * *********/ +// Returns 0 if entry should be removed and 1 otherwise +typedef int (*cbFunc)(void*); + +typedef struct reg_ +{ + cbFunc f; + void* data; +} reg_t; + +class SMATimer +{ + // Mutex of the timer + pthread_mutex_t timerMutex; + // Timer thread function + static void* timerRun(void* p); + // Thread id + int tid; + // Thread + pthread_t th; + // Registered objects list + vector L; + // Sleep time + int time; + + public: + SMATimer(int time); + void terminate(); + // Timer registration function + void reg(reg_t r); + // Removes registered object identified by POINTER + void unreg(void* data); +}; + 
+#define T_FREQ 1 + +typedef struct portTiming_ +{ + ib_port_info_t* pInfo; + unsigned counter; + int timerOn; + pthread_mutex_t mut; +} portTiming; + class IBMSSma : IBMSMadProcessor { + // m_key callback function + static int cbMkey(void* data); + // M_key timer + static SMATimer mkeyTimer; + + vector vPT; + /* init functions of node structures */ void initSwitchInfo(); void initNodeInfo(); -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Tue Oct 7 02:31:17 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 07 Oct 2008 11:31:17 +0200 Subject: [ofa-general] [PATCH] ibutils: Enhanced credit loop analysis Message-ID: <48EB2C65.5040708@dev.mellanox.co.il> [On behalf of Vladimir Zdornov] Enhanced credit loop analysis. Multiple VL/SL support added. Loops are detected using DFS. Signed-off-by: Vladimir Zdornov --- ibdm/datamodel/CredLoops.cpp | 643 ++++++++++++------------------------------ ibdm/datamodel/Fabric.cpp | 214 ++++++++++++++ ibdm/datamodel/Fabric.h | 104 +++++++- ibdm/src/osm_check.cpp | 214 +++++++++------ 4 files changed, 619 insertions(+), 556 deletions(-) diff --git a/ibdm/datamodel/CredLoops.cpp b/ibdm/datamodel/CredLoops.cpp index 30cff4b..70e5879 100644 --- a/ibdm/datamodel/CredLoops.cpp +++ b/ibdm/datamodel/CredLoops.cpp @@ -46,55 +46,67 @@ * */ -#define RT_NOT_USED 0 -#define RT_USED 1 -#define RT_VISITED 2 -#define RT_LOOP_TRACED 4 - ////////////////////////////////////////////////////////////////////////////// -// Allocate the tables on the switches -int -CrdLoopInitRtTbls(IBFabric *p_fabric) { - IBNode *p_node; - - // Go over all SW nodes in the fabric and build a table - // of input to output ports links. Each element should track - // effect and traversal flags. 
- for( map_str_pnode::iterator nI = p_fabric->NodeByName.begin(); - nI != p_fabric->NodeByName.end(); - nI++) { +// Apply DFS on a dependency graph - p_node = (*nI).second; - if (p_node->type != IB_SW_NODE) continue; +int CrdLoopDFS(VChannel* ch) { + // Already been there + if (ch->getFlag() == Closed) + return 0; + // Credit loop + if (ch->getFlag() == Open) { + return 1; + } + // Mark as open + ch->setFlag(Open); + // Make recursive steps + for (int i=0; i<ch->getDependSize();i++) { + VChannel* next = ch->getDependency(i); + if (next) { + if (CrdLoopDFS(next)) + return 1; + } + } + // Mark as closed + ch->setFlag(Closed); + return 0; +} - uint8_t *p_tbl = - new uint8_t[p_node->numPorts*p_node->numPorts]; +////////////////////////////////////////////////////////////////////////////// - memset(p_tbl, RT_NOT_USED, - sizeof(uint8_t)*p_node->numPorts*p_node->numPorts); +// Go over CA's apply DFS on the dependency graphs starting from CA's port - if (! p_tbl) { - cout << "-F- Fail to allocate memory for port routing table" << endl; - exit(2); - } +int CrdLoopFindLoops(IBFabric* p_fabric) { + unsigned int lidStep = 1 << p_fabric->lmc; - // We use the appData1 of the node to store the routing links - // info - if (p_node->appData1.ptr) { - cout << "-W- Application Data Pointer already set for node:" - << p_node->name << endl; - delete [] p_tbl; - } else { - p_node->appData1.ptr = (void *)p_tbl; + // go over all CA ports in the fabric + for (int i = p_fabric->minLid; i <= p_fabric->maxLid; i += lidStep) { + IBPort *p_Port = p_fabric->PortByLid[i]; + if (!p_Port || (p_Port->p_node->type == IB_SW_NODE)) continue; + // Go over all CA's channels and find untouched one + for (int j=0;j < p_fabric->getNumSLs(); j++) { + dfs_t state = p_Port->channels[j]->getFlag(); + if (state == Open) { + cout << "-E- open channel outside of DFS" << endl; + return 1; + } + // Already processed, continue + if (state == Closed) + continue; + // Found starting point + if
(CrdLoopDFS(p_Port->channels[j])) + return 1; } } return 0; } + ////////////////////////////////////////////////////////////////////////////// // Trace a route from slid to dlid by LFT +// Add dependency edges int CrdLoopMarkRouteByLFT ( IBFabric *p_fabric, unsigned int sLid , unsigned int dLid @@ -102,99 +114,97 @@ int CrdLoopMarkRouteByLFT ( IBPort *p_port = p_fabric->getPortByLid(sLid); IBNode *p_node; - IBPort *p_remotePort; + IBPort *p_portNext; unsigned int lidStep = 1 << p_fabric->lmc; - int inPortNum = 0, outPortNum = 0; - uint8_t *p_tbl; - int hopCnt = 0; + int outPortNum = 0, inputPortNum = 0, hopCnt = 0; + bool done; // make sure: - if (! p_port) { + if (!p_port) { cout << "-E- Provided source:" << sLid << " lid is not mapped to a port!" << endl; return(1); } - // if the port is not a switch - go to the next switch: - if (p_port->p_node->type != IB_SW_NODE) { - // try the next one: - if (!p_port->p_remotePort) { - cout << "-E- Provided starting point is not connected !" - << "lid:" << sLid << endl; - return 1; - } - inPortNum = p_port->p_remotePort->num; - p_node = p_port->p_remotePort->p_node; - } else { - // it is a switch : - p_node = p_port->p_node; - } + // Retrieve the relevant SL + uint8_t SL, VL; + SL = VL = p_port->p_node->getPSLForLid(dLid); + if (!p_port->p_remotePort) { + cout << "-E- Provided starting point is not connected !" + << "lid:" << sLid << endl; + return 1; + } - // verify we are finally of a switch: - if (p_node->type != IB_SW_NODE) { - cout << "-E- Provided starting point is not connected to a switch !" - << "lid:" << sLid << endl; - return 1; + if (SL == IB_SLT_UNASSIGNED) { + cout << "-E- SL to destination is unassigned !" 
+ << "slid: " << sLid << "dlid:" << dLid << endl; + return 1; } - // traverse: - int done = 0; + // check if we are done: + done = ((p_port->p_remotePort->base_lid <= dLid) && + (p_port->p_remotePort->base_lid+lidStep - 1 >= dLid)); while (!done) { - // calc next node: - outPortNum = p_node->getLFTPortForLid(dLid); - if (outPortNum == IB_LFT_UNASSIGNED) { - cout << "-E- Unassigned LFT for lid:" << dLid << " Dead end at:" << p_node->name << endl; - return 1; - } - - // get the port on the other side - p_port = p_node->getPort(outPortNum); - - if (! (p_port && - p_port->p_remotePort && - p_port->p_remotePort->p_node)) { - cout << "-E- Dead end at:" << p_node->name << endl; - return 1; - } - - // Track it please: - p_tbl = (uint8_t *)p_node->appData1.ptr; - if (! p_tbl) { - cout << "-F- Got a non initialized routing table pointer!" << endl; - exit(2); - } - - // cout << "-V- Add usage Node:" << p_node->name - // << " In:" << inPortNum << " to:" << outPortNum << endl; - - p_tbl[(inPortNum - 1)*p_node->numPorts + outPortNum - 1] = RT_USED; + // Get the node on the remote side + p_node = p_port->p_remotePort->p_node; + // Get remote port's number + inputPortNum = p_port->p_remotePort->num; + // Get number of ports on the remote side + int numPorts = p_node->numPorts; + // Init vchannel's number of possible dependencies + p_port->channels[VL]->setDependSize((numPorts+1)*p_fabric->getNumVLs()); + + // Get port num of the next hop + outPortNum = p_node->getLFTPortForLid(dLid); + // Get VL of the next hop + int nextVL = p_node->getSLVL(inputPortNum,outPortNum,SL); + + if (outPortNum == IB_LFT_UNASSIGNED) { + cout << "-E- Unassigned LFT for lid:" << dLid << " Dead end at:" << p_node->name << endl; + return 1; + } - p_remotePort = p_port->p_remotePort; - inPortNum = p_remotePort->num; + if (nextVL == IB_SLT_UNASSIGNED) { + cout << "-E- Unassigned SL2VL entry, iport: "<< inputPortNum<<", oport:"<base_lid <= dLid) && - (p_remotePort->base_lid+lidStep - 1 >= dLid)); + // get the 
next port on the other side + p_portNext = p_node->getPort(outPortNum); - p_node = p_remotePort->p_node; - if (hopCnt++ > 256) { - cout << "-E- Aborting after 256 hops - loop in LFT?" << endl; - return 1; - } + if (! (p_portNext && + p_portNext->p_remotePort && + p_portNext->p_remotePort->p_node)) { + cout << "-E- Dead end at:" << p_node->name << endl; + return 1; + } + // Now add an edge + p_port->channels[VL]->setDependency(outPortNum*p_fabric->getNumVLs()+nextVL,p_portNext->channels[nextVL]); + // Advance + p_port = p_portNext; + VL = nextVL; + if (hopCnt++ > 256) { + cout << "-E- Aborting after 256 hops - loop in LFT?" << endl; + return 1; + } + //Check if done + done = ((p_port->p_remotePort->base_lid <= dLid) && + (p_port->p_remotePort->base_lid+lidStep - 1 >= dLid)); } return 0; } -////////////////////////////////////////////////////////////////////////////// +///////////////////////////////////////////////////////////////////////////// + +// Go over all CA to CA paths and connect dependant vchannel by an edge -// Go over all CA to CA paths and mark the output links -// input links connections on these paths in the routing tables. int -CrdLoopPopulateRtTbls(IBFabric *p_fabric) { +CrdLoopConnectDepend(IBFabric* p_fabric) +{ unsigned int lidStep = 1 << p_fabric->lmc; - int anyError = 0, paths = 0; + int anyError = 0; unsigned int i,j; // go over all ports in the fabric @@ -215,22 +225,20 @@ CrdLoopPopulateRtTbls(IBFabric *p_fabric) { if (! p_dstPort) continue; if (p_dstPort->p_node->type == IB_SW_NODE) continue; - unsigned int dLid = p_dstPort->base_lid; - // go over all LMC combinations: - for (unsigned int l = 0; l < lidStep; l++) { - paths++; - - // Trace the path but record the input to output ports used. 
- if (CrdLoopMarkRouteByLFT(p_fabric, sLid + l, dLid + l)) { - cout << "-E- Fail to find a path from:" - << p_srcPort->p_node->name << "/" << p_srcPort->num - << " to:" << p_dstPort->p_node->name << "/" << p_dstPort->num - << endl; - anyError++; - } - } // all LMC lids + for (unsigned int l1 = 0; l1 < lidStep; l1++) { + for (unsigned int l2 = 0; l2 < lidStep; l2++) { + // Trace the path but record the input to output ports used. + if (CrdLoopMarkRouteByLFT(p_fabric, sLid + l1, dLid + l2)) { + cout << "-E- Fail to find a path from:" + << p_srcPort->p_node->name << "/" << p_srcPort->num + << " to:" << p_dstPort->p_node->name << "/" << p_dstPort->num + << endl; + anyError++; + } + }// all LMC lids 2 */ + } // all LMC lids 1 */ } // all targets } // all sources @@ -239,316 +247,34 @@ CrdLoopPopulateRtTbls(IBFabric *p_fabric) { return 1; } - cout << "-I- Marked " << paths << " CA to CA Paths" << endl; return 0; } ////////////////////////////////////////////////////////////////////////////// -// BFS from all CA's and require all inputs for an -// output node to be marked visited to go through it. -int -CrdLoopBfsFromCAs(IBFabric *p_fabric) { - int loops = 0; - list< IBPort *> thisStepPorts, nextStepPorts; - - // go over all CA nodes and track the input ports for next step - IBNode *p_node; - IBPort *p_port; - - for( map_str_pnode::iterator nI = p_fabric->NodeByName.begin(); - nI != p_fabric->NodeByName.end(); - nI++) { - - p_node = (*nI).second; - if (p_node->type != IB_CA_NODE) continue; - - // get the remote input port - for (unsigned int pn = 1; pn <= p_node->numPorts; pn++) { - p_port = p_node->getPort(pn); - - if (p_port && p_port->p_remotePort) { - // add to the list - thisStepPorts.push_back(p_port->p_remotePort); - } - } - } - - // while you have next step ports - while ( ! thisStepPorts.empty()) { - loops++; - - nextStepPorts.clear(); - - // go over all this step ports - while (! 
thisStepPorts.empty()) { - p_port = thisStepPorts.front(); - thisStepPorts.pop_front(); - - p_node = p_port->p_node ; - - if (p_node->type != IB_SW_NODE) continue; - - uint8_t *p_tbl = (uint8_t *)p_node->appData1.ptr; - int inPortNum = p_port->num; - - // go over all the out ports marked by this input port: - for (unsigned int outPortNum = 1; - outPortNum <= p_node->numPorts; outPortNum++) { - int idx = (inPortNum - 1)*p_node->numPorts + outPortNum - 1; - // check if port was marked as used: - if (p_tbl[idx] == RT_USED) { - // zero the port USED: - p_tbl[idx] = (RT_USED | RT_VISITED); - - // now check if all the effecting ports are cleard: - int foundUnVisited = 0; - for (unsigned int pn = 0; !foundUnVisited && (pn < p_node->numPorts); - pn++) { - idx = pn*p_node->numPorts + outPortNum - 1; - if (p_tbl[idx] == RT_USED) foundUnVisited = 1; - } - - // only when we do not have a marked but not visited we - // can progress to next port: - if (!foundUnVisited) { - // we only add ports if the are now unvisited: - IBPort *p_oPort = p_node->getPort(outPortNum); - if (p_oPort && p_oPort->p_remotePort) { - nextStepPorts.push_back(p_oPort->p_remotePort); - } - } - } - } - } // all this step ports - - // Copy next step ports to cur ports: - thisStepPorts = nextStepPorts; - } - - cout << "-I- Propagted ranking through Fabric in:" - << loops << " BFS steps" << endl; - return 0; -} - -////////////////////////////////////////////////////////////////////////////// - -// Dump Routing Tables: -int -CrdLoopDumpRtTbls(IBFabric *p_fabric) { - // go over all switches in the fabric - IBNode *p_node; - - // Go over all SW nodes in the fabric and build a table - // of input to output ports links. Each element should track - // effect and traversal flags. 
- for( map_str_pnode::iterator nI = p_fabric->NodeByName.begin(); - nI != p_fabric->NodeByName.end(); - nI++) { - - p_node = (*nI).second; - if (p_node->type != IB_SW_NODE) continue; - - cout << "---- RT TBL DUMP -----" << endl; - cout << "SW:" << p_node->name << endl; - - uint8_t *p_tbl = (uint8_t*)p_node->appData1.ptr; - - // header - cout << "I\\O "; - for (unsigned int outPortNum = 1; outPortNum <= p_node->numPorts; - outPortNum++) - cout << setw(3) << outPortNum << " "; - cout << endl; - - // Now go over all out ports and check all input port. - for (unsigned int inPortNum = 1; inPortNum <= p_node->numPorts; - inPortNum++) { - cout << setw(3) << inPortNum << " "; - // go over all the out ports marked by this input port: - for (unsigned int outPortNum = 1; outPortNum <= p_node->numPorts; - outPortNum++) { - int idx = (inPortNum - 1)*p_node->numPorts + outPortNum - 1; - if (p_tbl[idx] == RT_USED) - cout << setw(3) << "USE "; - else if (p_tbl[idx] == (RT_USED | RT_VISITED)) - cout << setw(3) << "VIS "; - else { - cout << setw(3) << " "; - } - } - cout << endl; - } - } - return(0); -} - -////////////////////////////////////////////////////////////////////////////// - -// Trace a loop through a given node ports pair -// We DFS fowrard and report all nodes of all the loops found. -int -CrdLoopTraceLoop(IBFabric *p_fabric, - IBNode *p_endNode, - int inPortNum, - IBNode *p_startNode, - int outPortNum, - string path = string(""), - int hops = 0, - int doNotPrintPath = 0 - ) { - - // find the other end of the link if any - IBPort *p_port = p_startNode->getPort(outPortNum); - - // we need to have a port and remote port - if (! 
p_port || !p_port->p_remotePort) return 0; - - IBNode *p_remNode = p_port->p_remotePort->p_node; - - // we never go through CAs - if (p_remNode->type != IB_SW_NODE) return 0; - - uint8_t *p_tbl = (uint8_t*)p_remNode->appData1.ptr; - - // if it is the target end node and port - if (p_remNode == p_endNode && p_port->p_remotePort->num == inPortNum) { - // print the path - cout << "--------------------------------------------" << endl; - cout << "-E- Found a credit loop on:" << p_endNode->name - << " from port:" << inPortNum << " to port:" - << outPortNum << endl; - if (! doNotPrintPath) { - cout << path << endl; - cout << p_endNode->name << " " << inPortNum << endl; - } - return(1); - } else { - // track the number of downwards paths found. - int numPaths = 0; - static char buf[128]; - - // we will track where we come from - sprintf(buf, "%s %u -> ", - p_remNode->name.c_str(),p_port->p_remotePort->num); - - // it is possible we already visited this node since we trace a - // loop that is different then our own. - if (path.find(buf) != string::npos) { - if (! doNotPrintPath) - cout << "-W- Marking a 'scroll' side loop at:" - << p_remNode->name << "/" << p_port->p_remotePort->num << endl; - - // to avoid going into this scroll again - // we encode a return code that should mark the - // path as a scroll: - return -1; - } - - // abort if hops count is bigger then 1000 - if (hops > 1000) { - if (! 
doNotPrintPath) - cout << "-W- Aborting path:" << path << endl; - return 0; - } - - // add yourself to the path - string fwdPath = path + string("\n") + string(buf); - - // go over all out ports not aleady marked routed from this in port - for (unsigned int pn = 1; pn <= p_remNode->numPorts; pn++) { - int idx = (p_port->p_remotePort->num - 1)*p_remNode->numPorts + pn - 1; - - // do we have a used but not visited connection: - if (p_tbl[idx] == RT_USED) { - // traverse forward - sprintf(buf, "%u", pn); - int foundPaths = - CrdLoopTraceLoop(p_fabric, p_endNode, inPortNum, - p_remNode, pn, fwdPath + string(buf), hops++, - doNotPrintPath); - - // we might have encountered a scroll (return value < 0) - // so we sould ignore it in the global count. - if (foundPaths > 0) numPaths += foundPaths; - - // if found a loop or a scroll downwards mark the local port pair. - if (foundPaths) { - p_tbl[idx] = RT_LOOP_TRACED & RT_USED; - } - } - } - return(numPaths); - } -} - -////////////////////////////////////////////////////////////////////////////// - -// Report all Switch ports that are still marked as not -// fully visited. -int -CrdLoopReportLoops(IBFabric *p_fabric, int doNotPrintPath) { - int anyError = 0; - - // go over all switches in the fabric looking for used link to link - // that was not marked as visited. - IBNode *p_node; - - // Go over all SW nodes in the fabric and build a table - // of input to output ports links. Each element should track - // effect and traversal flags. - for( map_str_pnode::iterator nI = p_fabric->NodeByName.begin(); - nI != p_fabric->NodeByName.end(); - nI++) { - - p_node = (*nI).second; - if (p_node->type != IB_SW_NODE) continue; - - uint8_t *p_tbl = (uint8_t*)p_node->appData1.ptr; - - // Now go over all out ports and check all input port. 
- for (unsigned int inPortNum = 1; inPortNum <= p_node->numPorts; - inPortNum++) { - // go over all the out ports marked by this input port: - for (unsigned int outPortNum = 1; outPortNum <= p_node->numPorts; - outPortNum++) { - int idx = (inPortNum - 1)*p_node->numPorts + outPortNum - 1; - - if (p_tbl[idx] == RT_USED) { - char buf[16]; - sprintf(buf, " %u", outPortNum); - int loops = CrdLoopTraceLoop(p_fabric, p_node, inPortNum, - p_node, outPortNum, - p_node->name + string(buf), - 0, doNotPrintPath - ); - anyError += loops; - } - } - } - } - if (anyError) cout << "--------------------------------------" << endl; - return anyError; -} - -////////////////////////////////////////////////////////////////////////////// - // Prepare the data model int CrdLoopPrepare(IBFabric *p_fabric) { - IBNode *p_node; - - // Go over all SW nodes in the fabric and cleanup - for( map_str_pnode::iterator nI = p_fabric->NodeByName.begin(); - nI != p_fabric->NodeByName.end(); - nI++) { - - p_node = (*nI).second; - if (p_node->type != IB_SW_NODE) continue; + unsigned int lidStep = 1 << p_fabric->lmc; - if (p_node->appData1.ptr) { - p_node->appData1.ptr = NULL; - } + // go over all ports in the fabric + for (int i = p_fabric->minLid; i <= p_fabric->maxLid; i += lidStep) { + IBPort *p_Port = p_fabric->PortByLid[i]; + if (!p_Port) continue; + IBNode *p_node = p_Port->p_node; + int nL; + if (p_node->type == IB_CA_NODE) + nL = p_fabric->getNumSLs(); + else + nL = p_fabric->getNumVLs(); + // Go over all node's ports + for (int k=0;kPorts.size();k++) { + IBPort* p_Port = p_node->Ports[k]; + // Init virtual channel array + p_Port->channels.resize(nL); + for (int j=0;jchannels[j] = new VChannel; + } } return 0; } @@ -556,23 +282,25 @@ CrdLoopPrepare(IBFabric *p_fabric) { // Cleanup the data model int CrdLoopCleanup(IBFabric *p_fabric) { - IBNode *p_node; - - // Go over all SW nodes in the fabric and cleanup - for( map_str_pnode::iterator nI = p_fabric->NodeByName.begin(); - nI != 
p_fabric->NodeByName.end(); - nI++) { - - p_node = (*nI).second; - if (p_node->type != IB_SW_NODE) continue; + unsigned int lidStep = 1 << p_fabric->lmc; - if (p_node->appData1.ptr) { - uint8_t *p_tbl = (uint8_t *)p_node->appData1.ptr; - delete [] p_tbl; - p_node->appData1.ptr = NULL; - } + // go over all ports in the fabric + for (int i = p_fabric->minLid; i <= p_fabric->maxLid; i += lidStep) { + IBPort *p_Port = p_fabric->PortByLid[i]; + if (!p_Port) continue; + IBNode *p_node = p_Port->p_node; + int nL; + if (p_node->type == IB_CA_NODE) + nL = p_fabric->getNumSLs(); + else + nL = p_fabric->getNumVLs(); + // Go over all node's ports + for (int k=0;kPorts.size();k++) { + IBPort* p_Port = p_node->Ports[k]; + for (int j=0;jchannels[j]; + } } - return 0; } ////////////////////////////////////////////////////////////////////////////// @@ -580,45 +308,30 @@ CrdLoopCleanup(IBFabric *p_fabric) { // Top Level Subroutine: int CrdLoopAnalyze(IBFabric *p_fabric) { + int res=0; - cout << "-I- Analyzing Fabric for Credit Loops (one VL used)." << endl; - - CrdLoopPrepare(p_fabric); - - // Allocate routing tables on all switches (appData1.ptr) - CrdLoopInitRtTbls(p_fabric); - - // Go over all CA to CA paths and mark the output links - // input links connections on these paths - if (CrdLoopPopulateRtTbls(p_fabric)) { - cout << "-E- Fail to populate the Routing Tables." << endl; - return(1); + cout << "-I- Analyzing Fabric for Credit Loops "<<(int)p_fabric->getNumSLs()<<" SLs, "<<(int)p_fabric->getNumVLs()<< " VLs used..."; + // Init data structures + if (CrdLoopPrepare(p_fabric)) { + cout << "-E- Fail to prepare data structures." << endl; + return(1); } - - // CrdLoopDumpRtTbls(p_fabric); - - // Start BFS from all CA's and require all inputs for an - // output node to be marked visited to go through it. - if (CrdLoopBfsFromCAs(p_fabric)) { - cout << "-E- Fail to BFS from all CA nodes through the Routing Tables." 
<< endl; - return(1); - } - - // Report all Switch ports that are still marked as not - // fully visited. - int numLoopPorts; - int doNotPrintPath = 1; - if ((numLoopPorts = CrdLoopReportLoops(p_fabric, doNotPrintPath))) { - cout << "-E- Found:" << numLoopPorts - << " Credit Loops" << endl; - } else { - cout << "-I- No credit loops found." << endl; + // Create the dependencies + if (CrdLoopConnectDepend(p_fabric)) { + cout << "-E- Fail to build dependency graphs." << endl; + return(1); } + // Find the loops if exist + res = CrdLoopFindLoops(p_fabric); + if (!res) + cout << " no credit loops found" << endl; + else + cout << endl << "-E- credit loops in routing"<field(2).c_str(), NULL, 10); + maxLid = lid > maxLid ? lid:maxLid; + } + /*else + { + cout << "-E- Wrong file format:" << fn.c_str() << endl; + anyErr++; + }*/ + } + f.close(); + + // Make second pass and build the tables + f.open(fn.c_str(),ifstream::in); + if (f.fail()) + { + cout << "-E- Fail to open file:" << fn.c_str() << endl; + return 1; + } + + while (f.good()) { + f.getline(sLine,1024); + + p_rexRes = slLine.apply(sLine); + if (p_rexRes) + { + uint64_t guid = strtoull(p_rexRes->field(1).c_str(), NULL, 16); + unsigned int lid = strtoull(p_rexRes->field(2).c_str(), NULL, 10); + uint8_t sl = strtoull(p_rexRes->field(3).c_str(), NULL, 10); + + IBNode* p_node = getNodeByGuid(guid); + if (!p_node) + { + cout << "-E- Fail to find node with guid:" + << guid << endl; + anyErr++; + } + else + { + // Update number of used SLs + numSLs = sl+1 > numSLs ? 
sl+1:numSLs; + // Insert table entry + p_node->setPSLForLid(lid,maxLid,sl); + } + delete p_rexRes; + } + /*else + { + cout << "-E- Wrong file format:" << fn.c_str() << endl; + anyErr++; + }*/ + } + cout << "-I- Defined "<< (int)numSLs << " SLs in use" <field(1).c_str(), NULL, 16); + unsigned int iport = strtoull(p_rexRes->field(2).c_str(), NULL, 10); + unsigned int oport = strtoull(p_rexRes->field(3).c_str(), NULL, 10); + + IBNode* p_node = getNodeByGuid(guid); + if (!p_node) + { + cout << "-E- Fail to find node with guid:" + << guid << endl; + anyErr++; + } + else + { + for (int i=0;ifield(4+i).c_str(), NULL, 16); + numVLs = numVLs > vl+1 ? numVLs : vl+1; + // Set the table entry + p_node->setSLVL(iport,oport,i,vl); + } + } + delete p_rexRes; + } + /*else + { + cout << "-E- Wrong file format:" << fn.c_str() << endl; + anyErr++; + }*/ + } + cout << "-I- Defined "<< (int)numVLs << " VLs in use" < vec_vec_int; typedef vector vec_byte; typedef vector vec_word; typedef vector vec_vec_byte; +typedef vector vec3_byte; typedef vector vec_pport; typedef vector vec_pnode; +typedef vector vec_pvch; typedef map< string, class IBSysPort *, strless > map_str_psysport; typedef map< string, class IBNode *, strless > map_str_pnode; typedef map< string, class IBSystem *, strless > map_str_psys; typedef map< uint64_t, class IBPort *, less > map_guid_pport; typedef map< uint64_t, class IBNode *, less > map_guid_pnode; typedef map< uint64_t, class IBSystem *, less > map_guid_psys; +typedef map< string, int, strless > map_str_int; typedef list list_pnode; typedef list list_psystem; typedef list list_int; @@ -121,6 +127,9 @@ typedef set< uint16_t, less< uint16_t > > set_uint16; #define FABU_LOG_VERBOSE 0x4 #define IBNODE_UNASSIGNED_RANK 0xFF +// DFS constants type +typedef enum {Untouched,Open,Closed} dfs_t; + // // GLOBALS // @@ -198,6 +207,48 @@ static inline string guid2str(uint64_t guid) { }; // +// Virtual Channel class +// Used for credit loops verification +// + +class VChannel { 
+ vec_pvch depend; // Vector of dependencies + dfs_t flag; // DFS state + public: + // Constructor + VChannel() { + flag = Untouched; + }; + //Getters/Setters + inline void setDependSize(int numDepend) { + if (depend.size() != (unsigned)numDepend) { + depend.resize(numDepend); + for (int i=0;iSL mapping) of this node (for CAs only) + vec3_byte SLVL; // SL2VL table of this node (for switches only) vec_word MFT; // The Multicast forwarding table PrivateAppData appData1; // Application Private Data #1 PrivateAppData appData2; // Application Private Data #2 @@ -316,6 +370,18 @@ class IBNode { // Get the LFT for a given lid int getLFTPortForLid (unsigned int lid); + // Set the PSL table + void setPSLForLid (unsigned int lid, unsigned int maxLid, uint8_t sl); + + // Add entry to SL2VL table + void setSLVL(unsigned int iport,unsigned int oport,uint8_t sl, uint8_t vl); + + // Get the PSL table for a given lid + uint8_t getPSLForLid(unsigned int lid); + + // Get the SL2VL table entry + uint8_t getSLVL(unsigned int iport, unsigned int oport, uint8_t sl); + // Set the Multicast FDB table void setMFTPortForMLid(unsigned int lid, unsigned int portNum); @@ -423,6 +489,8 @@ class IBFabric { unsigned int lmc; // LMC value used uint8_t defAllPorts; // If not zero all ports (unconn) are declared uint8_t subnCANames; // The Subnet.lst has host names for CA's + uint8_t numSLs; // Number of used SLs + uint8_t numVLs; // Number of used VLs set_uint16 mcGroups; // A set of all active multicast groups // Constructor @@ -430,6 +498,8 @@ class IBFabric { maxLid = 0; defAllPorts = 1; subnCANames = 1; + numSLs = 1; + numVLs = 1; lmc = 0; minLid = 0; PortByLid.push_back(NULL); // make sure we always have one for LID=0 @@ -493,6 +563,12 @@ class IBFabric { // Parse OpenSM FDB dump file int parseFdbFile(string fn); + // Parse PSL mapping + int parsePSLFile(string fn); + + // Parse SLVL mapping + int parseSLVLFile(string fn); + // Parse an OpenSM MCFDBs file and set the MFT table accordingly int 
parseMCFdbFile(string fn); @@ -513,6 +589,22 @@ class IBFabric { return (PortByLid[lid]); }; + inline void setNumSLs(uint8_t nSL) { + numSLs=nSL; + }; + + inline uint8_t getNumSLs() { + return numSLs; + }; + + inline void setNumVLs(uint8_t nVL) { + numVLs=nVL; + }; + + inline uint8_t getNumVLs() { + return numVLs; + }; + // dump out the contents of the entire fabric void dump(ostream &sout); diff --git a/ibdm/src/osm_check.cpp b/ibdm/src/osm_check.cpp index e1e03e3..f1d2895 100644 --- a/ibdm/src/osm_check.cpp +++ b/ibdm/src/osm_check.cpp @@ -45,7 +45,7 @@ void show_usage() { cout << "Usage: there are two modes: Design/Verify" << endl; cout << " Design: ibdmchk [-v][-h][-e][-u][-l ][-r ] -t -n -p " << endl; - cout << " Verify: ibdmchk [-v][-h][-l ][-r ] [-s ] [-f ] [-m ]" << endl; + cout << " Verify: ibdmchk [-v][-h][-l ][-r ] [-s ] [-f ] [-m ] [-c ] [-d ]\n\n" << endl; } void @@ -72,7 +72,7 @@ show_help() { << " -p|--port = the port number by which the SM nodes is attached to the fabric.\n" << "\n" << " Options:\n" - << " -v|--verbose = verbsoe mode\n" + << " -v|--verbose = verbose mode\n" << " -h|--help = provide this help message\n" << " -l|--lmc = LMC value > 0 means assigning 2^lmc lids to each port.\n" << " -e|--enh = use enhanced routing algorithm when LMC > 0 and report the resulting paths\n" @@ -82,11 +82,15 @@ show_help() { << "\n" << " CLUSTER VERIFICATION:\n" << " Usage:\n" - << " ibdmchk [-v][-h][-r ] [-s ] [-f ] [-l ]\n\n" + << " ibdmchk [-v][-h][-r ] [-s ] [-f ] [-m ] [-l ] [-c ] [-d ]\n\n" << " Description:\n" << " After the cluster is built and OpenSM is run (using flag -D 0x43) it reports the\n" << " subnet and FDB tables into the files osm-subnet.lst, osm.fdbs and osm.fdbs in\n" - << " /var/log/ (or subnet.lst, osm.fdbs and osm.mcfdbs into /tmp in older versions).\n" + << " /var/log/ (or subnet.lst, osm.fdbs and osm.mcfdbs into /tmp in older versions).\n" + << " If more than one SL is known to be used additional file holding CAxCA->SL mapping \n" 
+ << " is generated (format: 0xsrc_guid dlid sl) . In this case the SL2VL mapping is \n" + << " optionally supplied in an additional file (format: 0xsw_guid inport outport 0x(sl0)(sl1),\n" + << " 0x(sl2)(sl3),...). Without SL2VL mapping file an identity mapping is used.\n" << " Based on these files the utility checks all CA to CA connectivity. Further analysis\n" << " for credit deadlock potential is performed and reported. \n" << " In case of an LMC > 0 it reports histograms for how many systems and nodes\n" @@ -96,13 +100,15 @@ show_help() { << " -l|--lmc = The LMC value used while running OpenSM. Mandatory if not the default 0.\n" << "\n" << " Options:\n" - << " -v|--verbose = verbsoe mode\n" + << " -v|--verbose = verbose mode\n" << " -h|--help = provide this help message\n" << " -s|--subnet = OpenSM subnet.lst file (/var/log/osm-subnet.lst or /tmp/subnet.lst)\n" << " -f|--fdb = OpenSM dump of Ucast LFDB. Use -D 0x41 to generate it.\n" << " (default is /var/log/osm.fdbs or /tmp/osm.fdbs).\n" << " -m|--mcfdb = OpenSM dump of Multicast LFDB. Use -D 0x41 to generate it.\n" << " (default is /var/log/osm.mcfdbs or /tmp/osm.mcfdbs).\n" + << " -c|--psl = CAxCA->SL mapping. \n" + << " -d|--slvl = SL2VL mapping. 
\n" << " -r|--roots = a file holding all root nodes guids (one per line).\n" << "\n" << "Author: Eitan Zahavi, Mellanox Technologies LTD.\n" @@ -189,6 +195,8 @@ int main (int argc, char **argv) { string mcFdbFile = string(""); string TopoFile = string(""); string SmNodeName = string(""); + string PSLFile = string(""); + string SLVLFile = string(""); string RootsFileName = string(""); int EnhancedRouting = 0; int lmc = 0; @@ -197,11 +205,11 @@ int main (int argc, char **argv) { int AllPaths = 0; /* - * Parseing of Command Line + * Parsing of Command Line */ char next_option; - const char * const short_option = "vhl:s:f:m:el:t:p:n:uar:"; + const char * const short_option = "vhl:s:f:m:el:t:p:n:uar:c:d:"; /* In the array below, the 2nd parameter specified the number of arguments as follows: @@ -217,13 +225,15 @@ int main (int argc, char **argv) { { "subnet", 1, NULL, 's'}, { "fdb", 1, NULL, 'f'}, { "mcfdb", 1, NULL, 'm'}, + { "psl", 1, NULL, 'c'}, + { "slvl", 1, NULL, 'd'}, { "topology", 1, NULL, 't'}, - { "node", 1, NULL, 'n'}, + { "node", 1, NULL, 'n'}, { "port", 1, NULL, 'p'}, - { "enh", 0, NULL, 'e'}, - { "updn", 0, NULL, 'u'}, - { "roots", 1, NULL, 'r'}, - { "all", 0, NULL, 'a'}, + { "enh", 0, NULL, 'e'}, + { "updn", 0, NULL, 'u'}, + { "roots", 1, NULL, 'r'}, + { "all", 0, NULL, 'a'}, { NULL, 0, NULL, 0 } /* Required at the end of the array */ }; @@ -274,6 +284,20 @@ int main (int argc, char **argv) { subnetFile = string(optarg); break; + case 'c': + /* + Specifies CAxCA->SL + */ + PSLFile = string(optarg); + break; + + case 'd': + /* + Specifies SL->VL + */ + SLVLFile = string(optarg); + break; + case 't': /* Specifies Topology @@ -342,6 +366,7 @@ int main (int argc, char **argv) { printf(" SM Node ........ %s\n", SmNodeName.c_str()); printf(" SM Port ........ %u\n", SmPortNum); printf(" LMC ............ 
%u\n", lmc); + if (EnhancedRouting && lmc > 0) printf(" Using Enhanced Routing\n"); if (RootsFileName.size()) @@ -352,75 +377,74 @@ int main (int argc, char **argv) { exit(1); } - // get the SM Port - IBNode *p_smNode = fabric.getNode(SmNodeName); - if (! p_smNode ) { - cout << "-E- Fail to find SM node:" << SmNodeName << endl; - exit(1); - } + // get the SM Port + IBNode *p_smNode = fabric.getNode(SmNodeName); + if (! p_smNode ) { + cout << "-E- Fail to find SM node:" << SmNodeName << endl; + exit(1); + } - IBPort *p_smPort = p_smNode->getPort(SmPortNum); - if (! p_smPort) { - cout << "-E- Fail to find SM Port: " << SmNodeName - << "/" << SmPortNum << endl; - exit(1); + IBPort *p_smPort = p_smNode->getPort(SmPortNum); + if (! p_smPort) { + cout << "-E- Fail to find SM Port: " << SmNodeName + << "/" << SmPortNum << endl; + exit(1); + } + + // assign lids + if (SubnMgtAssignLids(p_smPort,lmc)) { + cout << "-E- Fail to assign LIDs." << endl; + exit(1); + } + + // we need to run min hop calculation anyway + if (SubnMgtCalcMinHopTables(&fabric)) { + cout << "-E- Fail to update Min Hops Tables." << endl; + exit(1); + } + + if (UseUpDown) { + list rootNodes; + if (RootsFileName.size()) { + rootNodes = ParseRootNodeNamesFile(&fabric, RootsFileName); + } + else { + rootNodes = SubnMgtFindRootNodesByMinHop(&fabric); + } + + if (!rootNodes.empty()) { + cout << "-I- Recognized " << rootNodes.size() << " root nodes:" << endl; + for (list ::iterator nI = rootNodes.begin(); + nI != rootNodes.end(); nI++) { + cout << " " << (*nI)->name << endl; } + cout << "---------------------------------------------------------------------------\n" << endl; - // assign lids - if (SubnMgtAssignLids(p_smPort,lmc)) { - cout << "-E- Fail to assign LIDs." << endl; + map_pnode_int nodesRank; + SubnRankFabricNodesByRootNodes(&fabric, rootNodes, nodesRank); + + if (SubnMgtCalcUpDnMinHopTbls(&fabric, nodesRank)) { + cout << "-E- Fail to update Min Hops Tables." 
<< endl; exit(1); } + } else { + cout << "-E- Fail to recognize any root nodes. Up/Down is not active!" << endl; + } + } + + if (!EnhancedRouting) { + + if (SubnMgtOsmRoute(&fabric)) { + cout << "-E- Fail to update LFT Tables." << endl; + exit(1); + } + } else { + if (SubnMgtOsmEnhancedRoute(&fabric)) { + cout << "-E- Fail to update LFT Tables." << endl; + exit(1); + } + } - // we need to run min hop calculation anyway - if (SubnMgtCalcMinHopTables(&fabric)) { - cout << "-E- Fail to update Min Hops Tables." << endl; - exit(1); - } - - if (UseUpDown) { - list rootNodes; - if (RootsFileName.size()) - { - rootNodes = ParseRootNodeNamesFile(&fabric, RootsFileName); - } - else - { - rootNodes = SubnMgtFindRootNodesByMinHop(&fabric); - } - - if (!rootNodes.empty()) { - cout << "-I- Recognized " << rootNodes.size() << " root nodes:" << endl; - for (list ::iterator nI = rootNodes.begin(); - nI != rootNodes.end(); nI++) { - cout << " " << (*nI)->name << endl; - } - cout << "---------------------------------------------------------------------------\n" << endl; - - map_pnode_int nodesRank; - SubnRankFabricNodesByRootNodes(&fabric, rootNodes, nodesRank); - - if (SubnMgtCalcUpDnMinHopTbls(&fabric, nodesRank)) { - cout << "-E- Fail to update Min Hops Tables." << endl; - exit(1); - } - } else { - cout << "-E- Fail to recognize any root nodes. Up/Down is not active!" << endl; - } - } - - if (!EnhancedRouting) { - - if (SubnMgtOsmRoute(&fabric)) { - cout << "-E- Fail to update LFT Tables." << endl; - exit(1); - } - } else { - if (SubnMgtOsmEnhancedRoute(&fabric)) { - cout << "-E- Fail to update LFT Tables." 
<< endl; - exit(1); - } - } } else { int anyMissingFile = 0; @@ -463,6 +487,14 @@ int main (int argc, char **argv) { printf(" FDB File = %s\n", fdbFile.c_str()); printf(" MCFDB File = %s\n", mcFdbFile.c_str()); printf(" Subnet File = %s\n", subnetFile.c_str()); + if (PSLFile.size() >0) + printf(" SL File = %s\n", PSLFile.c_str()); + else + printf(" SL File = NONE, using single SL\n"); + if (SLVLFile.size() >0) + printf(" SLVL File = %s\n", SLVLFile.c_str()); + else + printf(" SLVL File = NONE, using identity mapping\n"); printf(" LMC = %u\n", lmc); printf("-------------------------------------------------\n"); @@ -478,6 +510,18 @@ int main (int argc, char **argv) { exit(1); } + if (PSLFile.size() && fabric.parsePSLFile(PSLFile)) { + cout << "-E- Fail to parse SL file:" << PSLFile << endl; + exit(1); + } + + fabric.setNumVLs(fabric.getNumSLs()); + + if (SLVLFile.size() && fabric.parseSLVLFile(SLVLFile)) { + cout << "-E- Fail to parse SLVL file:" << SLVLFile << endl; + exit(1); + } + if (fabric.parseMCFdbFile(mcFdbFile)) { cout << "-E- Fail to parse multicast fdb file:" << mcFdbFile << endl; exit(1); @@ -509,20 +553,20 @@ int main (int argc, char **argv) { int anyErr = 0; if (RootsFileName.size()) - { - if (TopoFile.size()) { - rootNodes = ParseRootNodeNamesFile(&fabric, RootsFileName); + if (TopoFile.size()) + { + rootNodes = ParseRootNodeNamesFile(&fabric, RootsFileName); + } + else + { + rootNodes = ParseRootNodeGuidsFile(&fabric, RootsFileName); + } } - else + else { - rootNodes = ParseRootNodeGuidsFile(&fabric, RootsFileName); + rootNodes = SubnMgtFindRootNodesByMinHop(&fabric); //(Causes segmentation fault with DOR) } - } - else - { - rootNodes = SubnMgtFindRootNodesByMinHop(&fabric); - } if (!rootNodes.empty()) { cout << "-I- Recognized " << rootNodes.size() << " root nodes:" << endl; -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Tue Oct 7 02:31:34 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 07 Oct 2008 11:31:34 +0200 Subject: 
[ofa-general] [PATCH] ibutils: Optimal permutation routing in fat-tree Message-ID: <48EB2C76.7070003@dev.mellanox.co.il> [On behalf of Vladimir Zdornov] Optimal permutation routing in fat-tree. The fat-tree is treated as a high-order Clos network. Each Clos is routed using graph coloring of a bipartite graph. The result of the routing procedure affects only ascending fdb entries (descending fdb entries stay unchanged). Signed-off-by: Vladimir Zdornov --- ibdm/datamodel/Bipartite.cc | 672 +++++++++++++++++ ibdm/datamodel/Bipartite.h | 194 +++++ ibdm/datamodel/FatTree.cpp | 1743 ++++++++++++++++++++++++------------------- ibdm/datamodel/Makefile.am | 4 +- ibdm/datamodel/RouteSys.cc | 258 +++++++ ibdm/datamodel/RouteSys.h | 98 +++ ibdm/datamodel/SubnMgt.h | 4 + ibdm/datamodel/ibdm.i | 3 + 8 files changed, 2226 insertions(+), 750 deletions(-) create mode 100644 ibdm/datamodel/Bipartite.cc create mode 100644 ibdm/datamodel/Bipartite.h create mode 100644 ibdm/datamodel/RouteSys.cc create mode 100644 ibdm/datamodel/RouteSys.h diff --git a/ibdm/datamodel/Bipartite.cc b/ibdm/datamodel/Bipartite.cc new file mode 100644 index 0000000..760b040 --- /dev/null +++ b/ibdm/datamodel/Bipartite.cc @@ -0,0 +1,672 @@ +/* + * Copyright (c) 2008 Mellanox Technologies LTD. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + * $Id$ + */ + +#include "Bipartite.h" + +//CLASS edge/////////////////////////////////// + +bool edge::isMatched() { + vertex* ver1 = (vertex*)v1; + vertex* ver2 = (vertex*)v2; + + if (((this == ver1->getPartner()) && (this != ver2->getPartner())) || ((this == ver2->getPartner()) && (this != ver1->getPartner()))) + cout << "-E- Error in edge matching" << endl; + + return (this == ver1->getPartner()) && (this == ver2->getPartner()); +} + +//CLASS vertex///////////////////////////////// + +vertex::vertex(int n, side sd, int rad):id(n),s(sd),radix(rad) +{ + connections = new edge*[radix]; + pred = new edge*[radix]; + succ = new edge*[radix]; + + partner = NULL; + for (int i=0; iv1 == this) { + my_idx = e->idx1; + other_idx = e->idx2; + v = (vertex*)(e->v2); + } + else if (e->v2 == this) { + my_idx = e->idx2; + other_idx = e->idx1; + v = (vertex*)(e->v1); + } + else { + cout << "-E- Edge not connected to current vertex" << endl; + return; + } + + if (my_idx >= radix || other_idx >= radix) { + cout << "-E- Edge index illegal" << endl; + return; + } + + // Now disconnect + connections[my_idx] = NULL; + maxUsed--; + v->connections[other_idx] = NULL; + v->maxUsed--; +} + +void vertex::pushConnection(edge* e) +{ + maxUsed++; + if (maxUsed == radix) { + cout << 
"-E- Can't push connection to vertex: " << id << ", exceeding radix!" << endl; + return; + } + // Mark our side + if (e->v1 == NULL) { + e->v1 = this; + e->idx1 = maxUsed; + } + else if (e->v2 == NULL) { + e->v2 = this; + e->idx2 = maxUsed; + } + else { + cout << "-E- Can't push connection both edges are already filled" << endl; + return; + } + + if (maxUsed >= radix) { + cout << "-E- maxUsed illegal" << endl; + return; + } + + // Now connect + connections[maxUsed] = e; +} + +edge* vertex::popConnection() +{ + // Look for a connection + int i=0; + while ((iv1 == this) { + vertex* v = (vertex*)(tmp->v2); + v->connections[tmp->idx2] = NULL; + } + else if (tmp->v2 == this) { + vertex* v = (vertex*)(tmp->v1); + v->connections[tmp->idx1] = NULL; + } + else { + cout << "-E- Edge not connected to current vertex" << endl; + return NULL; + } + + if (tmp->idx1 >= radix || tmp->idx2 >= radix) { + cout << "-E- Edge index illegal" << endl; + return NULL; + } + + return tmp; +} + +// For unmacthed vertex, find an unmatched neighbor and match the pair +bool vertex::match() +{ + // Already matched + if (partner) + return false; + // Look for neighbor + for (int i=0; iotherSide(this)); + if (!v->partner) { + // Match + partner = connections[i]; + v->partner = connections[i]; + return true; + } + } + } + return false; +} + +edge* vertex::getPartner() const +{ + return partner; +} + +bool vertex::getInLayers() const +{ + return inLayers; +} + +void vertex::setInLayers(bool b) +{ + inLayers = b; +} + +void vertex::resetLayersInfo() +{ + inLayers = false; + succCount = predCount = 0; + for (int i=0; i& l) +{ + // No partner + if (!partner) + return; + vertex* p = (vertex*)(partner->otherSide(this)); + // Partner already in one of the layers + if (p->inLayers) + return; + // Add partner to the layer + l.push_front(p); + p->inLayers = true; + // Update pred/succ relations + if (succCount >= radix) { + cout << "-E- More successors than radix" << endl; + return; + } + succ[succCount] = 
partner; + succCount++; + + if (p->predCount >= radix) { + cout << "-E- More predecessors than radix" << endl; + return; + } + p->pred[p->predCount] = partner; + p->predCount++; +} + +bool vertex::addNonPartnersLayers(list& l) +{ + vertex* prtn = NULL; + bool res = false; + + if (partner) + prtn = (vertex*)(partner->otherSide(this)); + + for (int i=0; iotherSide(this)); + if ((v != prtn) && (!v->inLayers)) { + // free vertex found + if (!v->partner) + res = true; + // Add vertex to the layer + l.push_front(v); + v->inLayers = true; + // Update pred/succ relations + if (succCount >= radix) { + cout << "-E- More successors than radix" << endl; + return false; + } + succ[succCount] = connections[i]; + succCount++; + + if (v->predCount >= radix) { + cout << "-E- More predecessors than radix" << endl; + return false; + } + v->pred[v->predCount] = connections[i]; + v->predCount++; + } + } + return res; +} + +vertex* vertex::getPredecessor() const +{ + vertex* v = NULL; + // Find a valid predecessor still in layers + for (int i=0; iotherSide(this)); + if (v2->inLayers) { + v = v2; + break; + } + } + } + return v; +} + +// Flip edge status on augmenting path +void vertex::flipPredEdge(int idx) +{ + int i=0; + // Find an edge to a predecessor + for (i=0; iv1; + vertex* v2 = (vertex*)pred[i]->v2; + if (v1->getInLayers() && v2->getInLayers()) + break; + } + + if (i == radix) { + cout << "-E- Could not find predecessor for flip" << endl; + return; + } + // The predecessor vertex + vertex* v = (vertex*) pred[i]->otherSide(this); + + // Unmatch edge + if (idx) + v->partner = NULL; + // Match edge + else { + partner = pred[i]; + v->partner = pred[i]; + } +} + +// Removing vertex from layers and updating successors/predecessors +void vertex::unLink(list& l) +{ + inLayers = false; + for (int i=0; iotherSide(this); + if (v->inLayers) { + v->predCount--; + // v has no more predecessors, schedule for removal from layers... 
+ if (!v->predCount) + l.push_back(v); + succ[i] = NULL; + } + } + } + succCount = 0; +} + +//CLASS Bipartite + +// C'tor + +Bipartite::Bipartite(int s, int r):size(s),radix(r) +{ + leftSide = new vertex*[size]; + rightSide = new vertex*[size]; + + for (int i=0; ireqDat; +} + +///////////////////////////////////////////////////////////// + +// Add connection between the nodes to the graph + +void Bipartite::connectNodes(int n1, int n2, inputData reqDat) +{ + if ((n1 >= size) || (n2 >= size)) { + cout << "-E- Node index exceeds size" << endl; + return; + } + // Create new edge + edge* newEdge = new edge; + + // Init edge fields and add to it the edges list + newEdge->it = List.insert(List.end(), newEdge); + newEdge->reqDat = reqDat; + newEdge->v1 = newEdge->v2 = NULL; + + // Connect the vertices + leftSide[n1]->pushConnection(newEdge); + rightSide[n2]->pushConnection(newEdge); +} + +//////////////////////////////////////////////////////////////// + +// Find maximal matching + +void Bipartite::maximalMatch() +{ + // Invoke match on left-side vertices + for (int i=0; imatch(); +} + +//////////////////////////////////////////////////////////////// + +// Find maximum matching + +// Hopcroft-Karp algorithm +Bipartite* Bipartite::maximumMatch() +{ + // First find a maximal match + maximalMatch(); + + list::iterator it2 = List.begin(); + list l1, l2; + list::iterator it; + + // Loop on algorithm phases + while (1) { + bool hardStop = false; + // First reset layers info + for (int i=0; iresetLayersInfo(); + rightSide[i]->resetLayersInfo(); + } + // Add free left-side vertices to l1 (building layer 0) + l1.clear(); + for (int i=0; igetPartner()) { + l1.push_front(leftSide[i]); + leftSide[i]->setInLayers(true); + } + } + // This is our termination condition + // Maximum matching achieved + if (l1.empty()) + break; + // Loop on building layers + while (1) { + bool stop = false; + l2.clear(); + // Add all non-partners of vertices in l1 to layers (l2) + // We signal to stop if a 
free (right-side) vertex is entering l2 + for (it = l1.begin(); it != l1.end(); it++) + if ((*it)->addNonPartnersLayers(l2)) + stop = true; + // Found free vertex among right-side vertices + if (stop) { + // There are augmenting paths, apply them + augment(l2); + break; + } + // This is a terminal condition + if (l2.empty()) { + hardStop = true; + break; + } + // Add partners of vertices in l2 to layers (l1) + l1.clear(); + for (it = l2.begin(); it!= l2.end(); it++) + (*it)->addPartnerLayers(l1); + // This is a terminal condition + if (l1.empty()) { + hardStop = true; + break; + } + } + // Maximum matching achieved + if (hardStop) + break; + } + // Build the matching graph + Bipartite* M = new Bipartite(size, 1); + // Go over all edges and move matched one to the new graph + it2 = List.begin(); + while (it2 != List.end()) { + edge* e = (edge*)(*it2); + if (e->isMatched()) { + vertex* v1 = (vertex*)(e->v1); + vertex* v2 = (vertex*)(e->v2); + // Unlink vertices + ((vertex*)(e->v1))->delConnection(e); + // Update edges list + it2 = List.erase(it2); + // Add link to the new graph + if (v1->getSide() == LEFT) + M->connectNodes(v1->getID(), v2->getID(), e->reqDat); + else + M->connectNodes(v2->getID(), v1->getID(), e->reqDat); + // Free memory + delete e; + } + else + it2++; + } + return M; +} + +////////////////////////////////////////////////////////////////////// + +// Apply augmenting paths on the matching + +void Bipartite::augment(list& l) +{ + // Use delQ to store vertex scheduled for the removal from layers + list delQ; + // Remove non-free vertices + list::iterator it = l.begin(); + while (it != l.end()) { + if ((*it)->getPartner()) { + delQ.push_front((*it)); + it = l.erase(it); + } + else + it++; + } + // Get rid of non-free vertices + while (!delQ.empty()) { + vertex* fr = delQ.front(); + delQ.pop_front(); + fr->unLink(delQ); + } + + if (l.empty()) { + cout << "-E- No free vertices left!" 
<< endl; + return; + } + // Augment and reset pred/succ relations + while (!l.empty()) { + vertex* curr = l.front(); + l.pop_front(); + int idx = 0; + // Backtrace the path and augment + int length=0; + while (1) { + delQ.push_front(curr); + // Its either the end of a path or an error + if (!curr->getPredecessor()) { + if (!idx && length) { + cout << "-E- This vertex must have predecessor" << endl; + return; + } + break; + } + // Flip edge status + curr->flipPredEdge(idx); + // Move back + curr = curr->getPredecessor(); + idx = (idx+1)%2; + length++; + } + // Now clean the delQ + while (!delQ.empty()) { + vertex* fr = delQ.front(); + delQ.pop_front(); + fr->unLink(delQ); + } + } +} + +//////////////////////////////////////////////////////////////////////// + +// Perform Euler decomposition + +void Bipartite::decompose(Bipartite** bp1, Bipartite** bp2) +{ + if (radix < 2) { + cout << "-E- Radix value illegal: " << radix << endl; + return; + } + + // Create new graphs + Bipartite* arr[2]; + arr[0] = new Bipartite(size, radix/2); + arr[1] = new Bipartite(size, radix/2); + + // Continue as long as unassigned edges left + while (!List.empty()) { + int idx = 0; + edge* e = (edge*)List.front(); + vertex* current = (vertex*)e->v1; + e = current->popConnection(); + + while (e) { + // Connect nodes in the new graphs + vertex* v1 = (vertex*)e->v1; + vertex* v2 = (vertex*)e->v2; + if (v1->getSide() == LEFT) + arr[idx]->connectNodes(v1->getID(), v2->getID(), e->reqDat); + else + arr[idx]->connectNodes(v2->getID(), v1->getID(), e->reqDat); + idx = (idx+1)%2; + // Remove edge from list + List.erase(e->it); + // Pick next vertex + current = (vertex*) e->otherSide(current); + // Free memory + delete e; + // Pick next edge + e = current->popConnection(); + } + } + *bp1 = arr[0]; + *bp2 = arr[1]; +} diff --git a/ibdm/datamodel/Bipartite.h b/ibdm/datamodel/Bipartite.h new file mode 100644 index 0000000..cdac81a --- /dev/null +++ b/ibdm/datamodel/Bipartite.h @@ -0,0 +1,194 @@ +/* + * 
Copyright (c) 2008 Mellanox Technologies LTD. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + * $Id$ + */ + +/* + * Fabric Utilities Project + * + * Bipartite Graph Header file + * + * Author: Vladimir Zdornov, Mellanox Technologies + * + */ + +#ifndef IBDM_BIPARTITE_H_ +#define IBDM_BIPARTITE_H_ + +#include <list> +#include "RouteSys.h" + +using namespace std; + +typedef list<void*>::iterator peIter; +typedef list<void*> peList; + +typedef enum side_ {LEFT=0, RIGHT} side; + +class edge +{ + public: + // Vertices + void* v1; + void* v2; + // Connection indices + int idx1; + int idx2; + + // Edge list iterator + peIter it; + + // Request data + inputData reqDat; + + // C'tor + edge():v1(NULL),v2(NULL),idx1(-1),idx2(-1){}; + + // Get the vertex on the other side of the edge + void* otherSide(const void* v) { + if (v == v1) + return v2; + if (v == v2) + return v1; + return NULL; + } + + // Check whether the edge is matched + bool isMatched(); +}; + +class vertex +{ + // ID + int id; + // Side (0 - left, 1 - right) + side s; + // Array of connected edges + edge** connections; + // Initial number of neighbors + int radix; + // Current number of neighbors + int maxUsed; + + // Matching fields + + // Edge leading to the partner (NULL if none) + edge* partner; + // Array of layers predecessors + edge** pred; + // Number of predecessors + int predCount; + // Array of layers successors + edge** succ; + // Number of successors + int succCount; + // Denotes whether vertex is currently in layers + bool inLayers; + + public: + // C'tor + vertex(int n, side sd, int rad); + // D'tor + ~vertex(); + // Getters + Setters + int getID() const; + side getSide() const; + edge* getPartner() const; + vertex* getPredecessor() const; + bool getInLayers() const; + void setInLayers(bool b); + // Reset matching info + void resetLayersInfo(); + // Adding partner to match layers + void addPartnerLayers(list<vertex*>& l); + // Adding non-partners to match layers + // Return true if one of the neighbors was free + bool addNonPartnersLayers(list<vertex*>& l); + // Add new connection (this vertex only) + void 
pushConnection(edge* e); + // Remove given connection (vertices on both sides) + void delConnection(edge* e); + // Flip predecessor edge status + // idx represents the layer index % 2 in the augmenting backward traversal (starting from 0) + void flipPredEdge(int idx); + // Remove vertex from layers, update predecessors and successors + void unLink(list<vertex*>& l); + // Get SOME connection and remove it (from both sides) + // If no connections present NULL is returned + edge* popConnection(); + // Match vertex to SOME unmatched neighbor + // In case of success true is returned, false in case of failure + bool match(); +}; + +class Bipartite +{ + // Number of vertices on each side + int size; + // Number of edges per vertex + int radix; + // Vertices arrays + vertex** leftSide; + vertex** rightSide; + + peIter it; + // Edges list + peList List; + + // Apply augmenting paths + void augment(list<vertex*>& l); + + public: + // C'tor + Bipartite(int s, int r); + // D'tor + ~Bipartite(); + int getRadix() const; + // Set iterator to first edge (returns false if list is empty) + bool setIterFirst(); + // Set iterator to next edge (return false if list end is reached) + bool setIterNext(); + // Get inputData pointed by iterator + inputData getReqDat(); + // Add connection between two nodes + void connectNodes(int n1, int n2, inputData reqDat); + // Find maximal matching on current graph + void maximalMatch(); + // Find maximum matching and remove it from the graph + // We use a variant of Hopcroft-Karp algorithm + Bipartite* maximumMatch(); + // Decompose bipartite into two edge disjoint radix/2-regular bps + // We use a variant of Euler Decomposition + void decompose(Bipartite** bp1, Bipartite** bp2); +}; + +#endif diff --git a/ibdm/datamodel/FatTree.cpp b/ibdm/datamodel/FatTree.cpp index 55ff116..9fb2858 100644 --- a/ibdm/datamodel/FatTree.cpp +++ b/ibdm/datamodel/FatTree.cpp @@ -43,10 +43,12 @@ FatTree Utilities: #include #include "Fabric.h" #include "SubnMgt.h" +#include "RouteSys.h" + 
////////////////////////////////////////////////////////////////////////////// // Build a Fat Tree data structure for the given topology. // Prerequisites: Ranking performed and stored at p_node->rank. -// Ranking is such that roots are marked with rank=0 and leaf switches with +/// Ranking is such that roots are marked with rank=0 and leaf switches with // highest value. // // The algorithm BFS from an arbitrary leaf switch. @@ -70,169 +72,188 @@ FatTree Utilities: // for comparing tupples struct FatTreeTuppleLess : public binary_function { - bool operator()(const vec_byte& x, const vec_byte& y) const { - if (x.size() > y.size()) return false; - if (y.size() > x.size()) return true; + bool operator()(const vec_byte& x, const vec_byte& y) const { + if (x.size() > y.size()) return false; + if (y.size() > x.size()) return true; - for (unsigned int i = 0 ; i < x.size() ; i++) + for (unsigned int i = 0 ; i < x.size() ; i++) { - if (x[i] > y[i]) return false; - if (x[i] < y[i]) return true; + if (x[i] > y[i]) return false; + if (x[i] < y[i]) return true; } - return false; - } + return false; + } }; typedef map< IBNode *, vec_byte, less< IBNode *> > map_pnode_vec_byte; typedef vector< list< int > > vec_list_int; class FatTreeNode { - IBNode *p_node; // points to the fabric node for this node - vec_list_int childPorts; // port nums connected to child by changing digit - vec_list_int parentPorts;// port nums connected to parent by changing digit + IBNode *p_node; // points to the fabric node for this node + vec_list_int childPorts; // port nums connected to child by changing digit + vec_list_int parentPorts;// port nums connected to parent by changing digit public: - FatTreeNode(IBNode *p_node); - FatTreeNode(){p_node = NULL;}; - int numParents(); - int numChildren(); - int numParentGroups(); - int numChildGroups(); - friend class FatTree; + FatTreeNode(IBNode *p_node); + FatTreeNode(){p_node = NULL;}; + int numParents(); + int numChildren(); + int numParentGroups(); + int 
numChildGroups(); + bool goingDown(int lid); + friend class FatTree; }; FatTreeNode::FatTreeNode(IBNode *p_n) { - p_node = p_n; - list< int > emptyList; - for (unsigned int pn = 0; pn <= p_node->numPorts; pn++) - { + p_node = p_n; + list< int > emptyList; + for (unsigned int pn = 0; pn <= p_node->numPorts; pn++) + { childPorts.push_back(emptyList); parentPorts.push_back(emptyList); - } + } } // get the total number of children a switch have int FatTreeNode::numChildren() { - int s = 0; - for (int i = 0; i < childPorts.size(); i++) - s += childPorts[i].size(); - return s; + int s = 0; + for (int i = 0; i < childPorts.size(); i++) + s += childPorts[i].size(); + return s; } // get the total number of children a switch have int FatTreeNode::numParents() { - int s = 0; - for (int i = 0; i < parentPorts.size(); i++) - s += parentPorts[i].size(); - return s; + int s = 0; + for (int i = 0; i < parentPorts.size(); i++) + s += parentPorts[i].size(); + return s; } // get the total number of children groups int FatTreeNode::numChildGroups() { - int s = 0; - for (int i = 0; i < childPorts.size(); i++) - if (childPorts[i].size()) s++; - return s; + int s = 0; + for (int i = 0; i < childPorts.size(); i++) + if (childPorts[i].size()) s++; + return s; } int FatTreeNode::numParentGroups() { - int s = 0; - for (int i = 0; i < parentPorts.size(); i++) - if (parentPorts[i].size()) s++; - return s; + int s = 0; + for (int i = 0; i < parentPorts.size(); i++) + if (parentPorts[i].size()) s++; + return s; +} + +// Check whether there is downwards path towards the given lid +bool FatTreeNode::goingDown(int lid) +{ + int portNum = p_node->getLFTPortForLid(lid); + if (portNum == IB_LFT_UNASSIGNED) + return false; + for (int i=0; i<childPorts.size(); i++) + for (list<int>::iterator lI = childPorts[i].begin();lI != childPorts[i].end(); lI++) { + if (portNum == *lI) + return true; + } + return false; } typedef map< vec_byte, class FatTreeNode, FatTreeTuppleLess > map_tupple_ftnode; class FatTree { - // the node tupple is built out of 
the following: - // d[0] = rank - // d[1..N-1] = ID digits - IBFabric *p_fabric; // The fabric we attach to - map_pnode_vec_byte TuppleByNode; - map_tupple_ftnode NodeByTupple; - vec_int LidByIdx; // store target HCA lid by its index - unsigned int N; // number of levels in the fabric - - // obtain the Fat Tree node for a given IBNode - FatTreeNode* getFatTreeNodeByNode(IBNode *p_node); - - // get the first lowest level switch while making sure all HCAs - // are connected to same rank - // return NULL if this check is not met or no ranking available - IBNode *getLowestLevelSwitchNode(); - - // get a free tupple given the reference one and the index to change: - vec_byte getFreeTupple(vec_byte refTupple, unsigned int changeIdx); - - // convert tupple to string - string getTuppleStr(vec_byte tupple); - - // simply dump out the FatTree data: - void dump(); - - // track a connection to remote switch - int trackConnection( - FatTreeNode *p_ftNode, - vec_byte tupple, // the connected node tupple - unsigned int rank, // rank of the local node - unsigned int remRank, // rank of the remote node - unsigned int portNum, // the port number connecting to the remote node - unsigned int remDigit // the digit which changed on the remote node - ); - - // set of coefficients that represent the structure - int maxHcasPerLeafSwitch; - vec_int childrenPerRank; // not valid for leafs - vec_int parentsPerRank; - vec_int numSwInRank; // number of switches for that level - vec_int downByRank; // number of remote child switches s at rank - vec_int upByRank; // number of remote parent switches at rank - - // extract fat tree coefficients and update validity flag - // return 0 if OK - int extractCoefficients(); + // the node tupple is built out of the following: + // d[0] = rank + // d[1..N-1] = ID digits + IBFabric *p_fabric; // The fabric we attach to + map_pnode_vec_byte TuppleByNode; + map_tupple_ftnode NodeByTupple; + vec_int LidByIdx; // store target HCA lid by its index + unsigned int 
N; // number of levels in the fabric + map_str_int IdxByName; + + // obtain the Fat Tree node for a given IBNode + FatTreeNode* getFatTreeNodeByNode(IBNode *p_node); + + // get the first lowest level switch while making sure all HCAs + // are connected to same rank + // return NULL if this check is not met or no ranking available + IBNode *getLowestLevelSwitchNode(); + + // get a free tupple given the reference one and the index to change: + vec_byte getFreeTupple(vec_byte refTupple, unsigned int changeIdx); + + // convert tupple to string + string getTuppleStr(vec_byte tupple); + + // simply dump out the FatTree data: + void dump(); + + // track a connection to remote switch + int trackConnection( + FatTreeNode *p_ftNode, + vec_byte tupple, // the connected node tupple + unsigned int rank, // rank of the local node + unsigned int remRank, // rank of the remote node + unsigned int portNum, // the port number connecting to the remote node + unsigned int remDigit // the digit which changed on the remote node + ); + + // set of coefficients that represent the structure + int maxHcasPerLeafSwitch; + vec_int childrenPerRank; // not valid for leafs + vec_int parentsPerRank; + vec_int numSwInRank; // number of switches for that level + vec_int downByRank; // number of remote child switches s at rank + vec_int upByRank; // number of remote parent switches at rank + + // extract fat tree coefficients and update validity flag + // return 0 if OK + int extractCoefficients(); public: - // construct the fat tree by matching the topology to it. - // note that this might return an invalid tree for routing - // as indicated by isValid flag - FatTree(IBFabric *p_fabric); + // construct the fat tree by matching the topology to it. 
+ // note that this might return an invalid tree for routing + // as indicated by isValid flag + FatTree(IBFabric *p_fabric); + + // true if the fabric can be mapped to a fat tree + bool isValid; - // true if the fabric can be mapped to a fat tree - bool isValid; + // propagate FDB assignments going up the tree ignoring the out port + int assignLftUpWards(FatTreeNode *p_ftNode, uint16_t dLid, int outPortNum, int switchPathOnly); - // propagate FDB assignments going up the tree ignoring the out port - int assignLftUpWards(FatTreeNode *p_ftNode, uint16_t dLid, - int outPortNum, int switchPathOnly); + // set FDB values as given in the input + int forceLftUpWards(FatTreeNode *p_ftNode, uint16_t dLid, vec_int ports); - // propagate FDB assignments going down the tree - int - assignLftDownWards(FatTreeNode *p_ftNode, uint16_t dLid, - int outPortNum, int switchPathOnly); + // propagate FDB assignments going down the tree + int assignLftDownWards(FatTreeNode *p_ftNode, uint16_t dLid, int outPortNum, int switchPathOnly, int downOnly); - // route the fat tree - int route(); + // route the fat tree + int route(); - // create the file ftree.hcas with the list of HCA port names - // and LIDs in the correct order - void dumpHcaOrder(); + // route requested permutation in the fat tree + int permRoute(vector src, vector dst); + + // create the file ftree.hcas with the list of HCA port names + // and LIDs in the correct order + void dumpHcaOrder(); }; FatTreeNode* FatTree::getFatTreeNodeByNode(IBNode *p_node) { - FatTreeNode* p_ftNode; - vec_byte tupple(N, 0); - tupple = TuppleByNode[p_node]; - p_ftNode = &NodeByTupple[tupple]; - return p_ftNode; + FatTreeNode* p_ftNode; + vec_byte tupple(N, 0); + tupple = TuppleByNode[p_node]; + p_ftNode = &NodeByTupple[tupple]; + return p_ftNode; } // get the first lowest level switch while making sure all HCAs @@ -240,124 +261,124 @@ FatTreeNode* FatTree::getFatTreeNodeByNode(IBNode *p_node) { // return NULL if this check is not met or no 
ranking available IBNode *FatTree::getLowestLevelSwitchNode() { - unsigned int leafRank = 0; - IBNode *p_leafSwitch = NULL; - IBPort *p_port; - - // go over all HCAs and track the rank of the node connected to them - for( map_str_pnode::iterator nI = p_fabric->NodeByName.begin(); - nI != p_fabric->NodeByName.end(); - nI++) - { + unsigned int leafRank = 0; + IBNode *p_leafSwitch = NULL; + IBPort *p_port; + + // go over all HCAs and track the rank of the node connected to them + for( map_str_pnode::iterator nI = p_fabric->NodeByName.begin(); + nI != p_fabric->NodeByName.end(); + nI++) + { IBNode *p_node = (*nI).second; if (p_node->type != IB_CA_NODE) continue; for (unsigned int pn = 1; pn <= p_node->numPorts; pn++) - { - p_port = p_node->getPort(pn); - if (p_port && p_port->p_remotePort) - { - IBNode *p_remNode = p_port->p_remotePort->p_node; - - if (p_remNode->type != IB_SW_NODE) continue; - - // is the remote node ranked? - if (!p_remNode->rank) continue; - - // must be identical for all leaf switches: - if (!leafRank) - { - leafRank = p_remNode->rank; - p_leafSwitch = p_remNode; - } - else - { - // get the lowest name - if (p_remNode->name < p_leafSwitch->name ) - p_leafSwitch = p_remNode; - - if (p_remNode->rank != leafRank) - { - cout << "-E- Given topology is not a fat tree. HCA:" - << p_remNode->name - << " found not on lowest level!" << endl; - return(NULL); - } - } - } - } - } - return(p_leafSwitch); + { + p_port = p_node->getPort(pn); + if (p_port && p_port->p_remotePort) + { + IBNode *p_remNode = p_port->p_remotePort->p_node; + + if (p_remNode->type != IB_SW_NODE) continue; + + // is the remote node ranked? 
+ if (!p_remNode->rank) continue; + + // must be identical for all leaf switches: + if (!leafRank) + { + leafRank = p_remNode->rank; + p_leafSwitch = p_remNode; + } + else + { + // get the lowest name + if (p_remNode->name < p_leafSwitch->name ) + p_leafSwitch = p_remNode; + + if (p_remNode->rank != leafRank) + { + cout << "-E- Given topology is not a fat tree. HCA:" + << p_remNode->name + << " found not on lowest level!" << endl; + return(NULL); + } + } + } + } + } + return(p_leafSwitch); } // get a free tupple given the reference one and the index to change: // also track the max digit allocated per index vec_byte FatTree::getFreeTupple(vec_byte refTupple, unsigned int changeIdx) { - vec_byte res = refTupple; - int rank = changeIdx - 1; - for (uint8_t i = 0; i < 255; i++) - { + vec_byte res = refTupple; + int rank = changeIdx - 1; + for (uint8_t i = 0; i < 255; i++) + { res[changeIdx] = i; map_tupple_ftnode::const_iterator tI = NodeByTupple.find(res); if (tI == NodeByTupple.end()) - return res; - } - cout << "ABORT: fail to get free tupple! (in 255 indexies)" << endl; - abort(); + return res; + } + cout << "ABORT: fail to get free tupple! 
(in 255 indexies)" << endl; + abort(); } string FatTree::getTuppleStr(vec_byte tupple) { - char buf[128]; - buf[0] = '\0'; - for (unsigned int i = 0; i < tupple.size(); i++) - { + char buf[128]; + buf[0] = '\0'; + for (unsigned int i = 0; i < tupple.size(); i++) + { if (i) strcat(buf,"."); sprintf(buf, "%s%d", buf, tupple[i]); - } - return(string(buf)); + } + return(string(buf)); } // track connection going up or down by registering the port in the // correct fat tree node childPorts and parentPorts int FatTree::trackConnection( - FatTreeNode *p_ftNode, // the connected node - vec_byte tupple, // the connected node tupple - unsigned int rank, // rank of the local node - unsigned int remRank, // rank of the remote node - unsigned int portNum, // the port number connecting to the remote node - unsigned int remDigit // the digit of the tupple changing to the remote node - ) + FatTreeNode *p_ftNode, // the connected node + vec_byte tupple, // the connected node tupple + unsigned int rank, // rank of the local node + unsigned int remRank, // rank of the remote node + unsigned int portNum, // the port number connecting to the remote node + unsigned int remDigit // the digit of the tupple changing to the remote node + ) { - if ( rank < remRank ) - { + if ( rank < remRank ) + { // going down // make sure we have enough entries in the vector if (remDigit >= p_ftNode->childPorts.size()) - { - list< int > emptyPortList; - for (unsigned int i = p_ftNode->childPorts.size(); - i <= remDigit; i++) + { + list< int > emptyPortList; + for (unsigned int i = p_ftNode->childPorts.size(); + i <= remDigit; i++) p_ftNode->childPorts.push_back(emptyPortList); - } + } p_ftNode->childPorts[remDigit].push_back(portNum); - } - else - { + } + else + { // going up // make sure we have enough entries in the vector if (remDigit >= p_ftNode->parentPorts.size()) - { - list< int > emptyPortList; - for (unsigned int i = p_ftNode->parentPorts.size(); - i <= remDigit; i++) + { + list< int > 
emptyPortList; + for (unsigned int i = p_ftNode->parentPorts.size(); + i <= remDigit; i++) p_ftNode->parentPorts.push_back(emptyPortList); - } + } p_ftNode->parentPorts[remDigit].push_back(portNum); - } + } - return(0); + return(0); } // Extract fat tree coefficiants and double check its @@ -365,19 +386,19 @@ int FatTree::trackConnection( int FatTree::extractCoefficients() { - // Go over all levels of the tree. - // Collect number of nodes per each level - // Require the number of children is equal - // Require the number of parents is equal - - int prevLevel = -1; - int anyErr = 0; - - // go over all nodes - for (map_tupple_ftnode::iterator tI = NodeByTupple.begin(); - tI != NodeByTupple.end(); - tI++) - { + // Go over all levels of the tree. + // Collect number of nodes per each level + // Require the number of children is equal + // Require the number of parents is equal + + int prevLevel = -1; + int anyErr = 0; + + // go over all nodes + for (map_tupple_ftnode::iterator tI = NodeByTupple.begin(); + tI != NodeByTupple.end(); + tI++) + { FatTreeNode *p_ftNode = &((*tI).second); int level = (*tI).first[0]; bool isFirstInLevel; @@ -386,118 +407,118 @@ FatTree::extractCoefficients() prevLevel = level; if (isFirstInLevel) - { - numSwInRank.push_back(1); - parentsPerRank.push_back(p_ftNode->numParents()); - childrenPerRank.push_back(p_ftNode->numChildren()); - downByRank.push_back(p_ftNode->numChildGroups()); - upByRank.push_back(p_ftNode->numParentGroups()); - } + { + numSwInRank.push_back(1); + parentsPerRank.push_back(p_ftNode->numParents()); + childrenPerRank.push_back(p_ftNode->numChildren()); + downByRank.push_back(p_ftNode->numChildGroups()); + upByRank.push_back(p_ftNode->numParentGroups()); + } else - { - numSwInRank[level]++; - if (parentsPerRank[level] != p_ftNode->numParents()) - { - if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE) - cout << "-E- node:" << p_ftNode->p_node->name - << " has unequal number of parent ports to its level" - << endl; - anyErr++; 
- } - - // we do not require symmetrical routing for leafs - if (level < N-1) - { - if (childrenPerRank[level] != p_ftNode->numChildren()) - { - if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE) - cout << "-E- node:" << p_ftNode->p_node->name << - " has unequal number of child ports to its level" << endl; - anyErr++; - } - } - } - } + { + numSwInRank[level]++; + if (parentsPerRank[level] != p_ftNode->numParents()) + { + if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE) + cout << "-E- node:" << p_ftNode->p_node->name + << " has unequal number of parent ports to its level" + << endl; + anyErr++; + } + + // we do not require symmetrical routing for leafs + if (level < N-1) + { + if (childrenPerRank[level] != p_ftNode->numChildren()) + { + if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE) + cout << "-E- node:" << p_ftNode->p_node->name << + " has unequal number of child ports to its level" << endl; + anyErr++; + } + } + } + } - if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE) - { + if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE) + { for (int rank = 0; rank < numSwInRank.size(); rank++) { - cout << "-I- rank:" << rank - << " switches:" << numSwInRank[rank] - << " parents: " << parentsPerRank[rank] - << " (" << upByRank[rank] << " groups)" - << " children:" << childrenPerRank[rank] - << " (" << downByRank[rank] << " groups)" - << endl; + cout << "-I- rank:" << rank + << " switches:" << numSwInRank[rank] + << " parents: " << parentsPerRank[rank] + << " (" << upByRank[rank] << " groups)" + << " children:" << childrenPerRank[rank] + << " (" << downByRank[rank] << " groups)" + << endl; } - } + } - if (anyErr) return 1; + if (anyErr) return 1; - vec_byte firstLeafTupple(N, 0); - firstLeafTupple[0] = N-1; - maxHcasPerLeafSwitch = 0; - for (map_tupple_ftnode::iterator tI = NodeByTupple.find(firstLeafTupple); - tI != NodeByTupple.end(); - tI++) - { + vec_byte firstLeafTupple(N, 0); + firstLeafTupple[0] = N-1; + maxHcasPerLeafSwitch = 0; + for (map_tupple_ftnode::iterator tI = 
NodeByTupple.find(firstLeafTupple); + tI != NodeByTupple.end(); + tI++) + { FatTreeNode *p_ftNode = &((*tI).second); IBNode *p_node = p_ftNode->p_node; int numHcaPorts = 0; for (unsigned int pn = 1; pn <= p_node->numPorts; pn++) - { - IBPort *p_port = p_node->getPort(pn); - if (p_port && p_port->p_remotePort && - (p_port->p_remotePort->p_node->type == IB_CA_NODE)) - { - numHcaPorts++; - } + { + IBPort *p_port = p_node->getPort(pn); + if (p_port && p_port->p_remotePort && + (p_port->p_remotePort->p_node->type == IB_CA_NODE)) + { + numHcaPorts++; + } - } + } if (numHcaPorts > maxHcasPerLeafSwitch) - maxHcasPerLeafSwitch = numHcaPorts; - } + maxHcasPerLeafSwitch = numHcaPorts; + } - if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE) - cout << "-I- HCAs per leaf switch set to:" - << maxHcasPerLeafSwitch << endl; + if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE) + cout << "-I- HCAs per leaf switch set to:" + << maxHcasPerLeafSwitch << endl; - cout << "-I- Topology is a valid Fat Tree" << endl; - isValid = 1; + cout << "-I- Topology is a valid Fat Tree" << endl; + isValid = 1; - return 0; + return 0; } // construct the fat tree by matching the topology to it. FatTree::FatTree(IBFabric *p_f) { - isValid = 0; - p_fabric = p_f; + isValid = 0; + p_fabric = p_f; - IBNode *p_node = getLowestLevelSwitchNode(); - IBPort *p_port; - FatTreeNode *p_ftNode; + IBNode *p_node = getLowestLevelSwitchNode(); + IBPort *p_port; + FatTreeNode *p_ftNode; - if (! p_node) return; - N = p_node->rank + 1; // N = number of levels (our first rank is 0 ...) + if (! p_node) return; + N = p_node->rank + 1; // N = number of levels (our first rank is 0 ...) 
- // BFS from the first switch connected to HCA found on the fabric - list< IBNode * > bfsQueue; - bfsQueue.push_back(p_node); + // BFS from the first switch connected to HCA found on the fabric + list< IBNode * > bfsQueue; + bfsQueue.push_back(p_node); - // also we always allocate the address 0..0 with "rank" digits to the node: - vec_byte tupple(N, 0); + // also we always allocate the address 0..0 with "rank" digits to the node: + vec_byte tupple(N, 0); - // adjust the level: - tupple[0] = p_node->rank; - TuppleByNode[p_node] = tupple; - NodeByTupple[tupple] = FatTreeNode(p_node); - if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE) - cout << "-I- Assigning tupple:" << getTuppleStr(tupple) << " to:" - << p_node->name << endl; + // adjust the level: + tupple[0] = p_node->rank; + TuppleByNode[p_node] = tupple; + NodeByTupple[tupple] = FatTreeNode(p_node); + if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE) + cout << "-I- Assigning tupple:" << getTuppleStr(tupple) << " to:" + << p_node->name << endl; - while (! bfsQueue.empty()) - { + while (! 
bfsQueue.empty()) + { p_node = bfsQueue.front(); bfsQueue.pop_front(); // we must have a tupple stored - get it @@ -507,137 +528,141 @@ FatTree::FatTree(IBFabric *p_f) // go over all the node ports for (unsigned int pn = 1; pn <= p_node->numPorts; pn++) - { - p_port = p_node->getPort(pn); - if (!p_port || !p_port->p_remotePort) continue; - - IBNode *p_remNode = p_port->p_remotePort->p_node; - - if (p_remNode->type != IB_SW_NODE) - { - // for HCAs we only track the conenctions - list< int > tmpList; - tmpList.push_back(pn); - p_ftNode->childPorts.push_back(tmpList); - continue; - } - - // now try to see if this node has already a map: - map_pnode_vec_byte::iterator tI = TuppleByNode.find(p_remNode); - - // we are allowed to change the digit based on the direction we go: - unsigned int changingDigitIdx; - if (p_node->rank < p_remNode->rank) + { + p_port = p_node->getPort(pn); + if (!p_port || !p_port->p_remotePort) continue; + + IBNode *p_remNode = p_port->p_remotePort->p_node; + + if (p_remNode->type != IB_SW_NODE) + { + // for HCAs we only track the conenctions + list< int > tmpList; + tmpList.push_back(pn); + p_ftNode->childPorts.push_back(tmpList); + continue; + } + + // now try to see if this node has already a map: + map_pnode_vec_byte::iterator tI = TuppleByNode.find(p_remNode); + + // we are allowed to change the digit based on the direction we go: + unsigned int changingDigitIdx; + if (p_node->rank < p_remNode->rank) // going down the tree = use the current rank + 1 // (save one for level) changingDigitIdx = p_node->rank + 1; - else if (p_node->rank > p_remNode->rank) + else if (p_node->rank > p_remNode->rank) // goin up the tree = use current rank (first one is level) changingDigitIdx = p_node->rank; - else - { - cout << "-E- Connections on the same rank level " - << " are not allowed in Fat Tree routing." 
<< endl; - cout << " from:" << p_node->name << "/P" << pn - << " to:" << p_remNode->name << endl; - return; - } - - // do we need to allocate a new tupple? - if (tI == TuppleByNode.end()) - { - - // the node is new - so get a new tupple for it: - vec_byte newTupple = tupple; - // change the level accordingly - newTupple[0] = p_remNode->rank; - // obtain a free one - newTupple = getFreeTupple(newTupple, changingDigitIdx); - - // assign the new tupple and add to next steps: - if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE) - cout << "-I- Assigning tupple:" << getTuppleStr(newTupple) - << " to:" << p_remNode->name << " changed idx:" - << changingDigitIdx << " from:" << getTuppleStr(tupple) - << endl; - - TuppleByNode[p_remNode] = newTupple; - NodeByTupple[newTupple] = FatTreeNode(p_remNode); - - unsigned int digit = newTupple[changingDigitIdx]; - - // track the connection - if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE) - cout << "-I- Connecting:" << p_node->name << " to:" - << p_remNode->name << " through port:" << pn - << " remDigit:" << digit << endl; - if (trackConnection( - p_ftNode, tupple, p_node->rank, p_remNode->rank, pn, digit)) - return; - - bfsQueue.push_back(p_remNode); - } - else - { - // other side already has a tupple - so just track the connection - vec_byte remTupple = (*tI).second; - vec_byte mergedTupple = remTupple; - - unsigned int digit = remTupple[changingDigitIdx]; - - if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE) - cout << "-I- Connecting:" << p_node->name << " to:" - << p_remNode->name << " through port:" << pn - << " remDigit:" << digit << endl; - if (trackConnection( - p_ftNode, tupple, p_node->rank, p_remNode->rank, pn, digit)) - return; - } - - } // all ports - } // anything to do - - // make sure the extracted tropology can be declared "fat tree" - if (extractCoefficients()) return; - - // build mapping between HCA index and LIDs. - // We need to decide what will be the K of the lowest switches level. 
-  // It is possible that for all of them the number of HCAs is < num
-  // left ports thus we should probably use the lowest number of all
-  vec_byte firstLeafTupple(N, 0);
-  firstLeafTupple[0] = N-1;
-
-  // now restart going over all leaf switches by their tupple order and
-  // allocate mapping
-  for (map_tupple_ftnode::iterator tI = NodeByTupple.find(firstLeafTupple);
-       tI != NodeByTupple.end();
-       tI++)
-  {
+      else
+      {
+        cout << "-E- Connections on the same rank level "
+             << " are not allowed in Fat Tree routing." << endl;
+        cout << " from:" << p_node->name << "/P" << pn
+             << " to:" << p_remNode->name << endl;
+        return;
+      }
+
+      // do we need to allocate a new tupple?
+      if (tI == TuppleByNode.end())
+      {
+
+        // the node is new - so get a new tupple for it:
+        vec_byte newTupple = tupple;
+        // change the level accordingly
+        newTupple[0] = p_remNode->rank;
+        // obtain a free one
+        newTupple = getFreeTupple(newTupple, changingDigitIdx);
+
+        // assign the new tupple and add to next steps:
+        if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
+          cout << "-I- Assigning tupple:" << getTuppleStr(newTupple)
+               << " to:" << p_remNode->name << " changed idx:"
+               << changingDigitIdx << " from:" << getTuppleStr(tupple)
+               << endl;
+
+        TuppleByNode[p_remNode] = newTupple;
+        NodeByTupple[newTupple] = FatTreeNode(p_remNode);
+
+        unsigned int digit = newTupple[changingDigitIdx];
+
+        // track the connection
+        if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
+          cout << "-I- Connecting:" << p_node->name << " to:"
+               << p_remNode->name << " through port:" << pn
+               << " remDigit:" << digit << endl;
+        if (trackConnection(
+              p_ftNode, tupple, p_node->rank, p_remNode->rank, pn, digit))
+          return;
+
+        bfsQueue.push_back(p_remNode);
+      }
+      else
+      {
+        // other side already has a tupple - so just track the connection
+        vec_byte remTupple = (*tI).second;
+        vec_byte mergedTupple = remTupple;
+
+        unsigned int digit = remTupple[changingDigitIdx];
+
+        if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
+          cout << "-I- Connecting:" << p_node->name << " to:"
+               << p_remNode->name << " through port:" << pn
+               << " remDigit:" << digit << endl;
+        if (trackConnection(
+              p_ftNode, tupple, p_node->rank, p_remNode->rank, pn, digit))
+          return;
+      }
+
+    } // all ports
+  } // anything to do
+
+  // make sure the extracted tropology can be declared "fat tree"
+  if (extractCoefficients()) return;
+
+  // build mapping between HCA index and LIDs.
+  // We need to decide what will be the K of the lowest switches level.
+  // It is possible that for all of them the number of HCAs is < num
+  // left ports thus we should probably use the lowest number of all
+  vec_byte firstLeafTupple(N, 0);
+  firstLeafTupple[0] = N-1;
+
+  // now restart going over all leaf switches by their tupple order and
+  // allocate mapping
+  int hcaIdx = 0;
+  for (map_tupple_ftnode::iterator tI = NodeByTupple.find(firstLeafTupple);
+       tI != NodeByTupple.end();
+       tI++)
+  {
     // we collect HCAs connected to the leaf switch and set their childPort
     // starting at the index associated with the switch tupple.
     FatTreeNode *p_ftNode = &((*tI).second);
     IBNode *p_node = p_ftNode->p_node;
     unsigned int pIdx = 0;
     for (unsigned int pn = 1; pn <= p_node->numPorts; pn++)
-    {
-      IBPort *p_port = p_node->getPort(pn);
-      if (p_port && p_port->p_remotePort &&
-          (p_port->p_remotePort->p_node->type == IB_CA_NODE))
-      {
-        LidByIdx.push_back(p_port->p_remotePort->base_lid);
-        pIdx++;
-      }
-    }
+    {
+      IBPort *p_port = p_node->getPort(pn);
+      if (p_port && p_port->p_remotePort &&
+          (p_port->p_remotePort->p_node->type == IB_CA_NODE))
+      {
+        LidByIdx.push_back(p_port->p_remotePort->base_lid);
+        IdxByName[p_port->p_remotePort->p_node->name] = hcaIdx;
+        pIdx++;
+        hcaIdx++;
+      }
+    }
     // we might need some padding
     for (; pIdx < maxHcasPerLeafSwitch; pIdx++) {
-      LidByIdx.push_back(0);
+      LidByIdx.push_back(0);
+      hcaIdx++;
     }
-  }
+  }
 
-  cout << "-I- Fat Tree Created" << endl;
+  cout << "-I- Fat Tree Created" << endl;
 
-  if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
-    dump();
+  if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
+    dump();
 }
 
 //////////////////////////////////////////////////////////////////////////////
@@ -664,100 +689,100 @@ int
 FatTree::assignLftUpWards(FatTreeNode *p_ftNode, uint16_t dLid,
                           int outPortNum, int switchPathOnly)
 {
-  IBPort* p_port;
-  IBNode *p_node = p_ftNode->p_node;
-
-  if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
-    cout << "-V- assignLftUpWards invoked on node:" << p_node->name
-         << " out-port:" << outPortNum
-         << " to dlid:" << dLid
-         << " switchPathOnly:" << switchPathOnly
-         << endl;
-
-  // Foreach one of the child port groups select the port which is
-  // less utilized and set its LFT - then recurse into it
-  // go over all child ports
-  for (int i = 0; i < p_ftNode->childPorts.size(); i++) {
-    if (!p_ftNode->childPorts[i].size()) continue;
+  IBPort* p_port;
+  IBNode *p_node = p_ftNode->p_node;
+
+  if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
+    cout << "-V- assignLftUpWards invoked on node:" << p_node->name
+         << " out-port:" << outPortNum
+         << " to dlid:" << dLid
+         << " switchPathOnly:" << switchPathOnly
+         << endl;
+
+  // Foreach one of the child port groups select the port which is
+  // less utilized and set its LFT - then recurse into it
+  // go over all child ports
+  for (int i = 0; i < p_ftNode->childPorts.size(); i++) {
+    if (!p_ftNode->childPorts[i].size()) continue;
+
+    // we can skip handling the remote node if
+    // it already has an assigned LFT for this target lid
+    int firstPortNum = p_ftNode->childPorts[i].front();
+    IBPort *p_firstPort = p_node->getPort(firstPortNum);
+    IBNode *p_remNode = p_firstPort->p_remotePort->p_node;
+    if (p_remNode->getLFTPortForLid(dLid) != IB_LFT_UNASSIGNED)
+    {
+      if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
+        cout << "-V- assignLftUpWards skip already assigned remote node:"
+             << p_remNode->name
+             << " switchPathOnly:" << switchPathOnly
+             << endl;
+      continue;
+    }
 
-    // we can skip handling the remote node if
-    // it already has an assigned LFT for this target lid
-    int firstPortNum = p_ftNode->childPorts[i].front();
-    IBPort *p_firstPort = p_node->getPort(firstPortNum);
-    IBNode *p_remNode = p_firstPort->p_remotePort->p_node;
-    if (p_remNode->getLFTPortForLid(dLid) != IB_LFT_UNASSIGNED)
-    {
-      if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
-        cout << "-V- assignLftUpWards skip already assigned remote node:"
-             << p_remNode->name
-             << " switchPathOnly:" << switchPathOnly
-             << endl;
-      continue;
-    }
+    int bestUsage = 0;
+    IBPort *p_bestPort = NULL;
+    int found = 0;
 
-    int bestUsage = 0;
-    IBPort *p_bestPort = NULL;
-    int found = 0;
-
-    // we only need one best port on each group
-    for (list<int>::iterator lI = p_ftNode->childPorts[i].begin();
-         !found && (lI != p_ftNode->childPorts[i].end());
-         lI++) {
-
-      // can not have more then one port in group...
-      int portNum = *lI;
-
-      // we do not want to descend back to the original port
-      if (portNum == outPortNum)
-      {
-        p_bestPort = NULL;
-        found = 1;
-        continue;
-      }
-
-      IBPort *p_port = p_node->getPort(portNum);
-      // not required but what the hack...
-      if (!p_port || !p_port->p_remotePort) continue;
-      IBPort *p_remPort = p_port->p_remotePort;
-
-      // ignore remote HCA nodes
-      if (p_remPort->p_node->type != IB_SW_NODE) continue;
-
-      // look on the local usage as we mark usage entering a port
-      int usage = p_port->counter1;
-      if (switchPathOnly)
-        usage += p_port->counter2;
-      if ((p_bestPort == NULL) || (usage < bestUsage))
-      {
-        p_bestPort = p_port;
-        bestUsage = usage;
-      }
-    }
+    // we only need one best port on each group
+    for (list<int>::iterator lI = p_ftNode->childPorts[i].begin();
+         !found && (lI != p_ftNode->childPorts[i].end());
+         lI++) {
+
+      // can not have more then one port in group...
+      int portNum = *lI;
+
+      // we do not want to descend back to the original port
+      if (portNum == outPortNum)
+      {
+        p_bestPort = NULL;
+        found = 1;
+        continue;
+      }
+
+      IBPort *p_port = p_node->getPort(portNum);
+      // not required but what the hack...
+      if (!p_port || !p_port->p_remotePort) continue;
+      IBPort *p_remPort = p_port->p_remotePort;
+
+      // ignore remote HCA nodes
+      if (p_remPort->p_node->type != IB_SW_NODE) continue;
 
-    if (p_bestPort != NULL)
+      // look on the local usage as we mark usage entering a port
+      int usage = p_port->counter1;
+      if (switchPathOnly)
+        usage += p_port->counter2;
+      if ((p_bestPort == NULL) || (usage < bestUsage))
+      {
+        p_bestPort = p_port;
+        bestUsage = usage;
+      }
+    }
+
+    if (p_bestPort != NULL)
     {
-      // mark utilization
-      if (switchPathOnly)
-        p_bestPort->counter2++;
-      else
-        p_bestPort->counter1++;
-
-      IBPort *p_bestRemPort = p_bestPort->p_remotePort;
-      p_remNode->setLFTPortForLid(dLid, p_bestRemPort->num);
-
-      if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
-        cout << "-V- assignLftUpWards setting lft on:" << p_remNode->name
-             << " to port:" << p_bestRemPort->num
-             << " to dlid:" << dLid << endl;
-
-      FatTreeNode *p_remFTNode =
-        getFatTreeNodeByNode(p_bestRemPort->p_node);
-      assignLftUpWards(p_remFTNode, dLid, p_bestRemPort->num,
-                       switchPathOnly);
+      // mark utilization
+      if (switchPathOnly)
+        p_bestPort->counter2++;
+      else
+        p_bestPort->counter1++;
+
+      IBPort *p_bestRemPort = p_bestPort->p_remotePort;
+      p_remNode->setLFTPortForLid(dLid, p_bestRemPort->num);
+
+      if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
+        cout << "-V- assignLftUpWards setting lft on:" << p_remNode->name
+             << " to port:" << p_bestRemPort->num
+             << " to dlid:" << dLid << endl;
+
+      FatTreeNode *p_remFTNode =
+        getFatTreeNodeByNode(p_bestRemPort->p_node);
+      assignLftUpWards(p_remFTNode, dLid, p_bestRemPort->num,
+                       switchPathOnly);
     }
-  }
+  }
 
-  return(0);
+  return(0);
 }
 
 // to allocate a port downwards we look at all ports
@@ -766,145 +791,139 @@ FatTree::assignLftUpWards(FatTreeNode *p_ftNode, uint16_t dLid,
 // we also start an upwards assignment to this node
 int
 FatTree::assignLftDownWards(FatTreeNode *p_ftNode, uint16_t dLid,
-                            int outPortNum, int switchPathOnly)
+                            int outPortNum, int switchPathOnly, int downOnly)
 {
-  IBPort *p_port;
-  IBNode *p_node = p_ftNode->p_node;
-
-  if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
-    cout << "-V- assignLftDownWards from:" << p_node->name
-         << " dlid:" << dLid
-         << " through port:" << outPortNum
-         << " switchPathOnly:" << switchPathOnly
-         << endl;
-
-  if (outPortNum != 0xFF)
-  {
-    // Set FDB to that LID only if not preset or we are on "main" route
-    if (!switchPathOnly ||
-        (p_node->getLFTPortForLid(dLid) == IB_LFT_UNASSIGNED)) {
-      p_node->setLFTPortForLid(dLid, outPortNum);
-
-      p_port = p_node->getPort(outPortNum);
-
-      // mark the usage of this port
-      if (p_port) {
-        if (switchPathOnly) {
-          p_port->counter2++;
-        } else {
-          p_port->counter1++;
-        }
-      }
-    }
-  }
-
-  // find the remote port (following the parents list order)
-  // that is not used or less used.
-  int bestUsage = 0;
-  int bestGroup = -1;
-  IBPort *p_bestRemPort = NULL;
-  int found = 0;
-  // go over all child ports
-  for (int i = 0; !found && (i < p_ftNode->parentPorts.size()); i++) {
-    if (!p_ftNode->parentPorts[i].size()) continue;
-
-    for (list<int>::iterator lI = p_ftNode->parentPorts[i].begin();
-         !found && (lI != p_ftNode->parentPorts[i].end());
-         lI++) {
-
-      // can not have more then one port in group...
-      int portNum = *lI;
-      IBPort *p_port = p_node->getPort(portNum); // must be if marked parent
-      IBPort *p_remPort = p_port->p_remotePort;
-      if (p_remPort == NULL) continue;
-      int usage = p_remPort->counter1;
-      if (switchPathOnly)
-        usage += p_remPort->counter2;
-
-      if ((p_bestRemPort == NULL) || (usage < bestUsage))
-      {
-        p_bestRemPort = p_remPort;
-        bestUsage = usage;
-        bestGroup = i;
-        // can not have better usage then no usage
-        if (usage == 0)
-          found = 1;
-      }
+  IBPort *p_port;
+  IBNode *p_node = p_ftNode->p_node;
+
+  if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
+    cout << "-V- assignLftDownWards from:" << p_node->name
+         << " dlid:" << dLid
+         << " through port:" << outPortNum
+         << " switchPathOnly:" << switchPathOnly
+         << endl;
+
+  if (outPortNum != 0xFF)
+  {
+    // Set FDB to that LID only if not preset or we are on "main" route
+    if (!switchPathOnly || (p_node->getLFTPortForLid(dLid) == IB_LFT_UNASSIGNED)) {
+      p_node->setLFTPortForLid(dLid, outPortNum);
+
+      p_port = p_node->getPort(outPortNum);
+
+      // mark the usage of this port
+      if (p_port) {
+        if (switchPathOnly) {
+          p_port->counter2++;
+        } else {
+          p_port->counter1++;
+        }
+      }
     }
-  }
-
-  FatTreeNode *p_remFTNode;
-  // first visit the official path!
-  if (bestGroup != -1)
+  }
+
+  // find the remote port (following the parents list order)
+  // that is not used or less used.
+  int bestUsage = 0;
+  int bestGroup = -1;
+  IBPort *p_bestRemPort = NULL;
+  int found = 0;
+  // go over all child ports
+  for (int i = 0; !found && (i < p_ftNode->parentPorts.size()); i++) {
+    if (!p_ftNode->parentPorts[i].size()) continue;
+
+    for (list<int>::iterator lI = p_ftNode->parentPorts[i].begin();
+         !found && (lI != p_ftNode->parentPorts[i].end());
+         lI++) {
+
+      // can not have more then one port in group...
+      int portNum = *lI;
+      IBPort *p_port = p_node->getPort(portNum); // must be if marked parent
+      IBPort *p_remPort = p_port->p_remotePort;
+      if (p_remPort == NULL) continue;
+      int usage = p_remPort->counter1;
+      if (switchPathOnly)
+        usage += p_remPort->counter2;
+
+      if ((p_bestRemPort == NULL) || (usage < bestUsage))
       {
-        p_remFTNode = getFatTreeNodeByNode(p_bestRemPort->p_node);
-        if (!p_remFTNode)
-          cout << "-E- Fail to get FatTree Node for node:"
-               << p_bestRemPort->p_node->name << endl;
-        else
-          assignLftDownWards(p_remFTNode, dLid, p_bestRemPort->num,
-                             switchPathOnly);
+        p_bestRemPort = p_remPort;
+        bestUsage = usage;
+        bestGroup = i;
+        // can not have better usage then no usage
+        if (usage == 0)
+          found = 1;
       }
-
-  // need to go all up all the possible ways to make sure all switch are
-  // connected to all HCAs
-  for (int i = 0; i < p_ftNode->parentPorts.size(); i++) {
-    if (!p_ftNode->parentPorts[i].size()) continue;
-    IBPort* p_remPort;
-    // if we are on the "best group" we know the best port
-    if (bestGroup == i) continue;
-
-    // find the best port of the group i
-    p_bestRemPort = NULL;
-    found = 0;
-    for (list<int>::iterator lI = p_ftNode->parentPorts[i].begin();
-         !found && (lI != p_ftNode->parentPorts[i].end());
-         lI++) {
-
-      // can not have more then one port in group...
-      int portNum = *lI;
-      IBPort *p_port = p_node->getPort(portNum); // must be if marked parent
-      IBPort *p_remPort = p_port->p_remotePort;
-      if (p_remPort == NULL) continue;
-      int usage = p_remPort->counter1 + p_remPort->counter2;
-
-      if ((p_bestRemPort == NULL) || (usage < bestUsage))
-      {
-        p_bestRemPort = p_remPort;
-        bestUsage = usage;
-        // can not have better usage then no usage
-        if (usage == 0)
-          found = 1;
-      }
+    }
+  }
+
+  FatTreeNode *p_remFTNode;
+  // first visit the official path!
+  if (bestGroup != -1) {
+    p_remFTNode = getFatTreeNodeByNode(p_bestRemPort->p_node);
+    if (!p_remFTNode)
+      cout << "-E- Fail to get FatTree Node for node:"
+           << p_bestRemPort->p_node->name << endl;
+    else
+      assignLftDownWards(p_remFTNode, dLid, p_bestRemPort->num,switchPathOnly,downOnly);
+  }
+
+  // need to go all up all the possible ways to make sure all switch are
+  // connected to all HCAs
+  for (int i = 0; i < p_ftNode->parentPorts.size(); i++) {
+    if (!p_ftNode->parentPorts[i].size()) continue;
+    IBPort* p_remPort;
+    // if we are on the "best group" we know the best port
+    if (bestGroup == i) continue;
+
+    // find the best port of the group i
+    p_bestRemPort = NULL;
+    found = 0;
+    for (list<int>::iterator lI = p_ftNode->parentPorts[i].begin();!found && (lI != p_ftNode->parentPorts[i].end()); lI++) {
+      // can not have more then one port in group...
+      int portNum = *lI;
+      IBPort *p_port = p_node->getPort(portNum); // must be if marked parent
+      IBPort *p_remPort = p_port->p_remotePort;
+      if (p_remPort == NULL) continue;
+      int usage = p_remPort->counter1 + p_remPort->counter2;
+
+      if ((p_bestRemPort == NULL) || (usage < bestUsage)) {
+        p_bestRemPort = p_remPort;
+        bestUsage = usage;
+        // can not have better usage then no usage
+        if (usage == 0)
+          found = 1;
       }
-    p_remFTNode = getFatTreeNodeByNode(p_bestRemPort->p_node);
-    if (!p_remFTNode)
-      cout << "-E- Fail to get FatTree Node for node:"
-           << p_bestRemPort->p_node->name << endl;
-    else
-      assignLftDownWards(p_remFTNode, dLid, p_bestRemPort->num, 1);
-  }
-
-  // Perform Backward traversal through all ports connected to lower
-  // level switches in-port = out-port
-  assignLftUpWards(p_ftNode, dLid, outPortNum, switchPathOnly);
-
-  return(0);
+    }
+    p_remFTNode = getFatTreeNodeByNode(p_bestRemPort->p_node);
+    if (!p_remFTNode)
+      cout << "-E- Fail to get FatTree Node for node:"
+           << p_bestRemPort->p_node->name << endl;
+    else
+      assignLftDownWards(p_remFTNode, dLid, p_bestRemPort->num, 1, downOnly);
+  }
+
+  // Perform Backward traversal through all ports connected to lower
+  // level switches in-port = out-port
+  if (!downOnly)
+    assignLftUpWards(p_ftNode, dLid, outPortNum, switchPathOnly);
+
+  return(0);
 }
 
 // perform the routing by filling in the fabric LFTs
 int FatTree::route()
 {
-  int hcaIdx = 0;
-  int lid; // the target LID we propagate for this time
+  int hcaIdx = 0;
+  int lid; // the target LID we propagate for this time
 
-  // go over all fat tree nodes of the lowest level
-  vec_byte firstLeafTupple(N, 0);
-  firstLeafTupple[0] = N-1;
-  for (map_tupple_ftnode::iterator tI = NodeByTupple.find(firstLeafTupple);
-       tI != NodeByTupple.end();
-       tI++)
-  {
+  // go over all fat tree nodes of the lowest level
+  vec_byte firstLeafTupple(N, 0);
+  firstLeafTupple[0] = N-1;
+  for (map_tupple_ftnode::iterator tI = NodeByTupple.find(firstLeafTupple);
+       tI != NodeByTupple.end();
+       tI++)
+  {
 
     FatTreeNode *p_ftNode = &((*tI).second);
     IBNode *p_node = p_ftNode->p_node;
@@ -913,186 +932,414 @@ int FatTree::route()
 
     // go over all child ports
     for (int i = 0; i < p_ftNode->childPorts.size(); i++) {
-      if (!p_ftNode->childPorts[i].size()) continue;
-      // can not have more then one port in group...
-      int portNum = p_ftNode->childPorts[i].front();
-      numPortWithHCA++;
+      if (!p_ftNode->childPorts[i].size()) continue;
+      // can not have more then one port in group...
+      int portNum = p_ftNode->childPorts[i].front();
+      numPortWithHCA++;
 
-      lid = LidByIdx[hcaIdx];
+      lid = LidByIdx[hcaIdx];
 
-      if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
-        cout << "-V- Start routing LID:" << lid
-             << " at HCA idx:" << hcaIdx << endl;
-      assignLftDownWards(p_ftNode, lid, portNum, 0);
+      if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
+        cout << "-V- Start routing LID:" << lid
+             << " at HCA idx:" << hcaIdx << endl;
+      assignLftDownWards(p_ftNode, lid, portNum, 0,0);
 
-      hcaIdx++;
+      hcaIdx++;
     }
 
     // for ports without HCA we assign dummy LID but need to
     // propagate
     for (; numPortWithHCA < maxHcasPerLeafSwitch; numPortWithHCA++)
-    {
-      // HACK: for now we can propagate 0 as lid
-      if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
+    {
+      // HACK: for now we can propagate 0 as lid
+      if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
         cout << "-V- adding dummy LID to switch:"
              << p_node->name
              << " at HCA idx:" << hcaIdx << endl;
 
-      assignLftDownWards(p_ftNode, 0, 0xFF, 0);
+      assignLftDownWards(p_ftNode, 0, 0xFF, 0,0);
 
-      hcaIdx++;
-    }
-  }
+      hcaIdx++;
+    }
+  }
 
-  // now go over all switches and route to them
-  for (map_tupple_ftnode::iterator tI = NodeByTupple.begin();
-       tI != NodeByTupple.end();
-       tI++)
-  {
+  // now go over all switches and route to them
+  for (map_tupple_ftnode::iterator tI = NodeByTupple.begin();
+       tI != NodeByTupple.end();
+       tI++)
+  {
     FatTreeNode *p_ftNode = &((*tI).second);
     IBNode *p_node = p_ftNode->p_node;
 
-    if (p_node->type != IB_SW_NODE) continue;
+    if (p_node->type != IB_SW_NODE) continue;
 
-    // find the LID of the switch:
-    int lid = 0;
-    for (unsigned int pn = 1; (lid == 0) && (pn <= p_node->numPorts); pn++)
-    {
-      IBPort *p_port = p_node->getPort(pn);
-      if (p_port)
-        lid = p_port->base_lid;
-    }
-    if (lid == 0)
-    {
-      cout << "-E- failed to find LID for switch:" << p_node->name << endl;
-    } else {
-      if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
+    // find the LID of the switch:
+    int lid = 0;
+    for (unsigned int pn = 1; (lid == 0) && (pn <= p_node->numPorts); pn++)
+    {
+      IBPort *p_port = p_node->getPort(pn);
+      if (p_port)
+        lid = p_port->base_lid;
+    }
+    if (lid == 0)
+    {
+      cout << "-E- failed to find LID for switch:" << p_node->name << endl;
+    } else {
+      if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
         cout << "-V- routing to LID:" << lid << " of switch:"
              << p_node->name << endl;
-      assignLftDownWards(p_ftNode, lid, 0, 0);
-    }
+      assignLftDownWards(p_ftNode, lid, 0, 0, 0);
     }
+  }
+
+  return(0);
+}
+
+//////////////////////////////////////////////////////////////////////////////
+// Optimally Route a permutation in the Fat Tree
+// Prerequisites: Fat Tree structure was built.
+//
+// Algorithm:
+// For each leaf switch (in order)
+// For each HCA index (even if it does not have a LID - invent one)
+// Setup downward paths as previously
+// Traverse up from destination HCA and force output ports as
+// computed by the optimal routing
+//
+// Data Model:
+// We use the fat tree to get ordering.
+// "main" routing is the routing from HCA to HCA.
+// "side" routing is used from all SW to all HCAs (and dynamic routing)
+// Track port utilization for the "main" routing by the "counter1"
+// Track port utilzation of the "side" routing in "counter2" field of the port
+//
+//////////////////////////////////////////////////////////////////////////////
 
-  return(0);
+// set FDB values as given in the input
+int FatTree::forceLftUpWards(FatTreeNode *p_ftNode, uint16_t dLid, vec_int ports)
+{
+  // go over all steps
+  for (int i=0; igoingDown(dLid))
+      return 0;
+    // sanity check
+    if ((ports[i] < 0) || (ports[i] > p_ftNode->parentPorts.size())) {
+      cout << "-E- Illegal port number!" << endl;
+      return 1;
+    }
+    IBNode *p_node = p_ftNode->p_node;
+    int portNum = p_ftNode->parentPorts[ports[i]].front();
+
+    IBPort* p_port = p_node->getPort(portNum);
+    if (!p_port || !p_port->p_remotePort) {
+      cout << "-E- Ports do not exist!" << endl;
+      return 1;
+    }
+    IBNode* p_remNode = p_port->p_remotePort->p_node;
+    // Set LFT entry
+    p_node->setLFTPortForLid(dLid, portNum);
+    // Move to next step
+    p_ftNode = getFatTreeNodeByNode(p_remNode);
+  }
+  return 0;
+}
+
+// perform the routing by filling in the fabric LFTs
+int FatTree::permRoute(vector src, vector dst)
+{
+  int hcaIdx = 0;
+  int lid; // the target LID we propagate for this time
+  // extract radix
+  vec_byte tmpLeafTupple(N,0);
+  tmpLeafTupple[0] = N-1;
+  FatTreeNode* p_ftN = &NodeByTupple[tmpLeafTupple];
+  int radix = 0;
+  for (int i = 0; i < p_ftN->parentPorts.size(); i++)
+    if(p_ftN->parentPorts[i].size())
+      radix++;
+
+  // create requests vector
+  vec_int req;
+  req.resize(src.size());
+  for (int i=0; ip_node;
+    // we need to track the number of ports to handle case of missing HCAs
+    int numPortWithHCA = 0;
+
+    // go over all child ports
+    for (int i = 0; i < p_ftNode->childPorts.size(); i++) {
+      if (!p_ftNode->childPorts[i].size()) continue;
+      // can not have more then one port in group...
+      int portNum = p_ftNode->childPorts[i].front();
+      numPortWithHCA++;
+
+      lid = LidByIdx[hcaIdx];
+
+      if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
+        cout << "-V- Start routing LID:" << lid
+             << " at HCA idx:" << hcaIdx << endl;
+
+      // Assign downward LFT values
+      if (assignLftDownWards(p_ftNode, lid, portNum, 0, 1))
+        return 1;
+
+      hcaIdx++;
+    }
+
+    // for ports without HCA we assign dummy LID but need to
+    // propagate
+    for (; numPortWithHCA < maxHcasPerLeafSwitch; numPortWithHCA++) {
+      // HACK: for now we can propagate 0 as lid
+      if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
+        cout << "-V- adding dummy LID to switch:"
+             << p_node->name
+             << " at HCA idx:" << hcaIdx << endl;
+
+      assignLftDownWards(p_ftNode, 0, 0xFF, 0, 0);
+
+      hcaIdx++;
+    }
+  }
+
+  // Now prepare upwards routing
+  hcaIdx = 0;
+  for (map_tupple_ftnode::iterator tI = NodeByTupple.find(firstLeafTupple); tI != NodeByTupple.end(); tI++) {
+    FatTreeNode *p_ftNode = &((*tI).second);
+    IBNode *p_node = p_ftNode->p_node;
+
+    // go over all child ports
+    for (int i = 0; i < p_ftNode->childPorts.size(); i++) {
+      if (!p_ftNode->childPorts[i].size()) continue;
+      // can not have more then one port in group...
+      int portNum = p_ftNode->childPorts[i].front();
+
+      lid = LidByIdx[hcaIdx];
+
+      if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
+        cout << "-V- Start routing LID:" << lid
+             << " at HCA idx:" << hcaIdx << endl;
+
+      lid = LidByIdx[req[hcaIdx]];
+
+      if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
+        cout << "-V- Creating routing from "<p_node;
+
+    if (p_node->type != IB_SW_NODE) continue;
+
+    // find the LID of the switch:
+    int lid = 0;
+    for (unsigned int pn = 1; (lid == 0) && (pn <= p_node->numPorts); pn++)
+    {
+      IBPort *p_port = p_node->getPort(pn);
+      if (p_port)
+        lid = p_port->base_lid;
+    }
+    if (lid == 0)
+    {
+      cout << "-E- failed to find LID for switch:" << p_node->name << endl;
+    } else {
+      if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
+        cout << "-V- routing to LID:" << lid << " of switch:"
+             << p_node->name << endl;
+      assignLftDownWards(p_ftNode, lid, 0, 0, 0);
+    }
+  }
+
+  return(0);
 }
 
 // dumps out the HCA order into a file ftree.hca
 void FatTree::dumpHcaOrder()
 {
-  ofstream f("ftree.hcas");
-  for (unsigned int i = 0; i < LidByIdx.size(); i++)
-  {
+  ofstream f("ftree.hcas");
+  for (unsigned int i = 0; i < LidByIdx.size(); i++)
+  {
     // find the HCA node by the base lid given
     unsigned int lid = LidByIdx[i];
     if (lid <= 0)
-    {
-      f << "DUMMY_HOST LID" << endl;
-    }
+    {
+      f << "DUMMY_HOST LID" << endl;
+    }
     else
-    {
-      IBPort *p_port = p_fabric->PortByLid[lid];
-
-      if (! p_port)
-      {
-        cout << "-E- fail to find port for lid:" << lid << endl;
-        f << "ERROR_HOST LID" << endl;
-      }
-      else
-      {
-        f << p_port->p_node->name << "/" << p_port->num << " " << lid << endl;
-      }
-    }
-  }
-  f.close();
+    {
+      IBPort *p_port = p_fabric->PortByLid[lid];
+
+      if (! p_port)
+      {
+        cout << "-E- fail to find port for lid:" << lid << endl;
+        f << "ERROR_HOST LID" << endl;
+      }
+      else
+      {
+        f << p_port->p_node->name << "/" << p_port->num << " " << lid << endl;
+      }
+    }
+  }
+  f.close();
 }
 
 void FatTree::dump()
 {
-  unsigned int level, prevLevel = 2;
-  cout << "---------------------------------- FAT TREE DUMP -----------------------------" << endl;
-  for (map_tupple_ftnode::const_iterator tI = NodeByTupple.begin();
-       tI != NodeByTupple.end();
-       tI++)
-  {
+  unsigned int level, prevLevel = 2;
+  cout << "---------------------------------- FAT TREE DUMP -----------------------------" << endl;
+  for (map_tupple_ftnode::const_iterator tI = NodeByTupple.begin();
+       tI != NodeByTupple.end();
+       tI++)
+  {
     level = (*tI).first[0];
     if (level != prevLevel)
-    {
-      prevLevel = level;
-      cout << "LEVEL:" << level << endl;
-    }
+    {
+      prevLevel = level;
+      cout << "LEVEL:" << level << endl;
+    }
 
     FatTreeNode const *p_ftNode = &((*tI).second);
     cout << " " << p_ftNode->p_node->name << " tupple:"
          << getTuppleStr((*tI).first) << endl;
     for (unsigned int i = 0; i < p_ftNode->parentPorts.size(); i++)
-    {
-      if (p_ftNode->parentPorts[i].size())
-      {
-        cout << " Parents:" << i << endl;
-        for (list< int >::const_iterator lI = p_ftNode->parentPorts[i].begin();
-             lI != p_ftNode->parentPorts[i].end();
-             lI++)
-        {
-          unsigned int portNum = *lI;
-          cout << " p:" << portNum << " ";
-          IBPort *p_port = p_ftNode->p_node->getPort(portNum);
-          if (!p_port || !p_port->p_remotePort)
-            cout << " ERROR " << endl;
-          else
-            cout << p_port->p_remotePort->p_node->name << endl;
-        }
-      }
-    }
+    {
+      if (p_ftNode->parentPorts[i].size())
+      {
+        cout << " Parents:" << i << endl;
+        for (list< int >::const_iterator lI = p_ftNode->parentPorts[i].begin();
+             lI != p_ftNode->parentPorts[i].end();
+             lI++)
+        {
+          unsigned int portNum = *lI;
+          cout << " p:" << portNum << " ";
+          IBPort *p_port = p_ftNode->p_node->getPort(portNum);
+          if (!p_port || !p_port->p_remotePort)
+            cout << " ERROR " << endl;
+          else
+            cout << p_port->p_remotePort->p_node->name << endl;
+        }
+      }
+    }
 
     for (unsigned int i = 0; i < p_ftNode->childPorts.size(); i++)
-    {
-      if (p_ftNode->childPorts[i].size())
-      {
-        cout << " Children:" << i << endl;
-        for (list< int >::const_iterator lI = p_ftNode->childPorts[i].begin();
-             lI != p_ftNode->childPorts[i].end();
-             lI++)
-        {
-          unsigned int portNum = *lI;
-          cout << " p:" << portNum << " ";
-          IBPort *p_port = p_ftNode->p_node->getPort(portNum);
-          if (!p_port || !p_port->p_remotePort)
-            cout << "ERROR " << endl;
-          else
-            cout << p_port->p_remotePort->p_node->name << endl;
-        }
-      }
-    }
-  }
+    {
+      if (p_ftNode->childPorts[i].size())
+      {
+        cout << " Children:" << i << endl;
+        for (list< int >::const_iterator lI = p_ftNode->childPorts[i].begin();
+             lI != p_ftNode->childPorts[i].end();
+             lI++)
+        {
+          unsigned int portNum = *lI;
+          cout << " p:" << portNum << " ";
+          IBPort *p_port = p_ftNode->p_node->getPort(portNum);
+          if (!p_port || !p_port->p_remotePort)
+            cout << "ERROR " << endl;
+          else
+            cout << p_port->p_remotePort->p_node->name << endl;
+        }
+      }
+    }
+  }
 
-  // now dump the HCA by index:
-  cout << "\nLID BY INDEX" << endl;
-  for (unsigned int i = 0; i < LidByIdx.size(); i++) {
-    int lid = LidByIdx[i];
-    IBPort *p_port;
+  // now dump the HCA by index:
+  cout << "\nLID BY INDEX" << endl;
+  for (unsigned int i = 0; i < LidByIdx.size(); i++) {
+    int lid = LidByIdx[i];
+    IBPort *p_port;
 
-    if (lid != 0)
+    if (lid != 0)
     {
-      p_port = p_fabric->PortByLid[lid];
-      if (p_port)
-      {
+      p_port = p_fabric->PortByLid[lid];
+      if (p_port)
+      {
         cout << " " << i << " -> " << LidByIdx[i] << " " << p_port->getName() << endl;
-      }
-      else
-      {
+      }
+      else
+      {
         cout << " ERROR : no port for lid:" << lid << endl;
-      }
+      }
     }
-  }
+  }
 }
 
 // perform the whole thing
 int FatTreeAnalysis(IBFabric *p_fabric)
 {
-  FatTree ftree(p_fabric);
-  if (!ftree.isValid) return(1);
-  ftree.dumpHcaOrder();
-  if (ftree.route()) return(1);
-  return(0);
+  FatTree ftree(p_fabric);
+  if (!ftree.isValid) return(1);
+  ftree.dumpHcaOrder();
+  if (ftree.route()) return(1);
+  return(0);
+}
+
+int FatTreeRouteByPermutation( IBFabric *p_fabric, char *srcs, char* dsts )
+{
+  vector sources;
+  vector destinations;
+
+  char *s1, *s2, *cp;
+  char *saveptr;
+  s1 = strdup(srcs);
+  s2 = strdup(dsts);
+  cp = strtok_r(s1, " \t", &saveptr);
+  do {
+    sources.push_back(cp);
+    cp = strtok_r(NULL, " \t", &saveptr);
+  }
+  while (cp);
+
+  cp = strtok_r(s2, " \t", &saveptr);
+  do {
+    destinations.push_back(cp);
+    cp = strtok_r(NULL, " \t", &saveptr);
+  }
+  while (cp);
+
+  if (sources.size() != destinations.size()) {
+    cout << "-E- Different number of sources and destinations" << endl;
+    return 1;
+  }
+
+  FatTree ftree(p_fabric);
+  if (!ftree.isValid) return(1);
+  //ftree.dumpHcaOrder();
+  if (ftree.permRoute(sources, destinations)) return(1);
+  return(0);
 }
 
diff --git a/ibdm/datamodel/Makefile.am b/ibdm/datamodel/Makefile.am
index 0003bba..b0958fc 100644
--- a/ibdm/datamodel/Makefile.am
+++ b/ibdm/datamodel/Makefile.am
@@ -33,7 +33,7 @@
 #--
 
 # we would like to export these headers during install
-pkginclude_HEADERS = git_version.h Fabric.h \
+pkginclude_HEADERS = git_version.h Fabric.h RouteSys.h Bipartite.h \
 	SubnMgt.h TraceRoute.h CredLoops.h Regexp.h \
 	TopoMatch.h SysDef.h Congestion.h ibnl_parser.h ibdm.i
 
@@ -53,7 +53,7 @@ AM_YFLAGS = -d
 # libibdm - the TCL shared lib to be used as a package
 # ibdmsh - the TCL shell
 
-common_SOURCES = Fabric.cpp \
+common_SOURCES = Fabric.cpp RouteSys.cc Bipartite.cc \
 	SubnMgt.cpp TraceRoute.cpp CredLoops.cpp TopoMatch.cpp SysDef.cpp \
 	LinkCover.cpp Congestion.cpp ibnl_parser.cc ibnl_scanner.cc FatTree.cpp
 
diff --git a/ibdm/datamodel/RouteSys.cc b/ibdm/datamodel/RouteSys.cc
new file mode 100644
index 0000000..6c33224
--- /dev/null
+++ b/ibdm/datamodel/RouteSys.cc
@@ -0,0 +1,258 @@
+/*
+ * Copyright (c) 2008 Mellanox Technologies LTD. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.
You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * $Id$
+ */
+
+#include "RouteSys.h"
+#include "Bipartite.h"
+
+// Helper power function
+
+int RouteSys::myPow(int base, int pow) {
+  int res = 1;
+  for (int i=0; i 1) {
+    subSys = new RouteSys* [rad];
+    for (int i=0; i 1) {
+    for (int i=0; i" << dst << endl;
+
+    // Check port existence
+    if ((src >= ports) || (dst >= ports)) {
+      cout << "-E- Port index exceeds num ports! Ports: " << ports << ", src: " << src << ", dst: " << dst << endl;
+      return 1;
+    }
+    // Check port availability
+    if (inPorts[src].used || outPorts[dst]) {
+      cout << "-E- Port already used! src: " << src << ", dst: " << dst << endl;
+      return 1;
+    }
+    // Mark ports as used
+    inPorts[src].used = true;
+    inPorts[src].src = src;
+    inPorts[src].dst = dst;
+    inPorts[src].inputNum = src;
+    inPorts[src].outNum = dst;
+
+    outPorts[dst] = true;
+  }
+  return 0;
+}
+
+/////////////////////////////////////////////////////////////////////////////
+
+// Perform routing after requests were pushed
+
+int RouteSys::doRouting (vec_vec_int& out)
+{
+  // Atomic system nothing to do
+  if (ports == radix)
+    return 0;
+
+  if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
+    cout << "-V- Starting routing, step: " << step << ", height " << height << endl;
+
+  // Init the output structure
+  if (out.size() < ports) {
+    out.resize(ports);
+    for (int i=0; iconnectNodes(i/radix,inPorts[i].outNum/radix,inPorts[i]);
+
+  // Now decompose the graph to radix-1 graphs
+  while (1 < currRadix) {
+    for (int i=0; buff[idx][i] && imaximumMatch();
+        matchGraphs++;
+      }
+      // Now we can perform Euler decomposition
+      if (FabricUtilsVerboseLevel & FABU_LOG_VERBOSE)
+        cout << "-V- Performing Euler decompostion" << endl;
+      if (2*i+1 >= radix) {
+        cout << "-E- Graph index illegal"<< endl;
+        return 1;
+      }
+      buff[idx][i]->decompose(&buff[idx_new][2*i],&buff[idx_new][2*i+1]);
+      delete buff[idx][i];
+      buff[idx][i] = NULL;
+    }
+    idx = idx_new;
+    idx_new = (idx_new+1)%2;
+    currRadix = currRadix / 2;
+  }
+  // Collect all result graphs to array buff[2][i]
+  for (int i=matchGraphs; isetIterFirst()) {
+      cout << "-E- Empty graph found!" << endl;
+      return 1;
+    }
+    bool stop = false;
+    while (!stop) {
+      inputData d = G->getReqDat();
+      // Build output
+      if (out.size() <= d.src || out[d.src].size() <= step) {
+        cout << "Output index illegal" << endl;
+        return 1;
+      }
+      out[d.src][step] = i;
+      // Add request to sub-system
+      RouteSys* sub = subSys[i];
+      int inPort = d.inputNum/radix;
+      int outPort = d.outNum/radix;
+      if (sub->inPorts[inPort].used || sub->outPorts[outPort]) {
+        cout << "Port already used! inPort: " << inPort << ", outPort: " << outPort << endl;
+        return 1;
+      }
+      // Mark ports as used
+      sub->inPorts[inPort].used = true;
+      sub->inPorts[inPort].src = d.src;
+      sub->inPorts[inPort].dst = d.dst;
+      sub->inPorts[inPort].inputNum = inPort;
+      sub->inPorts[inPort].outNum = outPort;
+
+      outPorts[outPort] = true;
+      // Next request
+      if (!G->setIterNext()) {
+        stop = true;
+      }
+    }
+  }
+
+  // Free memory
+  for (int i=0; idoRouting(out)) {
+      cout << "-E- Subsystem routing failed!" << endl;
+      return 1;
+    }
+  }
+
+  return 0;
+}
diff --git a/ibdm/datamodel/RouteSys.h b/ibdm/datamodel/RouteSys.h
new file mode 100644
index 0000000..a5f5cfc
--- /dev/null
+++ b/ibdm/datamodel/RouteSys.h
@@ -0,0 +1,98 @@
+/*
+ * Copyright (c) 2008 Mellanox Technologies LTD. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * $Id$
+ */
+
+/*
+ * Fabric Utilities Project
+ *
+ * Permutation Routing System abstraction Header file
+ *
+ * Author: Vladimir Zdornov, Mellanox Technologies
+ *
+ */
+
+#ifndef IBDM_ROUTE_SYS_H_
+#define IBDM_ROUTE_SYS_H_
+
+#include
+#include "Fabric.h"
+
+using namespace std;
+
+class inputData
+{
+ public:
+  bool used;
+  int src;
+  int dst;
+  int inputNum;
+  int outNum;
+
+  inputData():used(false){}
+};
+
+// Routing system abstraction class
+class RouteSys {
+  // Basic parameters
+  int radix;
+  int height;
+  int step;
+  int ports;
+  // Ports data
+  inputData* inPorts;
+  bool* outPorts;
+  // Sub-systems
+  RouteSys** subSys;
+
+  int myPow(int base, int pow);
+
+ public:
+  // Constructor
+  RouteSys(int rad, int hgth, int s=0);
+  // Add communication requests to the system
+  // Format: i -> req[i]
+  // Restriction: Requests must form a complete permutation
+  int pushRequests(vec_int req);
+  // D'tor
+  ~RouteSys();
+  // Get input data for input port i
+  inputData& getInput(int i);
+  // Is output port i busy already?
+ bool& getOutput(int i); + + // Invoke the system level routing + int doRouting(vec_vec_int& out); +}; + + +#endif diff --git a/ibdm/datamodel/SubnMgt.h b/ibdm/datamodel/SubnMgt.h index 8e68818..99e763f 100644 --- a/ibdm/datamodel/SubnMgt.h +++ b/ibdm/datamodel/SubnMgt.h @@ -134,5 +134,9 @@ LinkCoverageAnalysis(IBFabric *p_fabric, list_pnode rootNodes); int FatTreeAnalysis(IBFabric *p_fabric); +// Perform FatTree optimal permutation routing +int +FatTreeRouteByPermutation(IBFabric* p_fabric, char* srcs, char* dsts); + #endif /* IBDM_SUBN_MGT_H */ diff --git a/ibdm/datamodel/ibdm.i b/ibdm/datamodel/ibdm.i index b4541ad..7f3e0dc 100644 --- a/ibdm/datamodel/ibdm.i +++ b/ibdm/datamodel/ibdm.i @@ -1263,6 +1263,9 @@ int ibdmFatTreeRoute(IBFabric *p_fabric, list_pnode rootNodes); %name(ibdmFatTreeAnalysis) int FatTreeAnalysis(IBFabric *p_fabric); // Performs FatTree structural analysis +%name(ibdmFatTreeRouteByPermutation) int FatTreeRouteByPermutation(IBFabric *p_fabric, char* srcs, char* dsts); +// Performs optimal permutation routing in FatTree + %name(ibdmVerifyCAtoCARoutes) int SubnMgtVerifyAllCaToCaRoutes(IBFabric *p_fabric); // Verify point to point connectivity -- 1.5.1.4 From vlad at lists.openfabrics.org Tue Oct 7 03:15:47 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 7 Oct 2008 03:15:47 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20081007-0200 daily build status Message-ID: <20081007101547.70CB8E60BEF@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with 
linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: Build failed on ppc64 with linux-2.6.24 Log: /home/vlad/tmp/ofa_1_4_kernel-20081007-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c: In function 'ehca_poll_eqs': /home/vlad/tmp/ofa_1_4_kernel-20081007-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:942: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type /home/vlad/tmp/ofa_1_4_kernel-20081007-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:946: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type make[4]: *** 
[/home/vlad/tmp/ofa_1_4_kernel-20081007-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081007-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081007-0200_linux-2.6.24_ppc64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081007-0200_linux-2.6.24_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.24' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From hal.rosenstock at gmail.com Tue Oct 7 06:22:00 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 7 Oct 2008 09:22:00 -0400 Subject: ***SPAM*** Re: [ofa-general] [PATCH 0/6] opensm: Unicast Routing Cache In-Reply-To: <48EA9ABA.6010509@dev.mellanox.co.il> References: <48E96928.8030200@dev.mellanox.co.il> <48EA9ABA.6010509@dev.mellanox.co.il> Message-ID: Hi Yevgeny, On Mon, Oct 6, 2008 at 7:09 PM, Yevgeny Kliteynik wrote: > Hi Hal, > > Hal Rosenstock wrote: >> >> Hi Yevgeny, >> >> On Sun, Oct 5, 2008 at 9:26 PM, Yevgeny Kliteynik >> wrote: >>> >>> Hi Sasha, >>> >>> The following series of 6 patches implements unicast routing cache >>> in OpenSM. 
>>> >>> This implementation (v2, previous version was sent before OFED 1.3) >>> was rewritten from scratch: >>> - no caching of existing connectivity >>> - no caching of existing lid matrices >>> - each switch has an LFT buffer that contains the result of >>> the last routing engine execution (instead of one buffer >>> in ucast_mgr) >>> - links/ports/nodes changes are spotted during the discovery >>> - only the links/ports/nodes that went down are cached >>> - when switch goes down, caching its lid matrices and LFT >>> >>> In one of the following cases we can use cached routing >>> - there is no topology change >>> - one or more CAs disappeared >>> - one or more leaf switches disappeared >>> In these cases cached routing is written to the switches as is >>> (unless the switch doesn't exist). >>> If there is any other topology change, existing cache is invalidated >>> and the routing engine(s) run as usual. >> >> Glad to see this! >> >> A few comments/questions: >> >> It seems that there is a LFT cache per switch. This seems to be a big >> memory penalty to me (in large subnets). So I have two questions >> related to this: >> Can this only be done this way when cached routing is being used ? > > Actually, I was thinking about something else: > Currently we have switch LFT implemented as osm_fwd_tbl_t. > I can remove the unnecessary complexity of the osm_fwd_tbl_t by replacing > it with a simple uint8_t array (same as LFT buffer). Then by simple > comparison I will check whether the recently calculated LFT > matches the switch's LFT, and if there is a match, then lft_buf > can be freed. In this case only the switches that have LFT different > from the recently calculated LFT will have both tables, which would be > rare and temporary - on the next heavy sweep the LFTs would match, and > lft_buf would be freed. Can the forwarding tables be removed ? How would paths be calculated/walked end to end on an SA PathRecord/MultiPathRecord query ? 
Would that then require query of the LFTs in the switches ? > Effectively, it won't have memory penalty. > It can be done in a separate patch. I think somehow eliminating the memory penalty is important. >> Also, when cached routing is being used, is this only needed for leaf >> switches ? > > No, it is needed for all the switches, because cache can also > handle non-leaf switch fast reset. OK; didn't realize that but it makes sense. >> I'm wondering when there is a cached node match whether the available >> peer ports/neighbors are validated (or something equivalent) to know >> caching is valid ? It might also include whether a switch is still a >> leaf switch (which may be redundant as that should show up as a peer >> port/neighbor change). It looks like the structure is there for this >> but I didn't review the code in detail. > > If I understood your question correctly, then yes, such validation > is done by osm_ucast_cache_validate() function. > Can you describe in more details the case that you are asking about? I'm just wondering about the preconditions to determine that the cached routing for a node is valid: Is it that the current port physical state LinkUp links are a subset of the cached ones ? >> Are you sure all the memory allocation failures are handled properly >> within the routing cache code ? What I mean is that NULL is returned >> and does this always result in a caching not used/routing recalculated >> ? Also, in that case, should some log message be indicated rather than >> hiding this ? > > I will check it. Thanks. -- Hal >> Nit: doc/current-routing.txt should also be updated for this feature. > > OK, separate patch. 
> > -- Yevgeny > >> -- Hal >> >>> The patches are: >>> - patch 1/6: move lft_buf from ucast_mgr to osm_switch >>> - patch 2/6: Add "-A" or "--ucast_cache" option to opensm >>> - patch 3/6: adding osm_ucast_cache.{c,h} files (this is >>> the cache implementation itself) >>> - patch 4/6: adding new cache files to makefile >>> - patch 5/6: integrating unicast cache into the discovery >>> and ucast manager >>> - patch 6/6: man entry for cached routing >>> >>> -- Yevgeny >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit >>> http://openib.org/mailman/listinfo/openib-general >>> >> > > > From vlad at mellanox.co.il Tue Oct 7 08:45:55 2008 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 07 Oct 2008 17:45:55 +0200 Subject: [ofa-general] [PATCH] IB/mlx4: Set RLKEY bit in QP context Message-ID: <48EB8433.1020108@mellanox.co.il> Set RLKEY bit in QP context so that QP can use Reserved L_Key for memory reference. 
Signed-off-by: Vladimir Sokolovsky --- drivers/infiniband/hw/mlx4/qp.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 9559248..baa01de 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -1058,6 +1058,9 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp, else sqd_event = 0; + if (!ibqp->uobject && cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT) + context->rlkey |= (1 << 4); + /* * Before passing a kernel QP to the HW, make sure that the * ownership bits of the send queue are set and the SQ -- 1.6.0.2.307.gc427 From cameron at harr.org Tue Oct 7 08:44:44 2008 From: cameron at harr.org (Cameron Harr) Date: Tue, 07 Oct 2008 09:44:44 -0600 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48EB28F7.7070301@vlnb.net> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EA3706.2080700@vlnb.net> <48EA6838.40706@harr.org> <48EA8A8F.2000301@harr.org> <48EB28F7.7070301@vlnb.net> Message-ID: <48EB83EC.2000005@harr.org> Vladislav Bolkhovitin wrote: > Cameron Harr wrote: >> Cameron Harr wrote: >>>> This is still too high. Considering that each CS is about 1 >>>> microsecond you can estimate how many IOPS's it costs you. >>> Dropping scst_threads down to 2, from 8, with 2 initiators, seems to >>> make a fairly significant difference, propelling me to a little over >>> 100K IOPs and putting the CS rate around 2:1, sometimes lower. 2 >>> threads gave the best performance compared to 1, 4 and 8. >> >> Just as a status update, I've gotten my best performance with >> scst_threads=3 on 2 initiators, and using a separate QP for each >> drive an initiator is writing to. 
I'm getting pretty consistent >> 112-115K IOPs using two initiators, each writing with 2 processes to >> the same 2 physical targets, using 512B blocks. Adding the second >> initiator only bumps me up by about 20K IOPs, but as all the CPUs are >> pegged around 99%, I'll take that as a bottleneck. Also, as a note >> from Vlad's advice, the CS rate is now around 70K/s on 115K IOPs, so >> it's not too bad. Interrupts (where this thread started), are around >> 200K/s - a lot higher than I thought they'd go, but I'm not >> complaining. :) > > Actually, what you did is tune your workload so it put nicely on all > the participating threads and CPU cores, so all the threads stay each > on its own CPU core and gracefully pass commands during processing to > each other being busy almost all the time. I.e. you put your system in > some kind of resonance. If you change your workload just a bit or > Linux scheduler changed in the next kernel version, your tuning would > be destroyed. > This "resonance" thought actually crossed my mind. I later went and ran the test locally and found that I got better performance via SRP than I did locally (good marketing for you :) ). The local run, using no networking, gave me around 2 CS/IO. It appeared that when I added the second initiator, the requests from the 2 initiators for a single target would get coalesced, which would improve the performance. > So, I wouldn't overestimate your results. As I already wrote, the only > real fix is to remove all the unneeded context switches between > threads during commands processing. This fix would work not only on > carefully tuned artificial workloads, but on real life ones too. 5-10 > threads participating in a single command processing reminds me the > famous set of histories about how many people of some kind is > necessary to change a burnt out lamp ;) Nice analogy :). I wish I knew how to eradicate the extra context switches. 
I'll try Bart's trick and see if I can get more info: From vuhuong at mellanox.com Tue Oct 7 09:21:05 2008 From: vuhuong at mellanox.com (Vu Pham) Date: Tue, 07 Oct 2008 09:21:05 -0700 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48EA7106.1050601@harr.org> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E6A372.8000702@mellanox.com> <48EA7106.1050601@harr.org> Message-ID: <48EB8C71.2010505@mellanox.com> Cameron Harr wrote: > > > Vu Pham wrote: >> Cameron Harr wrote: >>> Vu Pham wrote: >>>> >>>>> Alternatively, is there anything in the SCST layer I should tweak. >>>>> I'm >>>>> still running rev 245 of that code (kinda old, but works with OFED >>>>> 1.3.1 >>>>> w/o hacks). >> >> With blockio I get the best performance + stability with scst_threads=1 > > I got best performance with threads=2 or 3, and I've noticed that the > srpt_thread is often at 99%, though if I increase/decrease the > "thread=?" parameter for ib_srpt, it doesn't seem to make a > difference. A second initiator doesn't seem to help much either, with > a single initiator writing to two targets, can now usually get between > 95K and 105K IOPs. ib_srpt's "thread=?" parameter is not a thread count; it indicates whether ib_srpt runs in thread context or not. With thread=0 you avoid one context switch between ib_srpt's thread and scst's threads. You may get better results with thread=0; however, there is a stability risk. >>>>>> >>>>>> My target server (with DAS) contains 8 2.8 GHz CPU cores and can >>>>>> sustain over 200K IOPs locally, but only around 73K IOPs over SRP. >>>> >>>> Is this number from one initiator or multiple? >>> One initiator. At first I thought it might be a limitation of the >>> SRP, and added a second initiator, but the aggregate performance of >>> the two was about equal to that of a single initiator. >> >> Try again with scst_threads=1. 
I expect that you can get ~140K with >> two initiators >> > Unfortunately, I'm nowhere close that high, though I am significantly > higher than before. 2 initiators does seem to reduce the context > switching rate however, which is good. Could you try again with ib_srpt' thread=0? >>>>>> Looking at /proc/interrupts, I see that the mlx_core (comp) >>>>>> device is pushing about 135K Int/s on 1 of 2 CPUs. All CPUs are >>>>>> enabled for that PCI-E slot, but it only ever uses 2 of the CPUs, >>>>>> and only 1 at a time. None of the other CPUs has an interrupt >>>>>> rate more than about 40-50K/s. >>>> The number of interrupt can be cut down if there are more >>>> completions to be processed by sw. ie. please test with multiple >>>> QPs between one initiator vs. your target and multiple initiators >>>> vs. your target > Interrupts are still pretty high (around 160K/s now), but that seems > to not be my bottleneck. Context switching seems to be about 2-2.5 for > every IOP and sometimes less - not perfect, but not horrible either. >> >> ib_srpt process completions in event callback handler. With more QPs >> there are more completions pending per interrupt instead of one >> completion event per interrupt. >> You can have multiple QPs between initiator vs. target by using >> different initiator_id_ext ie. >> echo id_ext=xxx,ioc_guid=yyy,....initiator_ext=1 > >> /sys/class/infiniband_srp/.../add_target >> echo id_ext=xxx,ioc_guid=yyy,....initiator_ext=2 > >> /sys/class/infiniband_srp/.../add_target >> echo id_ext=xxx,ioc_guid=yyy,....initiator_ext=3 > >> /sys/class/infiniband_srp/.../add_target > This doesn't seem to net much of an improvement, though I understand > the reasoning behind it. My hunch is there's another bottleneck now to > look for. 
> > Cameron > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From cameron at harr.org Tue Oct 7 09:22:20 2008 From: cameron at harr.org (Cameron Harr) Date: Tue, 07 Oct 2008 10:22:20 -0600 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> Message-ID: <48EB8CBC.30303@harr.org> An HTML attachment was scrubbed... URL: From arlin.r.davis at intel.com Tue Oct 7 09:32:37 2008 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Tue, 7 Oct 2008 09:32:37 -0700 Subject: [ofa-general] [ANNOUNCE] compat-dapl-1.2.11 and dapl-2.0.14 Release Message-ID: New DAPL releases now available from OFA download page: http://www.openfabrics.org/downloads/dapl/ md5sum: 658be22f64372140c1f7b4f42269b07c compat-dapl-1.2.11.tar.gz md5sum: 743fb0aa04eeaf2cf892f714f9bb870c dapl-2.0.14.tar.gz Summary of changes from last release: v1,v2 - iWarp, 1 iov on rdma_reads, reduce iov's in dtest, add dat.conf entry v1,v2 - add $(DESTDIR) on install/uninstall hooks v2 - add new options to dtestx for UD testing v2 - IB UD fixes in common code/socket cm provider to allow multiple EP support v2 - fix dtest and dtestx build warnings Bug fixes (1228,1229,1232) Vlad, please pick up new packages and install following for OFED 1.4 rc3: compat-dapl-1.2.11-1 compat-dapl-devel-1.2.11-1 dapl-2.0.14-1 dapl-utils-2.0.14-1 dapl-devel-2.0.14-1 dapl-debuginfo-2.0.14-1 -arlin From vst at vlnb.net Tue Oct 7 10:05:09 2008 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Tue, 07 Oct 2008 21:05:09 +0400 Subject: [ofa-general] SRP/mlx4 interrupts 
throttling performance In-Reply-To: <48EB8CBC.30303@harr.org> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> Message-ID: <48EB96C5.2060202@vlnb.net> Cameron Harr wrote: > Bart Van Assche wrote: >> On Mon, Oct 6, 2008 at 5:31 PM, Cameron Harr wrote: >> >>> Thanks for the suggestion. As I look via vmstat, my CSw/s rate is fairly >>> constant around 280K when scst_threads=1 (per Vu's suggestion) and pops up >>> to ~330-340K CSw/s when scst_threads is set to 8. >>> >> >> Which threads are causing all those context switches ? You can find >> this out by making sure that CONFIG_SCHEDSTATS=y is enabled in the >> kernel .config and by running the following bash command: >> >> ( cd /proc && for p in [1-9]* ; do echo "$(<${p}/cmdline) >> $(<${p}/schedstat)" ; done ) | sort -rn -k 3 | head >> >> >> > Thanks for the bash lesson :). It wasn't working how I think you had > planned because many processes have nothing in the cmdline file. So, I > touched up the command a bit, putting in the pid and displaying the > cmdline at the end so as not to mess up the sort: > ( cd /proc && for p in [1-9]* ; do echo -e "$p:\t $(<${p}/schedstat) > \t\t$(<${p}/cmdline)" ; done ) | sort -rn -k 3 | head > > Using that, and watching who's moving up in amount of time waiting, the > main culprits are all of the scst_threads when scst_threads=8, and when > threads=2, the culprit is srpt_thread. After some code examination, I figured out that Vu has chosen a "defensive programming" way ;): always switch to another thread. I personally don't see why srpt_thread is needed at all. Vu, if you think that the processing is too heavy weighted, you should rather use tasklets instead. 
SCST functions scst_cmd_init_done() and scst_rx_data() should be called with context SCST_CONTEXT_DIRECT_ATOMIC from interrupt context or SCST_CONTEXT_DIRECT from thread context. Then amount of context switches per cmd will go to the same reasonable level <=1 as for qla2x00t. > -Cameron > > > ------------------------------------------------------------------------ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From bart.vanassche at gmail.com Tue Oct 7 02:33:37 2008 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Tue, 7 Oct 2008 11:33:37 +0200 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48EA2F42.80008@harr.org> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> Message-ID: On Mon, Oct 6, 2008 at 5:31 PM, Cameron Harr wrote: > Thanks for the suggestion. As I look via vmstat, my CSw/s rate is fairly > constant around 280K when scst_threads=1 (per Vu's suggestion) and pops up > to ~330-340K CSw/s when scst_threads is set to 8. Which threads are causing all those context switches ? You can find this out by making sure that CONFIG_SCHEDSTATS=y is enabled in the kernel .config and by running the following bash command: ( cd /proc && for p in [1-9]* ; do echo "$(<${p}/cmdline) $(<${p}/schedstat)" ; done ) | sort -rn -k 3 | head Bart. 
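[Editor's sketch] The schedstat ranking above can also be written as a short Python script. This is a minimal sketch, not part of the original thread. It assumes each /proc/<pid>/schedstat file holds three whitespace-separated fields (time on CPU in ns, time waiting on a runqueue in ns, number of timeslices), which is the layout the "sort -k 3" variants in this thread appear to rely on. The function names parse_schedstat, scan_proc and top_waiters are hypothetical, chosen only for illustration. Processes with an empty cmdline (kernel threads such as the scst and srpt threads) fall back to /proc/<pid>/comm, which may not exist on older kernels; such entries are simply skipped.

```python
import glob
import os


def parse_schedstat(text):
    """Return (run_ns, wait_ns, timeslices) from one schedstat line.

    Assumes the three-field layout; raises ValueError otherwise.
    """
    run_ns, wait_ns, slices = (int(field) for field in text.split()[:3])
    return run_ns, wait_ns, slices


def scan_proc(proc_root="/proc"):
    """Collect (pid, name, run_ns, wait_ns, timeslices) for every process."""
    rows = []
    for path in glob.glob(os.path.join(proc_root, "[1-9]*")):
        pid = os.path.basename(path)
        if not pid.isdigit():
            continue
        try:
            with open(os.path.join(path, "schedstat")) as f:
                stats = parse_schedstat(f.read())
            with open(os.path.join(path, "cmdline"), "rb") as f:
                raw = f.read()
            # cmdline arguments are NUL-separated
            name = raw.replace(b"\0", b" ").decode(errors="replace").strip()
            if not name:
                # Kernel threads have an empty cmdline; use the short name.
                with open(os.path.join(path, "comm")) as f:
                    name = "[" + f.read().strip() + "]"
        except (OSError, ValueError):
            # Process exited mid-scan, or the stats are unavailable.
            continue
        rows.append((int(pid), name) + stats)
    return rows


def top_waiters(rows, limit=10):
    """Sort by time spent waiting on a runqueue, largest first."""
    return sorted(rows, key=lambda row: row[3], reverse=True)[:limit]
```

Calling top_waiters(scan_proc()) twice a few seconds apart and comparing the wait_ns and timeslices columns shows which threads are accumulating wait time, which is roughly what "watching who's moving up" amounts to.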
From vuhuong at mellanox.com Tue Oct 7 11:08:01 2008 From: vuhuong at mellanox.com (Vu Pham) Date: Tue, 07 Oct 2008 11:08:01 -0700 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48EB96C5.2060202@vlnb.net> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> Message-ID: <48EBA581.4040301@mellanox.com> Vladislav Bolkhovitin wrote: > Cameron Harr wrote: >> Bart Van Assche wrote: >>> On Mon, Oct 6, 2008 at 5:31 PM, Cameron Harr wrote: >>> >>>> Thanks for the suggestion. As I look via vmstat, my CSw/s rate is >>>> fairly >>>> constant around 280K when scst_threads=1 (per Vu's suggestion) and >>>> pops up >>>> to ~330-340K CSw/s when scst_threads is set to 8. >>>> >>> >>> Which threads are causing all those context switches ? You can find >>> this out by making sure that CONFIG_SCHEDSTATS=y is enabled in the >>> kernel .config and by running the following bash command: >>> >>> ( cd /proc && for p in [1-9]* ; do echo "$(<${p}/cmdline) >>> $(<${p}/schedstat)" ; done ) | sort -rn -k 3 | head >>> >>> >>> >> Thanks for the bash lesson :). It wasn't working how I think you had >> planned because many processes have nothing in the cmdline file. So, >> I touched up the command a bit, putting in the pid and displaying the >> cmdline at the end so as not to mess up the sort: >> ( cd /proc && for p in [1-9]* ; do echo -e "$p:\t $(<${p}/schedstat) >> \t\t$(<${p}/cmdline)" ; done ) | sort -rn -k 3 | head >> >> Using that, and watching who's moving up in amount of time waiting, >> the main culprits are all of the scst_threads when scst_threads=8, >> and when threads=2, the culprit is srpt_thread. 
> > After some code examination, I figured out that Vu has chosen a > "defensive programming" way ;): always switch to another thread. > > I personally don't see why srpt_thread is needed at all. Vu, if you > think that the processing is too heavy weighted, you should rather use > tasklets instead. > > SCST functions scst_cmd_init_done() and scst_rx_data() should be > called with context SCST_CONTEXT_DIRECT_ATOMIC from interrupt context > or SCST_CONTEXT_DIRECT from thread context. Then amount of context > switches per cmd will go to the same reasonable level <=1 as for > qla2x00t. You are correct - by default srp runs in thread mode - srp can also run in tasklet mode (parameter thread=0); however, the main trade-off is instability (in heavy tpc-h workload). I already let Cameron know about this. We should have some preliminary numbers from him soon (running with thread=0) and we need some quality time to debug/fix the instability of some special workload > >> -Cameron >> >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general > From cameron at harr.org Tue Oct 7 11:15:07 2008 From: cameron at harr.org (Cameron Harr) Date: Tue, 07 Oct 2008 12:15:07 -0600 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48EBA581.4040301@mellanox.com> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> Message-ID: <48EBA72B.4000909@harr.org> Vu Pham wrote: >>> Using that, and watching who's moving up in 
amount of time waiting, >>> the main culprits are all of the scst_threads when scst_threads=8, >>> and when threads=2, the culprit is srpt_thread. >> >> After some code examination, I figured out that Vu has chosen a >> "defensive programming" way ;): always switch to another thread. >> >> I personally don't see why srpt_thread is needed at all. Vu, if you >> think that the processing is too heavy weighted, you should rather >> use tasklets instead. >> >> SCST functions scst_cmd_init_done() and scst_rx_data() should be >> called with context SCST_CONTEXT_DIRECT_ATOMIC from interrupt context >> or SCST_CONTEXT_DIRECT from thread context. Then amount of context >> switches per cmd will go to the same reasonable level <=1 as for >> qla2x00t. > > You are correct - by default srp runs in thread mode - srp can also run > in tasklet mode (parameter thread=0); however, the main trade-off is > instability (in heavy tpc-h workload). > > I already let Cameron know about this. We should have some preliminary > numbers from him soon (running with thread=0) and we need some quality > time to debug/fix the instability of some special workload > I may be hitting the instability problems and am currently rebooting my initiators again after the test (FIO) went into zombie-mode. When I first set thread=0, with scst_threads=8, my performance was much lower (around 50-60K IOPs) than normal and it appeared that only one target could be written to at a time. I set scst_threads=2 after that and got pretty wide performance differences, between 55K and 85K IOPs. I then brought in another initiator and was seeing numbers as high as 135K IOPs and as low as 70K IOPs, but could also see that a lot of the requests were being coalesced by the time they got to the target. I let it run for a while, and when I came back, the tests were still "running" but no work was being done and the processes couldn't be killed. 
From cameron at harr.org Tue Oct 7 12:51:13 2008 From: cameron at harr.org (Cameron Harr) Date: Tue, 07 Oct 2008 13:51:13 -0600 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48EBA72B.4000909@harr.org> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> Message-ID: <48EBBDB1.1080203@harr.org> Cameron Harr wrote: > I may be hitting the instability problems and am currently rebooting > my initiators again after the test (FIO) went into zombie-mode. > > When I first set thread=0, with scst_threads=8, my performance was > much lower (around 50-60K IOPs) than normal and it appeared that only > one target could be written to at a time. I set scst_threads=2 after > that and got pretty wide performance differences, between 55K and 85K > IOPs. I then brought in another initiator and was seeing numbers as > high as 135K IOPs and as low as 70K IPs, but could also see that a lot > of the requests were being coalesced by the time they got to the > target. I let it run for a while, and when I came back, the tests were > still "running" but no work was being done and the processes couldn't > be killed. > One thing that makes results hard to interpret is that they vary enormously. I've been doing more testing with 3 physical LUNs (instead of two) on the target, srpt_thread=0, and changing between scst_thread=[1,2,3]. With scst_thread=1, I'm fairly low (50K IOPs), while at 2 and three threads, the results are higher, though in all cases, the context switches are low, often less than 1:1. 
My best performance comes with scst_threads=3 again (and are often pegged at 100% CPU), but the results seem to go in phases between the low 80s, the 90s, 110s, 120s and will run for a while in 130s and 140s (in thousands of IOPs). For reference, locally on the three LUNs I get around 130-150K IOPs. But the numbers really vary. Also a little disconcerting is that my average request size on the target has gotten larger. I'm always writing 512B packets, and when I run on one initiator, the average reqsz is around 600-800B. When I add an initiator, the average reqsz basically doubles and is now around 1200 - 1600B. I'm specifying direct IO in the test and scst is configured as blockio (and thus direct IO), but it appears something is cached at some point and seems to be coalesced when another initiator is involved. Does this seem odd or normal? This shows true whether the initiators are writing to different partitions on the same LUN or the same LUN with no partitions. From chu11 at llnl.gov Tue Oct 7 15:38:17 2008 From: chu11 at llnl.gov (Al Chu) Date: Tue, 07 Oct 2008 15:38:17 -0700 Subject: [ofa-general] [infiniband-diags] [trivial] add clarification in perfquery manpage Message-ID: <1223419097.1197.102.camel@cardanus.llnl.gov> Hey Sasha, I ran upon a script that was using 255 for the port number. I had no idea what it meant at the time. Thought a clarification in the perfquery manpage would help. Al -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-note-that-port-255-is-the-same-as-a.patch Type: text/x-patch Size: 831 bytes Desc: not available URL: From chu11 at llnl.gov Tue Oct 7 15:38:27 2008 From: chu11 at llnl.gov (Al Chu) Date: Tue, 07 Oct 2008 15:38:27 -0700 Subject: [ofa-general] [infiniband-diags] [trivial] use 0xff vs. 
255 consistently in perfquery Message-ID: <1223419107.1197.104.camel@cardanus.llnl.gov> Hey Sasha, Nothing fancy. Was just inconsistent. Al -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: 0002-create-use-ALL_PORTS-macro-for-clarity.patch Type: text/x-patch Size: 1666 bytes Desc: not available URL: From chu11 at llnl.gov Tue Oct 7 15:38:51 2008 From: chu11 at llnl.gov (Al Chu) Date: Tue, 07 Oct 2008 15:38:51 -0700 Subject: [ofa-general] [infiniband-diags] [trivial] add more detail to error message on perfquery workaround Message-ID: <1223419131.1197.105.camel@cardanus.llnl.gov> Hey Sasha, Didn't know what some error messages really meant. So added some detail. Al -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: 0003-clarify-error-messages.patch Type: text/x-patch Size: 1175 bytes Desc: not available URL: From chu11 at llnl.gov Tue Oct 7 15:54:05 2008 From: chu11 at llnl.gov (Al Chu) Date: Tue, 07 Oct 2008 15:54:05 -0700 Subject: [ofa-general] [infiniband-diags] add --loop_ports option to perfquery Message-ID: <1223420045.1197.117.camel@cardanus.llnl.gov> Hey Sasha, We have a switch here that does not report the AllPortSelect flag as a capability. It's pretty annoying typing each port on the switch or always having to script around this one oddball switch we have. So I added an option --loop_ports for perfquery. If you want to do something to all the ports on the CA/Switch, but AllPortSelect isn't available, it loops through all the available ports instead. There was already a workaround in the tool for a CA that did not support the AllPortSelect flag. 
I get the feeling the workaround may have been for a specific hardware, so I kept the workaround in there. Al -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: 0004-support-loop_ports-in-perfquery.patch Type: text/x-patch Size: 7599 bytes Desc: not available URL: From vuhuong at mellanox.com Tue Oct 7 15:46:14 2008 From: vuhuong at mellanox.com (Vu Pham) Date: Tue, 07 Oct 2008 15:46:14 -0700 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48EBBDB1.1080203@harr.org> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> Message-ID: <48EBE6B6.4060804@mellanox.com> Cameron Harr wrote: > Cameron Harr wrote: >> I may be hitting the instability problems and am currently rebooting >> my initiators again after the test (FIO) went into zombie-mode. >> >> When I first set thread=0, with scst_threads=8, my performance was >> much lower (around 50-60K IOPs) than normal and it appeared that only >> one target could be written to at a time. I set scst_threads=2 after >> that and got pretty wide performance differences, between 55K and 85K >> IOPs. I then brought in another initiator and was seeing numbers as >> high as 135K IOPs and as low as 70K IPs, but could also see that a >> lot of the requests were being coalesced by the time they got to the >> target. I let it run for a while, and when I came back, the tests >> were still "running" but no work was being done and the processes >> couldn't be killed. 
>> > > One thing that makes results hard to interpret is that they vary > enormously. I've been doing more testing with 3 physical LUNs (instead > of two) on the target, srpt_thread=0, and changing between > scst_thread=[1,2,3]. With scst_thread=1, I'm fairly low (50K IOPs), > while at 2 and three threads, the results are higher, though in all > cases, the context switches are low, often less than 1:1. > Can you test again with srpt_thread=0,1 and scst_threads=1,2,3 in NULLIO mode (with 1, 2, and 3 exported NULLIO LUNs)? > My best performance comes with scst_threads=3 again (and are often > pegged at 100% CPU), but the results seem to go in phases between the > low 80s, the 90s, 110s, 120s and will run for a while in 130s and 140s > (in thousands of IOPs). For reference, locally on the three LUNs I get > around 130-150K IOPs. But the numbers really vary. > > Also a little disconcerting is that my average request size on the > target has gotten larger. I'm always writing 512B packets, and when I > run on one initiator, the average reqsz is around 600-800B. When I add > an initiator, the average reqsz basically doubles and is now around > 1200 - 1600B. I'm specifying direct IO in the test and scst is > configured as blockio (and thus direct IO), but it appears something > is cached at some point and seems to be coalesced when another > initiator is involved. Does this seem odd or normal? This shows true > whether the initiators are writing to different partitions on the same > LUN or the same LUN with no partitions. What I/O scheduler are you running on local storage? Since you are using blockio, you should experiment with the I/O scheduler's tunable parameters (for example, for the deadline scheduler: front_merges, writes_starved, ...).
Please see ~/Documentation/block/*.txt From sashak at voltaire.com Tue Oct 7 16:37:13 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 8 Oct 2008 01:37:13 +0200 Subject: [ofa-general] Re: [infiniband-diags] [trivial] add more detail to error message on perfquery workaround In-Reply-To: <1223419131.1197.105.camel@cardanus.llnl.gov> References: <1223419131.1197.105.camel@cardanus.llnl.gov> Message-ID: <20081007233713.GD7563@sashak.voltaire.com> On 15:38 Tue 07 Oct , Al Chu wrote: > Hey Sasha, > > Didn't know what some error messages really meant. So added some > detail. > > Al > > -- > Albert Chu > chu11 at llnl.gov > Computer Scientist > High Performance Systems Division > Lawrence Livermore National Laboratory > From 86e61a0c7c450b8f0f449f95d2903b3cc5077b6b Mon Sep 17 00:00:00 2001 > From: Albert Chu > Date: Tue, 7 Oct 2008 09:57:56 -0700 > Subject: [PATCH] clarify error messages > > > Signed-off-by: Albert Chu All three applied. Thanks. Sasha From sashak at voltaire.com Tue Oct 7 16:37:34 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 8 Oct 2008 01:37:34 +0200 Subject: [ofa-general] Re: [infiniband-diags] add --loop_ports option to perfquery In-Reply-To: <1223420045.1197.117.camel@cardanus.llnl.gov> References: <1223420045.1197.117.camel@cardanus.llnl.gov> Message-ID: <20081007233734.GE7563@sashak.voltaire.com> On 15:54 Tue 07 Oct , Al Chu wrote: > Hey Sasha, > > We have a switch here that does not report the AllPortSelect flag as a > capability. It's pretty annoying typing each port on the switch or > always having to script around this one oddball switch we have. So I > added an option --loop_ports for perfquery. If you want to do something > to all the ports on the CA/Switch, but AllPortSelect isn't available, it > loops through all the available ports instead. > > There was already a workaround in the tool for a CA that did not support > the AllPortSelect flag. 
I get the feeling the workaround may have been > for a specific hardware, so I kept the workaround in there. > > Al > > -- > Albert Chu > chu11 at llnl.gov > Computer Scientist > High Performance Systems Division > Lawrence Livermore National Laboratory > From 6de5b57f0905ea719b4dc32508140a00704ac466 Mon Sep 17 00:00:00 2001 > From: Albert Chu > Date: Tue, 7 Oct 2008 14:05:54 -0700 > Subject: [PATCH] support --loop_ports in perfquery > > > Signed-off-by: Albert Chu Applied. Thanks. Sasha From sashak at voltaire.com Tue Oct 7 16:40:38 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 8 Oct 2008 01:40:38 +0200 Subject: [ofa-general] Re: [PATCH][TRIVIAL] OpenSM: Display port number in decimal in log messages In-Reply-To: <48E66A72.6020408@obsidianresearch.com> References: <48E66A72.6020408@obsidianresearch.com> Message-ID: <20081007234038.GF7563@sashak.voltaire.com> On 12:54 Fri 03 Oct , Hal Rosenstock wrote: > Sasha, > > Cosmetic patch attached. > > -- Hal > > > OpenSM: Convert display of port numbers to decimal > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Tue Oct 7 16:53:03 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 8 Oct 2008 01:53:03 +0200 Subject: [ofa-general] Re: [PATCHv2][TRIVIAL] OpenSM: Display port number in decimal in log messages In-Reply-To: <48E67B71.8050508@obsidianresearch.com> References: <48E67B71.8050508@obsidianresearch.com> Message-ID: <20081007235303.GG7563@sashak.voltaire.com> Hi Hal, On 14:07 Fri 03 Oct , Hal Rosenstock wrote: > > Cosmetic patch attached. Found some more cases... The patch is broken (it contains parts of the previous patch) and doesn't apply cleanly, so I needed to apply it to another branch and rebase. Also, next time please add a 'Signed-off-by' line. 
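For anyone skimming the quoted patch, here is a minimal sketch of what the format-string change amounts to. It is not OpenSM code: `format_port_msg`, the GUID value, and the message text are made up for illustration, and only standard C library calls are used. The same port number is printed with the old `0x%X` conversion and with the `%u` conversion the patch switches to, while the GUID stays in hex via `PRIx64`.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of OpenSM): format a log-style message
 * for a port. When 'decimal' is non-zero the port number uses %u, as in
 * the patch; otherwise it uses the older 0x%X form. The GUID is always
 * printed as 16 zero-padded hex digits.
 */
static void format_port_msg(char *buf, size_t len, uint64_t guid,
                            uint8_t port_num, int decimal)
{
    assert(buf != NULL && len > 0);

    if (decimal)
        snprintf(buf, len, "port 0x%016" PRIx64 " port num: %u",
                 guid, (unsigned)port_num);
    else
        snprintf(buf, len, "port 0x%016" PRIx64 " port num: 0x%X",
                 guid, (unsigned)port_num);
}
```

With an illustrative port number of 18, the old form renders the port as 0x12 and the new form renders it as 18; the GUID field is the same in both.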
Sasha > > -- Hal > diff --git a/opensm/opensm/osm_drop_mgr.c b/opensm/opensm/osm_drop_mgr.c > index e827c26..8c6e7fb 100644 > --- a/opensm/opensm/osm_drop_mgr.c > +++ b/opensm/opensm/osm_drop_mgr.c > @@ -103,7 +103,7 @@ static void drop_mgr_clean_physp(osm_sm_t * sm, IN osm_physp_t * p_physp) > IB_LINK_ACTIVE) { > OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, > "Forcing new heavy sweep. Remote " > - "port 0x%016" PRIx64 " port num: 0x%X " > + "port 0x%016" PRIx64 " port num: %u " > "was recognized in ACTIVE state\n", > cl_ntoh64(p_remote_physp->port_guid), > p_remote_physp->port_num); > @@ -117,7 +117,7 @@ static void drop_mgr_clean_physp(osm_sm_t * sm, IN osm_physp_t * p_physp) > p_remote_port->discovery_count = 0; > OSM_LOG(sm->p_log, OSM_LOG_DEBUG, > "Resetting discovery count of node: " > - "0x%016" PRIx64 " port num:0x%X\n", > + "0x%016" PRIx64 " port num:%u\n", > cl_ntoh64(osm_node_get_node_guid > (p_remote_physp->p_node)), > p_remote_physp->port_num); > @@ -125,9 +125,9 @@ static void drop_mgr_clean_physp(osm_sm_t * sm, IN osm_physp_t * p_physp) > } > > OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, > - "Unlinking local node 0x%016" PRIx64 ", port 0x%X" > + "Unlinking local node 0x%016" PRIx64 ", port %u" > "\n\t\t\t\tand remote node 0x%016" PRIx64 > - ", port 0x%X\n", > + ", port %u\n", > cl_ntoh64(osm_node_get_node_guid(p_physp->p_node)), > p_physp->port_num, > cl_ntoh64(osm_node_get_node_guid > @@ -139,7 +139,7 @@ static void drop_mgr_clean_physp(osm_sm_t * sm, IN osm_physp_t * p_physp) > } > > OSM_LOG(sm->p_log, OSM_LOG_DEBUG, > - "Clearing node 0x%016" PRIx64 " physical port number 0x%X\n", > + "Clearing node 0x%016" PRIx64 " physical port number %u\n", > cl_ntoh64(osm_node_get_node_guid(p_physp->p_node)), > p_physp->port_num); > > @@ -186,7 +186,8 @@ static void __osm_drop_mgr_remove_port(osm_sm_t * sm, IN osm_port_t * p_port) > if (p_sm != (osm_remote_sm_t *) cl_qmap_end(p_sm_guid_tbl)) { > /* need to remove this item */ > OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, > - "Cleaned 
SM for port guid\n"); > + "Cleaned SM for port guid 0x%016" PRIx64 "\n", > + cl_ntoh64(port_guid)); > > free(p_sm); > } > diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c > index 4c67c3f..45a100e 100644 > --- a/opensm/opensm/osm_helper.c > +++ b/opensm/opensm/osm_helper.c > @@ -780,7 +780,7 @@ osm_dump_port_info(IN osm_log_t * const p_log, > > osm_log(p_log, log_level, > "PortInfo dump:\n" > - "\t\t\t\tport number.............0x%X\n" > + "\t\t\t\tport number.............%u\n" > "\t\t\t\tnode_guid...............0x%016" PRIx64 "\n" > "\t\t\t\tport_guid...............0x%016" PRIx64 "\n" > "\t\t\t\tm_key...................0x%016" PRIx64 "\n" > @@ -790,7 +790,7 @@ osm_dump_port_info(IN osm_log_t * const p_log, > "\t\t\t\tcapability_mask.........0x%X\n" > "\t\t\t\tdiag_code...............0x%X\n" > "\t\t\t\tm_key_lease_period......0x%X\n" > - "\t\t\t\tlocal_port_num..........0x%X\n" > + "\t\t\t\tlocal_port_num..........%u\n" > "\t\t\t\tlink_width_enabled......0x%X\n" > "\t\t\t\tlink_width_supported....0x%X\n" > "\t\t\t\tlink_width_active.......0x%X\n" > @@ -879,7 +879,7 @@ osm_dump_portinfo_record(IN osm_log_t * const p_log, > "\t\t\t\tcapability_mask.........0x%X\n" > "\t\t\t\tdiag_code...............0x%X\n" > "\t\t\t\tm_key_lease_period......0x%X\n" > - "\t\t\t\tlocal_port_num..........0x%X\n" > + "\t\t\t\tlocal_port_num..........%u\n" > "\t\t\t\tlink_width_enabled......0x%X\n" > "\t\t\t\tlink_width_supported....0x%X\n" > "\t\t\t\tlink_width_active.......0x%X\n" > @@ -994,14 +994,14 @@ osm_dump_node_info(IN osm_log_t * const p_log, > "\t\t\t\tbase_version............0x%X\n" > "\t\t\t\tclass_version...........0x%X\n" > "\t\t\t\tnode_type...............%s\n" > - "\t\t\t\tnum_ports...............0x%X\n" > + "\t\t\t\tnum_ports...............%u\n" > "\t\t\t\tsys_guid................0x%016" PRIx64 "\n" > "\t\t\t\tnode_guid...............0x%016" PRIx64 "\n" > "\t\t\t\tport_guid...............0x%016" PRIx64 "\n" > "\t\t\t\tpartition_cap...........0x%X\n" > 
"\t\t\t\tdevice_id...............0x%X\n" > "\t\t\t\trevision................0x%X\n" > - "\t\t\t\tport_num................0x%X\n" > + "\t\t\t\tport_num................%u\n" > "\t\t\t\tvendor_id...............0x%X\n", > p_ni->base_version, > p_ni->class_version, > @@ -1041,14 +1041,14 @@ osm_dump_node_record(IN osm_log_t * const p_log, > "\t\t\t\tbase_version............0x%X\n" > "\t\t\t\tclass_version...........0x%X\n" > "\t\t\t\tnode_type...............%s\n" > - "\t\t\t\tnum_ports...............0x%X\n" > + "\t\t\t\tnum_ports...............%u\n" > "\t\t\t\tsys_guid................0x%016" PRIx64 "\n" > "\t\t\t\tnode_guid...............0x%016" PRIx64 "\n" > "\t\t\t\tport_guid...............0x%016" PRIx64 "\n" > "\t\t\t\tpartition_cap...........0x%X\n" > "\t\t\t\tdevice_id...............0x%X\n" > "\t\t\t\trevision................0x%X\n" > - "\t\t\t\tport_num................0x%X\n" > + "\t\t\t\tport_num................%u\n" > "\t\t\t\tvendor_id...............0x%X\n" > "\t\t\t\tNodeDescription\n" > "\t\t\t\t%s\n", > @@ -1489,8 +1489,8 @@ osm_dump_link_record(IN osm_log_t * const p_log, > osm_log(p_log, log_level, > "Link Record dump:\n" > "\t\t\t\tfrom_lid................%u\n" > - "\t\t\t\tfrom_port_num...........0x%X\n" > - "\t\t\t\tto_port_num.............0x%X\n" > + "\t\t\t\tfrom_port_num...........%u\n" > + "\t\t\t\tto_port_num.............%u\n" > "\t\t\t\tto_lid..................%u\n", > cl_ntoh16(p_lr->from_lid), > p_lr->from_port_num, > @@ -1512,9 +1512,9 @@ osm_dump_switch_info(IN osm_log_t * const p_log, > "\t\t\t\trand_cap................0x%X\n" > "\t\t\t\tmcast_cap...............0x%X\n" > "\t\t\t\tlin_top.................0x%X\n" > - "\t\t\t\tdef_port................0x%X\n" > - "\t\t\t\tdef_mcast_pri_port......0x%X\n" > - "\t\t\t\tdef_mcast_not_port......0x%X\n" > + "\t\t\t\tdef_port................%u\n" > + "\t\t\t\tdef_mcast_pri_port......%u\n" > + "\t\t\t\tdef_mcast_not_port......%u\n" > "\t\t\t\tlife_state..............0x%X\n" > 
"\t\t\t\tlids_per_port...........%u\n" > "\t\t\t\tpartition_enf_cap.......0x%X\n" > @@ -1549,9 +1549,9 @@ osm_dump_switch_info_record(IN osm_log_t * const p_log, > "\t\t\t\trand_cap................0x%X\n" > "\t\t\t\tmcast_cap...............0x%X\n" > "\t\t\t\tlin_top.................0x%X\n" > - "\t\t\t\tdef_port................0x%X\n" > - "\t\t\t\tdef_mcast_pri_port......0x%X\n" > - "\t\t\t\tdef_mcast_not_port......0x%X\n" > + "\t\t\t\tdef_port................%u\n" > + "\t\t\t\tdef_mcast_pri_port......%u\n" > + "\t\t\t\tdef_mcast_not_port......%u\n" > "\t\t\t\tlife_state..............0x%X\n" > "\t\t\t\tlids_per_port...........%u\n" > "\t\t\t\tpartition_enf_cap.......0x%X\n" > @@ -1593,7 +1593,7 @@ osm_dump_pkey_block(IN osm_log_t * const p_log, > "P_Key table dump:\n" > "\t\t\tport_guid...........0x%016" PRIx64 "\n" > "\t\t\tblock_num...........0x%X\n" > - "\t\t\tport_num............0x%X\n\tP_Key Table: %s\n", > + "\t\t\tport_num............%u\n\tP_Key Table: %s\n", > cl_ntoh64(port_guid), block_num, port_num, buf_line); > } > } > @@ -1621,8 +1621,8 @@ osm_dump_slvl_map_table(IN osm_log_t * const p_log, > osm_log(p_log, log_level, > "SLtoVL dump:\n" > "\t\t\tport_guid............0x%016" PRIx64 "\n" > - "\t\t\tin_port_num..........0x%X\n" > - "\t\t\tout_port_num.........0x%X\n\tSL: | %s\n\tVL: | %s\n", > + "\t\t\tin_port_num..........%u\n" > + "\t\t\tout_port_num.........%u\n\tSL: | %s\n\tVL: | %s\n", > cl_ntoh64(port_guid), > in_port_num, out_port_num, buf_line1, buf_line2); > } > @@ -1651,7 +1651,7 @@ osm_dump_vl_arb_table(IN osm_log_t * const p_log, > osm_log(p_log, log_level, > "VLArb dump:\n" "\t\t\tport_guid...........0x%016" > PRIx64 "\n" "\t\t\tblock_num...........0x%X\n" > - "\t\t\tport_num............0x%X\n\tVL : | %s\n\tWEIGHT:| %s\n", > + "\t\t\tport_num............%u\n\tVL : | %s\n\tWEIGHT:| %s\n", > cl_ntoh64(port_guid), block_num, port_num, buf_line1, > buf_line2); > } > diff --git a/opensm/opensm/osm_link_mgr.c b/opensm/opensm/osm_link_mgr.c > index 
452d151..98870cb 100644 > --- a/opensm/opensm/osm_link_mgr.c > +++ b/opensm/opensm/osm_link_mgr.c > @@ -377,7 +377,7 @@ __osm_link_mgr_process_node(osm_sm_t * sm, > if (link_state != IB_LINK_NO_CHANGE && > link_state <= current_state) > OSM_LOG(sm->p_log, OSM_LOG_DEBUG, > - "Physical port 0x%X already %s. Skipping\n", > + "Physical port %u already %s. Skipping\n", > p_physp->port_num, > ib_get_port_state_str(current_state)); > else if (__osm_link_mgr_set_physp_pi(sm, p_physp, link_state)) > diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c > index fc8533d..c4cd632 100644 > --- a/opensm/opensm/osm_mcast_mgr.c > +++ b/opensm/opensm/osm_mcast_mgr.c > @@ -628,7 +628,7 @@ static osm_mtree_node_t *__osm_mcast_mgr_branch(osm_sm_t * sm, > */ > if (depth > 1) { > OSM_LOG(sm->p_log, OSM_LOG_DEBUG, > - "Adding upstream port 0x%X\n", upstream_port); > + "Adding upstream port %u\n", upstream_port); > > CL_ASSERT(upstream_port); > osm_mcast_tbl_set(p_tbl, mlid_ho, upstream_port); > @@ -662,7 +662,7 @@ static osm_mtree_node_t *__osm_mcast_mgr_branch(osm_sm_t * sm, > continue; /* No routes down this port. 
*/ > > OSM_LOG(sm->p_log, OSM_LOG_DEBUG, > - "Routing %zu destinations via switch port 0x%X\n", > + "Routing %zu destinations via switch port %u\n", > count, i); > > /* > @@ -716,7 +716,7 @@ static osm_mtree_node_t *__osm_mcast_mgr_branch(osm_sm_t * sm, > > OSM_LOG(sm->p_log, OSM_LOG_DEBUG, > "Found leaf for port 0x%016" PRIx64 > - " on switch port 0x%X\n", > + " on switch port %u\n", > cl_ntoh64(osm_port_get_guid (p_wobj->p_port)), > i); > > diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c > index 86710d1..a37ce0a 100644 > --- a/opensm/opensm/osm_node_info_rcv.c > +++ b/opensm/opensm/osm_node_info_rcv.c > @@ -217,7 +217,7 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm, > port_num != 0 && cl_qmap_count(&sm->p_subn->sw_guid_tbl) == 0) { > OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, > "Duplicate GUID found by link from a port to itself:" > - "node 0x%" PRIx64 ", port number 0x%X\n", > + "node 0x%" PRIx64 ", port number %u\n", > cl_ntoh64(osm_node_get_node_guid(p_node)), port_num); > p_physp = osm_node_get_physp_ptr(p_node, port_num); > osm_dump_dr_path(sm->p_log, > @@ -235,8 +235,8 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm, > > OSM_LOG(sm->p_log, OSM_LOG_DEBUG, > "Creating new link between:\n\t\t\t\tnode 0x%" PRIx64 > - ", port number 0x%X and\n\t\t\t\tnode 0x%" PRIx64 > - ", port number 0x%X\n", > + ", port number %u and\n\t\t\t\tnode 0x%" PRIx64 > + ", port number %u\n", > cl_ntoh64(osm_node_get_node_guid(p_node)), port_num, > cl_ntoh64(p_ni_context->node_guid), p_ni_context->port_num); > > diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c > index 8d292ab..a168f89 100644 > --- a/opensm/opensm/osm_perfmgr.c > +++ b/opensm/opensm/osm_perfmgr.c > @@ -219,7 +219,7 @@ osm_perfmgr_mad_send_err_callback(void *bind_context, osm_madw_t * p_madw) > p_mon_node = (__monitored_node_t *) p_node; > > OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C02: %s (0x%" PRIx64 > - ") port %d\n", p_mon_node->name, p_mon_node->guid, port); > + ") port 
%u\n", p_mon_node->name, p_mon_node->guid, port); > > if (pm->subn->opt.perfmgr_redir && p_madw->status == IB_TIMEOUT) { > /* First, find the node in the monitored map */ > @@ -228,8 +228,8 @@ osm_perfmgr_mad_send_err_callback(void *bind_context, osm_madw_t * p_madw) > if (port > p_mon_node->redir_tbl_size) { > cl_plock_release(pm->lock); > OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C16: " > - "Invalid port num %d for %s (GUID 0x%016" > - PRIx64 ") num ports %d\n", port, p_mon_node->name, > + "Invalid port num %u for %s (GUID 0x%016" > + PRIx64 ") num ports %u\n", port, p_mon_node->name, > p_mon_node->guid, p_mon_node->redir_tbl_size); > goto Exit; > } > diff --git a/opensm/opensm/osm_pkey_rcv.c b/opensm/opensm/osm_pkey_rcv.c > index 9e09806..d845c20 100644 > --- a/opensm/opensm/osm_pkey_rcv.c > +++ b/opensm/opensm/osm_pkey_rcv.c > @@ -127,7 +127,7 @@ void osm_pkey_rcv_process(IN void *context, IN void *data) > */ > if (!p_physp) { > OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 4807: " > - "Got invalid port number 0x%X\n", port_num); > + "Got invalid port number %u\n", port_num); > goto Exit; > } > > diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c > index 54a5fd1..48a12f9 100644 > --- a/opensm/opensm/osm_port.c > +++ b/opensm/opensm/osm_port.c > @@ -322,7 +322,7 @@ osm_physp_calc_link_mtu(IN osm_log_t * p_log, IN const osm_physp_t * p_physp) > ib_port_info_get_mtu_cap(&p_remote_physp->port_info); > > OSM_LOG(p_log, OSM_LOG_DEBUG, > - "Remote port 0x%016" PRIx64 " port = 0x%X : " > + "Remote port 0x%016" PRIx64 " port = %u : " > "MTU = %u. This Port MTU: %u\n", > cl_ntoh64(osm_physp_get_port_guid(p_remote_physp)), > osm_physp_get_port_num(p_remote_physp), > @@ -334,8 +334,8 @@ osm_physp_calc_link_mtu(IN osm_log_t * p_log, IN const osm_physp_t * p_physp) > > OSM_LOG(p_log, OSM_LOG_VERBOSE, > "MTU mismatch between ports." > - "\n\t\t\t\tPort 0x%016" PRIx64 ", port 0x%X" > - " and port 0x%016" PRIx64 ", port 0x%X." 
> + "\n\t\t\t\tPort 0x%016" PRIx64 ", port %u" > + " and port 0x%016" PRIx64 ", port %u." > "\n\t\t\t\tUsing lower MTU of %u\n", > cl_ntoh64(osm_physp_get_port_guid(p_physp)), > osm_physp_get_port_num(p_physp), > @@ -769,7 +769,7 @@ osm_physp_set_pkey_tbl(IN osm_log_t * p_log, > if (block_num >= max_blocks) { > OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4108: " > "Got illegal set for block number:%u " > - "For GUID: %" PRIx64 " port number:0x%X\n", > + "For GUID: %" PRIx64 " port number:%u\n", > block_num, > cl_ntoh64(p_physp->p_node->node_info.node_guid), > p_physp->port_num); > diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c > index a820069..73afd8e 100644 > --- a/opensm/opensm/osm_port_info_rcv.c > +++ b/opensm/opensm/osm_port_info_rcv.c > @@ -235,9 +235,9 @@ __osm_pi_rcv_process_switch_port(IN osm_sm_t * sm, > > OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, > "Unlinking local node 0x%" PRIx64 > - ", port 0x%X" > + ", port %u" > "\n\t\t\t\tand remote node 0x%" PRIx64 > - ", port 0x%X\n", > + ", port %u\n", > cl_ntoh64(osm_node_get_node_guid > (p_node)), port_num, > cl_ntoh64(osm_node_get_node_guid > @@ -292,13 +292,13 @@ __osm_pi_rcv_process_switch_port(IN osm_sm_t * sm, > ib_get_err_str(status)); > } else > OSM_LOG(sm->p_log, OSM_LOG_DEBUG, > - "Skipping SMP responder port 0x%X\n", > + "Skipping SMP responder port %u\n", > p_pi->local_port_num); > break; > > default: > OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0F03: " > - "Unknown link state = %u, port = 0x%X\n", > + "Unknown link state = %u, port = %u\n", > ib_port_info_get_port_state(p_pi), > p_pi->local_port_num); > break; > @@ -488,7 +488,7 @@ osm_pi_rcv_process_set(IN osm_sm_t * sm, IN osm_node_t * const p_node, > > OSM_LOG(sm->p_log, OSM_LOG_DEBUG, > "Received logical SetResp() for GUID 0x%" PRIx64 > - ", port num 0x%X" > + ", port num %u" > "\n\t\t\t\tfor parent node GUID 0x%" PRIx64 > " TID 0x%" PRIx64 "\n", > cl_ntoh64(port_guid), port_num, > @@ -598,7 +598,7 @@ void osm_pi_rcv_process(IN 
void *context, IN void *data) > most likely due to a subnet sweep in progress. > */ > OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, > - "Discovered port num 0x%X with GUID 0x%" PRIx64 > + "Discovered port num %u with GUID 0x%" PRIx64 > " for parent node GUID 0x%" PRIx64 > ", TID 0x%" PRIx64 "\n", > port_num, cl_ntoh64(port_guid), > @@ -613,7 +613,7 @@ void osm_pi_rcv_process(IN void *context, IN void *data) > */ > if (!p_physp) { > OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, > - "Initializing port number 0x%X\n", port_num); > + "Initializing port number %u\n", port_num); > p_physp = &p_node->physp_table[port_num]; > osm_physp_init(p_physp, > port_guid, > diff --git a/opensm/opensm/osm_sa_link_record.c b/opensm/opensm/osm_sa_link_record.c > index 9b1ad90..8e74297 100644 > --- a/opensm/opensm/osm_sa_link_record.c > +++ b/opensm/opensm/osm_sa_link_record.c > @@ -194,8 +194,8 @@ __osm_lr_rcv_get_physp_link(IN osm_sa_t * sa, > goto Exit; > > OSM_LOG(sa->p_log, OSM_LOG_DEBUG, "Acquiring link record\n" > - "\t\t\t\tsrc port 0x%" PRIx64 " (port 0x%X)" > - ", dest port 0x%" PRIx64 " (port 0x%X)\n", > + "\t\t\t\tsrc port 0x%" PRIx64 " (port %u)" > + ", dest port 0x%" PRIx64 " (port %u)\n", > cl_ntoh64(osm_physp_get_port_guid(p_src_physp)), src_port_num, > cl_ntoh64(osm_physp_get_port_guid(p_dest_physp)), > dest_port_num); > diff --git a/opensm/opensm/osm_sa_pkey_record.c b/opensm/opensm/osm_sa_pkey_record.c > index bba01b4..b694c8c 100644 > --- a/opensm/opensm/osm_sa_pkey_record.c > +++ b/opensm/opensm/osm_sa_pkey_record.c > @@ -93,7 +93,7 @@ __osm_sa_pkey_create(IN osm_sa_t * sa, > > OSM_LOG(sa->p_log, OSM_LOG_DEBUG, > "New P_Key table for: port 0x%016" PRIx64 > - ", lid %u, port 0x%X Block:%u\n", > + ", lid %u, port %u Block:%u\n", > cl_ntoh64(osm_physp_get_port_guid(p_physp)), > cl_ntoh16(lid), osm_physp_get_port_num(p_physp), block); > > diff --git a/opensm/opensm/osm_sa_portinfo_record.c b/opensm/opensm/osm_sa_portinfo_record.c > index b263ce8..2ac2611 100644 > --- 
a/opensm/opensm/osm_sa_portinfo_record.c > +++ b/opensm/opensm/osm_sa_portinfo_record.c > @@ -94,7 +94,7 @@ __osm_pir_rcv_new_pir(IN osm_sa_t * sa, > > OSM_LOG(sa->p_log, OSM_LOG_DEBUG, > "New PortInfoRecord: port 0x%016" PRIx64 > - ", lid %u, port 0x%X\n", > + ", lid %u, port %u\n", > cl_ntoh64(osm_physp_get_port_guid(p_physp)), > cl_ntoh16(lid), osm_physp_get_port_num(p_physp)); > > diff --git a/opensm/opensm/osm_sa_slvl_record.c b/opensm/opensm/osm_sa_slvl_record.c > index 46b66a6..886cb9b 100644 > --- a/opensm/opensm/osm_sa_slvl_record.c > +++ b/opensm/opensm/osm_sa_slvl_record.c > @@ -100,7 +100,7 @@ __osm_sa_slvl_create(IN osm_sa_t * sa, > > OSM_LOG(sa->p_log, OSM_LOG_DEBUG, > "New SLtoVL Map for: OUT port 0x%016" PRIx64 > - ", lid 0x%X, port 0x%X to In Port:%u\n", > + ", lid 0x%X, port %u to In Port:%u\n", > cl_ntoh64(osm_physp_get_port_guid(p_physp)), > cl_ntoh16(lid), osm_physp_get_port_num(p_physp), in_port_idx); > > diff --git a/opensm/opensm/osm_sa_vlarb_record.c b/opensm/opensm/osm_sa_vlarb_record.c > index e2aafc1..ae59b13 100644 > --- a/opensm/opensm/osm_sa_vlarb_record.c > +++ b/opensm/opensm/osm_sa_vlarb_record.c > @@ -100,7 +100,7 @@ __osm_sa_vl_arb_create(IN osm_sa_t * sa, > > OSM_LOG(sa->p_log, OSM_LOG_DEBUG, > "New VLArbitration for: port 0x%016" PRIx64 > - ", lid %u, port 0x%X Block:%u\n", > + ", lid %u, port %u Block:%u\n", > cl_ntoh64(osm_physp_get_port_guid(p_physp)), > cl_ntoh16(lid), osm_physp_get_port_num(p_physp), block); > > diff --git a/opensm/opensm/osm_slvl_map_rcv.c b/opensm/opensm/osm_slvl_map_rcv.c > index 000c2a6..e7f1cd2 100644 > --- a/opensm/opensm/osm_slvl_map_rcv.c > +++ b/opensm/opensm/osm_slvl_map_rcv.c > @@ -135,7 +135,7 @@ void osm_slvl_rcv_process(IN void *context, IN void *p_data) > */ > if (!p_physp) { > OSM_LOG(sm->p_log, OSM_LOG_ERROR, > - "Got invalid port number 0x%X\n", out_port_num); > + "Got invalid port number %u\n", out_port_num); > goto Exit; > } > > diff --git a/opensm/opensm/osm_trap_rcv.c 
b/opensm/opensm/osm_trap_rcv.c > index 0213915..b7c2c77 100644 > --- a/opensm/opensm/osm_trap_rcv.c > +++ b/opensm/opensm/osm_trap_rcv.c > @@ -129,7 +129,7 @@ osm_trap_rcv_aging_tracker_callback(IN uint64_t key, > p_physp = get_physp_by_lid_and_num(sm, lid, port_num); > if (!p_physp) > OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, > - "Cannot find port num:0x%X with lid:%u\n", > + "Cannot find port num:%u with lid:%u\n", > port_num, lid); > /* make sure the physp is still valid */ > /* If the health port was false - set it to true */ > diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c > index bde0c29..be8e724 100644 > --- a/opensm/opensm/osm_ucast_mgr.c > +++ b/opensm/opensm/osm_ucast_mgr.c > @@ -152,7 +152,7 @@ __osm_ucast_mgr_process_neighbor(IN osm_ucast_mgr_t * const p_mgr, > > OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, > "Node 0x%" PRIx64 ", remote node 0x%" PRIx64 > - ", port 0x%X, remote port 0x%X\n", > + ", port %u, remote port %u\n", > cl_ntoh64(osm_node_get_node_guid(p_this_sw->p_node)), > cl_ntoh64(osm_node_get_node_guid(p_remote_sw->p_node)), > port_num, remote_port_num); > @@ -273,7 +273,7 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr, > osm_physp_t *p = osm_node_get_physp_ptr(p_sw->p_node, port); > > OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, > - "Routing LID %u to port 0x%X" > + "Routing LID %u to port %u" > " for switch 0x%" PRIx64 "\n", > lid_ho, port, cl_ntoh64(node_guid)); > > diff --git a/opensm/opensm/osm_vl_arb_rcv.c b/opensm/opensm/osm_vl_arb_rcv.c > index 725cc3b..674c2d6 100644 > --- a/opensm/opensm/osm_vl_arb_rcv.c > +++ b/opensm/opensm/osm_vl_arb_rcv.c > @@ -132,7 +132,7 @@ void osm_vla_rcv_process(IN void *context, IN void *data) > */ > if (!p_physp) { > OSM_LOG(sm->p_log, OSM_LOG_ERROR, > - "Got invalid port number 0x%X\n", port_num); > + "Got invalid port number %u\n", port_num); > goto Exit; > } > From sashak at voltaire.com Tue Oct 7 17:25:26 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 8 Oct 
2008 02:25:26 +0200 Subject: [ofa-general] questions about opensm and unmanaged switch In-Reply-To: References: <20081003155247.GE6566@sashak.voltaire.com> Message-ID: <20081008002526.GH7563@sashak.voltaire.com> On 11:45 Fri 03 Oct , Yicheng Jia wrote: > err:4294967295. I'm running it on QNX With QNX you should handle somehow file related errors. > so ibnetdiscover is not available so > far. I attach the verbose output of opensm during startup. As you can see, > it start to receive SMP from other HCAs after several heavy sweep. It > looks like the switch block them at the beginning? Yes, it looks exactly so. I would recommend to investigate discovery process in deep. The problem can be with switch (but it is strange that initialized switch doesn't forward direct queries). Some tools can be useful - smpquery (query NodeInfo by direct paths). Maybe trying with linux box could be useful too. Sasha From sashak at voltaire.com Tue Oct 7 17:32:39 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 8 Oct 2008 02:32:39 +0200 Subject: [ofa-general] ***SPAM*** ibdm network topology format In-Reply-To: References: <829ded920809292304k3ffc78c0m556efbdd7d35c528@mail.gmail.com> <20080930121252.GA7396@sashak.voltaire.com> <829ded920810010207r475d82abu269d47cd3baddb3f@mail.gmail.com> <20081001203813.GL7396@sashak.voltaire.com> <20081002022430.GQ7396@sashak.voltaire.com> <20081002170033.GI25831@sashak.voltaire.com> Message-ID: <20081008003239.GI7563@sashak.voltaire.com> Hi Hal, On 08:26 Mon 06 Oct , Hal Rosenstock wrote: > > > > Me too. See below. > > > >>> > Somehow it works without ibsim - so I suspect user_mad handles it. > >>> > > >>> > (Hal, could you clarify?) > >>> > >>> The kernel (user_mad/mad) does not change the requested registrations > >>> but I'm not sure I understand the question you are asking to be > >>> clarified. Is that what you're asking ? > >> > >> ibis works somehow with real stack. It registers 0x1 class only and > >> uses direct routing SMPs. 
Do you have any idea about why > >> (osm_vendor_idumad and/or libibumad don't help)? > > > > libibumad umad_register does not do anything that would affect this > > either. I can only conclude there must be something in ibutils that > > fixes this if it does work with the real stack. It shouldn't be too > > hard to track down where that registration for class 0x81 comes from. > > Are you sure this is the only registration and not DR class too ? I'm not sure I understood the question. But I was asking about the registration (or, more accurately, the non-registration) of class 0x81 by ibutils and by any lower layer up to the kernel. > That's the first thing to confirm or maybe you've already confirmed > this and it wasn't clear to me in what you wrote. If so, I have a > theory about what could be occurring. It may be the case that it is an > effect of the kernel MAD layer in that a MAD agent can send any class > and when using request/response it matches on transaction ID which > contains the MAD agent. Unsolicited messages on that other class > wouldn't get through though. I just ran a simple test of this and that > appears to be the case. This could explain the phenomenon. And then it seems that a similar mechanism should be implemented in umad2sim. Sasha From ctung at neteffect.com Tue Oct 7 17:54:34 2008 From: ctung at neteffect.com (Chien Tung) Date: Tue, 7 Oct 2008 19:54:34 -0500 Subject: [ofa-general] [PATCH] RDMA/nes: correct error_module bit mask Message-ID: <200810080054.m980sYXi029769@velma.neteffect.com> From: Chien Tung RDMA/nes: correct error_module bit mask error_module is 5 bits wide not 4. The corresponding crit_error_count array is correct with 32 entries.
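To illustrate the width issue described above, here is a minimal standalone sketch, not the driver code itself. It assumes only what the patch states: the module field sits in bits 8..12 of the debug-error word. The helper names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: extract the 5-bit module field (bits 8..12)
 * from a debug-error word, using the corrected 0x1F00 mask. */
static inline uint16_t error_module_5bit(uint32_t debug_error)
{
	return (uint16_t)((debug_error & 0x1F00u) >> 8);
}

/* Same extraction with the old 4-bit mask, kept for comparison.
 * Bit 12 is dropped, so modules 16..31 alias onto 0..15. */
static inline uint16_t error_module_4bit(uint32_t debug_error)
{
	return (uint16_t)((debug_error & 0x0F00u) >> 8);
}
```

With a debug-error word of 0x1100, the 5-bit version yields module 17 while the 4-bit version yields module 1, so errors from the upper sixteen modules would be counted against the wrong entry.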
Signed-off-by: Chien Tung -- drivers/infiniband/hw/nes/nes_hw.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c index eca3520..7c49cc8 100644 --- a/drivers/infiniband/hw/nes/nes_hw.c +++ b/drivers/infiniband/hw/nes/nes_hw.c @@ -2008,7 +2008,7 @@ static void process_critical_error(struct nes_device *nesdev) 0x01010000 | (debug_error & 0x0000ffff)); if (crit_err_count++ > 10) nes_write_indexed(nesdev, NES_IDX_DEBUG_ERROR_MASKS1, 1 << 0x17); - error_module = (u16) (debug_error & 0x0F00) >> 8; + error_module = (u16) (debug_error & 0x1F00) >> 8; if (++nesdev->nesadapter->crit_error_count[error_module-1] >= nes_max_critical_error_count) { printk(KERN_ERR PFX "Masking off critical error for module " From sashak at voltaire.com Tue Oct 7 18:21:49 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 8 Oct 2008 03:21:49 +0200 Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer overflow Message-ID: <20081008012149.GK7563@sashak.voltaire.com> Lash first overflows its buffer and then check for the size (based on number VLs used). Fix the check order. 
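The ordering problem described above can be sketched in isolation as follows. This is a simplified illustration, not the LASH code: the structure, the MAX_LANES capacity and the function name are all hypothetical, and only the idea of testing the lane limit before writing per-lane state is taken from the description.

```c
#define MAX_LANES 8	/* hypothetical capacity of the per-lane array */

struct lane_state {
	unsigned lanes_needed;			/* lanes claimed so far */
	unsigned vl_min;			/* lanes available (<= MAX_LANES) */
	unsigned mst_in_lane[MAX_LANES];	/* per-lane bookkeeping */
};

/* Claim one more lane. Returns 0 on success, -1 if no lane is left.
 * The limit is checked first; the array is written only afterwards. */
static int claim_lane(struct lane_state *s)
{
	if (s->lanes_needed + 1 > s->vl_min || s->lanes_needed >= MAX_LANES)
		return -1;
	s->mst_in_lane[s->lanes_needed]++;
	s->lanes_needed++;
	return 0;
}
```

Doing the test after the write, as in the code being fixed, means the out-of-range index has already been used by the time the error path is taken.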
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_ucast_lash.c | 13 ++++++------- 1 files changed, 6 insertions(+), 7 deletions(-) diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index ce3982f..03cfc1f 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -979,6 +979,12 @@ static int lash_core(lash_t * p_lash) switches[dest_switch]->routing_table[i].lane = v_lane; if (cycle_found == 1 || cycle_found2 == 1) { + if (lanes_needed + 1 > p_lash->vl_min) { + lanes_needed++; + goto Error_Not_Enough_Lanes; + } else + lanes_needed++; + generate_cdg_for_sp(p_lash, i, dest_switch, v_lane); generate_cdg_for_sp(p_lash, dest_switch, i, v_lane); @@ -987,13 +993,6 @@ static int lash_core(lash_t * p_lash) set_temp_depend_to_permanent_for_sp(p_lash, dest_switch, i, v_lane); - if (lanes_needed + 1 > p_lash->vl_min) { - lanes_needed++; - goto Error_Not_Enough_Lanes; - } else - lanes_needed++; - - // goto error exit with message p_lash->num_mst_in_lane[v_lane]++; p_lash->num_mst_in_lane[v_lane]++; } -- 1.6.0.1.196.g01914 From sashak at voltaire.com Tue Oct 7 18:22:48 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 8 Oct 2008 03:22:48 +0200 Subject: [ofa-general] [PATCH] opensm/main.c: trivial usage message formatting fix Message-ID: <20081008012248.GL7563@sashak.voltaire.com> Add new line at end of LMC value check message - "ERROR: LMC must be 7 or less.". 
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/main.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index 2f53157..65adb9a 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -773,7 +773,7 @@ int main(int argc, char *argv[]) temp = strtol(optarg, NULL, 0); if (temp > 7) { fprintf(stderr, - "ERROR: LMC must be 7 or less."); + "ERROR: LMC must be 7 or less.\n"); return (-1); } opt.lmc = (uint8_t) temp; -- 1.6.0.1.196.g01914 From mashirle at us.ibm.com Tue Oct 7 19:44:52 2008 From: mashirle at us.ibm.com (Shirley Ma) Date: Tue, 07 Oct 2008 19:44:52 -0700 Subject: [ofa-general] Re: openSM for supporting IPv6 SNM MGIDs consolidation In-Reply-To: <1223051358.8387.26.camel@IBM-29AB850785D.beaverton.ibm.com> References: <1223007639.8387.22.camel@IBM-29AB850785D.beaverton.ibm.com> <20081003153900.GC6566@sashak.voltaire.com> <1223049716.8387.24.camel@IBM-29AB850785D.beaverton.ibm.com> <20081003160448.GF6566@sashak.voltaire.com> <1223051358.8387.26.camel@IBM-29AB850785D.beaverton.ibm.com> Message-ID: <1223433892.24201.16.camel@IBM-29AB850785D.beaverton.ibm.com> Hello Sasha, The customer installed management packages from OFED download on the openSM node only and kept the rest of nodes with OFED-1.3 stack. They also tried the management packages from OFED-1.4. They noticed the difference. And "saquery -m" didn't work with existing OFED-1.3 stack. Could you please give some summary for the difference between these two openSM versions? I didn't install it, I just peeked the source code, it looks like the OFED-1.4 will have one IB MGC with ff10601b...01ff000000, the management packages for July release is ff10601b...01ffXXXXXX (first node IPv6 SNM 24 bit interface id). Is that right? 
Thanks Shirley From sashak at voltaire.com Tue Oct 7 20:22:15 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 8 Oct 2008 05:22:15 +0200 Subject: [ofa-general] Re: openSM for supporting IPv6 SNM MGIDs consolidation In-Reply-To: <1223433892.24201.16.camel@IBM-29AB850785D.beaverton.ibm.com> References: <1223007639.8387.22.camel@IBM-29AB850785D.beaverton.ibm.com> <20081003153900.GC6566@sashak.voltaire.com> <1223049716.8387.24.camel@IBM-29AB850785D.beaverton.ibm.com> <20081003160448.GF6566@sashak.voltaire.com> <1223051358.8387.26.camel@IBM-29AB850785D.beaverton.ibm.com> <1223433892.24201.16.camel@IBM-29AB850785D.beaverton.ibm.com> Message-ID: <20081008032215.GR7563@sashak.voltaire.com> Hi Shirley, On 19:44 Tue 07 Oct , Shirley Ma wrote: > > They also tried the management packages from OFED-1.4. They noticed the > difference. And "saquery -m" didn't work with existing OFED-1.3 stack. We changed (actually fixed) default SM_Key byte order. And saquery has '--smkey' option now. I guess: saquery --smkey 0x0100000000000000 -m . should work against old OpenSM. Alternatively SM_Key value can be fixed in OpenSM config file (to be 0x1 instead of 0x0100000000000000). > Could you please give some summary for the difference between these two > openSM versions? AFAIR it is only "compatibility" issue. > I didn't install it, I just peeked the source code, it looks like the > OFED-1.4 will have one IB MGC with ff10601b...01ff000000, the management > packages for July release is ff10601b...01ffXXXXXX (first node IPv6 SNM > 24 bit interface id). Is that right? Yes. 
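The two SM_Key values quoted in this reply, 0x1 and 0x0100000000000000, are byte-swapped forms of each other when read as 64-bit quantities. The sketch below only demonstrates that relationship; the function name is illustrative and nothing here is taken from the OpenSM or saquery sources.

```c
#include <stdint.h>

/* Reverse the byte order of a 64-bit value (portable, no builtins). */
static uint64_t swap64(uint64_t v)
{
	uint64_t r = 0;
	int i;

	for (i = 0; i < 8; i++) {
		r = (r << 8) | (v & 0xFF);	/* move lowest byte of v into r */
		v >>= 8;
	}
	return r;
}
```

Under that reading, passing 0x0100000000000000 to the newer saquery would present the key in the byte order that an older OpenSM configured with 0x1 expects.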
Sasha From locore64 at alkyltechnology.com Tue Oct 7 20:53:16 2008 From: locore64 at alkyltechnology.com (Toru Nishimura) Date: Wed, 8 Oct 2008 12:53:16 +0900 Subject: [ofa-general] ***SPAM*** PPC64 platform and InfiniBand HW Message-ID: <2763777062A04D64A0CEBA827536F4B1@tpad2> Hi, OpenFabrics guys, I'm new to HPC technology domain and one of engineering team being asked to research a clustering computer built with OFED and PPC combination. My initial Qs are; - I can see PPC64 is one of supported architecture. What are PPC64 platform in use, for development and in-field applications? We need proven / known-to-work PPC64 platform(s) as a solid foundation to start the entire project. Specific model names are wanted. - What IB HW are in use for PPC64 combination? We are consulted to use Mellanox LX PCIe card. Is the product mature enough for PPC64 platforms? Toru Nishimura / ALKYL Technology From locore64 at alkyltechnology.com Tue Oct 7 21:00:21 2008 From: locore64 at alkyltechnology.com (Toru Nishimura) Date: Wed, 8 Oct 2008 13:00:21 +0900 Subject: [ofa-general] ***SPAM*** PPC64 platform and InfiniBand HW Message-ID: <0A837F7CBDEC43558A61CEB6440AD8CA@tpad2> Hi, OpenFabrics guys, I'm new to HPC technology domain and one of engineering team being asked to research a clustering computer built with OFED and PPC combination. My Qs are; - I can see PPC64 is one of supported architecture. What are PPC64 platform in use, for development and real field applications? We need proven / known-to-work PPC64 platform(s) as a solid foundation to start the entire project. Specific model names are wanted. - What IB HW are in use for PPC64 combination? We are consulted to use Mellanox LX PCIe card. Is the product mature enough for PPC64 platforms? - Our initial plan is to use PowerMac G5 (PCIe model) and Mellanox LX card. Is it a wise selection for a development foundation? If not, what's the better (less risky) alternative available this moment? 
Toru Nishimura / ALKYL Technology From mashirle at us.ibm.com Tue Oct 7 21:11:12 2008 From: mashirle at us.ibm.com (Shirley Ma) Date: Tue, 07 Oct 2008 21:11:12 -0700 Subject: [ofa-general] Re: openSM for supporting IPv6 SNM MGIDs consolidation In-Reply-To: <20081008032215.GR7563@sashak.voltaire.com> References: <1223007639.8387.22.camel@IBM-29AB850785D.beaverton.ibm.com> <20081003153900.GC6566@sashak.voltaire.com> <1223049716.8387.24.camel@IBM-29AB850785D.beaverton.ibm.com> <20081003160448.GF6566@sashak.voltaire.com> <1223051358.8387.26.camel@IBM-29AB850785D.beaverton.ibm.com> <1223433892.24201.16.camel@IBM-29AB850785D.beaverton.ibm.com> <20081008032215.GR7563@sashak.voltaire.com> Message-ID: <1223439072.24201.19.camel@IBM-29AB850785D.beaverton.ibm.com> On Wed, 2008-10-08 at 05:22 +0200, Sasha Khapyorsky wrote: > saquery --smkey 0x0100000000000000 -m Thanks Sasha, I will forward this to the customer. Shirley From vlad at dev.mellanox.co.il Tue Oct 7 23:59:46 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 08 Oct 2008 08:59:46 +0200 Subject: [ofa-general] [ANNOUNCE] compat-dapl-1.2.11 and dapl-2.0.14 Release In-Reply-To: References: Message-ID: <48EC5A62.3010601@dev.mellanox.co.il> Davis, Arlin R wrote: > > Vlad, please pick up new packages and install following for OFED 1.4 > rc3: > > compat-dapl-1.2.11-1 > compat-dapl-devel-1.2.11-1 > dapl-2.0.14-1 > dapl-utils-2.0.14-1 > dapl-devel-2.0.14-1 > dapl-debuginfo-2.0.14-1 > > -arlin Done, Regards, Vladimir From vlad at lists.openfabrics.org Wed Oct 8 03:16:21 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 8 Oct 2008 03:16:21 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20081008-0200 daily build status Message-ID: <20081008101621.CEA20E60E25@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed
on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: Build failed on ppc64 with linux-2.6.24 Log: /home/vlad/tmp/ofa_1_4_kernel-20081008-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c: In function 'ehca_poll_eqs': /home/vlad/tmp/ofa_1_4_kernel-20081008-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:942: warning: passing argument 1 of 
'local_irq_save_ptr' from incompatible pointer type /home/vlad/tmp/ofa_1_4_kernel-20081008-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:946: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081008-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081008-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081008-0200_linux-2.6.24_ppc64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081008-0200_linux-2.6.24_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.24' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From hal.rosenstock at gmail.com Wed Oct 8 04:03:12 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 8 Oct 2008 07:03:12 -0400 Subject: [ofa-general] [infiniband-diags] add --loop_ports option to perfquery In-Reply-To: <1223420045.1197.117.camel@cardanus.llnl.gov> References: <1223420045.1197.117.camel@cardanus.llnl.gov> Message-ID: Al, On Tue, Oct 7, 2008 at 6:54 PM, Al Chu wrote: > Hey Sasha, > > We have a switch here that does not report the AllPortSelect flag as a > capability. It's pretty annoying typing each port on the switch or > always having to script around this one oddball switch we have. So I > added an option --loop_ports for perfquery. If you want to do something > to all the ports on the CA/Switch, but AllPortSelect isn't available, it > loops through all the available ports instead. Why not add simulated AllPortSelect for multiple ports rather than add another perquery option for this ? > There was already a workaround in the tool for a CA that did not support > the AllPortSelect flag. 
I get the feeling the workaround may have been > for a specific hardware, so I kept the workaround in there. > Al > > -- > Albert Chu > chu11 at llnl.gov > Computer Scientist > High Performance Systems Division > Lawrence Livermore National Laboratory > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > There are also 2 for loops which are not correct for some switches: for (i = 1; i <= num_ports; i++) -- Hal From hal.rosenstock at gmail.com Wed Oct 8 04:04:05 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 8 Oct 2008 07:04:05 -0400 Subject: ***SPAM*** Re: [ofa-general] Re: [PATCHv2][TRIVIAL] OpenSM: Display port number in decimal in log messages In-Reply-To: <20081007235303.GG7563@sashak.voltaire.com> References: <48E67B71.8050508@obsidianresearch.com> <20081007235303.GG7563@sashak.voltaire.com> Message-ID: Sasha, On Tue, Oct 7, 2008 at 7:53 PM, Sasha Khapyorsky wrote: > Hi Hal, > > On 14:07 Fri 03 Oct , Hal Rosenstock wrote: >> >> Cosmetic patch attached. Found some more cases... > > The patch is broken (it has parts from previous patch) and doesn't apply > clearly.So I was need to apply this to another branch and rebase. v2 means a replacement patch not built on top of the previous one. >Also please next time add 'Sign-off-by'. Sorry; forgot that this time. -- Hal > Sasha > >> >> -- Hal > >> diff --git a/opensm/opensm/osm_drop_mgr.c b/opensm/opensm/osm_drop_mgr.c >> index e827c26..8c6e7fb 100644 >> --- a/opensm/opensm/osm_drop_mgr.c >> +++ b/opensm/opensm/osm_drop_mgr.c >> @@ -103,7 +103,7 @@ static void drop_mgr_clean_physp(osm_sm_t * sm, IN osm_physp_t * p_physp) >> IB_LINK_ACTIVE) { >> OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, >> "Forcing new heavy sweep. 
Remote " >> - "port 0x%016" PRIx64 " port num: 0x%X " >> + "port 0x%016" PRIx64 " port num: %u " >> "was recognized in ACTIVE state\n", >> cl_ntoh64(p_remote_physp->port_guid), >> p_remote_physp->port_num); >> @@ -117,7 +117,7 @@ static void drop_mgr_clean_physp(osm_sm_t * sm, IN osm_physp_t * p_physp) >> p_remote_port->discovery_count = 0; >> OSM_LOG(sm->p_log, OSM_LOG_DEBUG, >> "Resetting discovery count of node: " >> - "0x%016" PRIx64 " port num:0x%X\n", >> + "0x%016" PRIx64 " port num:%u\n", >> cl_ntoh64(osm_node_get_node_guid >> (p_remote_physp->p_node)), >> p_remote_physp->port_num); >> @@ -125,9 +125,9 @@ static void drop_mgr_clean_physp(osm_sm_t * sm, IN osm_physp_t * p_physp) >> } >> >> OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, >> - "Unlinking local node 0x%016" PRIx64 ", port 0x%X" >> + "Unlinking local node 0x%016" PRIx64 ", port %u" >> "\n\t\t\t\tand remote node 0x%016" PRIx64 >> - ", port 0x%X\n", >> + ", port %u\n", >> cl_ntoh64(osm_node_get_node_guid(p_physp->p_node)), >> p_physp->port_num, >> cl_ntoh64(osm_node_get_node_guid >> @@ -139,7 +139,7 @@ static void drop_mgr_clean_physp(osm_sm_t * sm, IN osm_physp_t * p_physp) >> } >> >> OSM_LOG(sm->p_log, OSM_LOG_DEBUG, >> - "Clearing node 0x%016" PRIx64 " physical port number 0x%X\n", >> + "Clearing node 0x%016" PRIx64 " physical port number %u\n", >> cl_ntoh64(osm_node_get_node_guid(p_physp->p_node)), >> p_physp->port_num); >> >> @@ -186,7 +186,8 @@ static void __osm_drop_mgr_remove_port(osm_sm_t * sm, IN osm_port_t * p_port) >> if (p_sm != (osm_remote_sm_t *) cl_qmap_end(p_sm_guid_tbl)) { >> /* need to remove this item */ >> OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, >> - "Cleaned SM for port guid\n"); >> + "Cleaned SM for port guid 0x%016" PRIx64 "\n", >> + cl_ntoh64(port_guid)); >> >> free(p_sm); >> } >> diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c >> index 4c67c3f..45a100e 100644 >> --- a/opensm/opensm/osm_helper.c >> +++ b/opensm/opensm/osm_helper.c >> @@ -780,7 +780,7 @@ 
osm_dump_port_info(IN osm_log_t * const p_log, >> >> osm_log(p_log, log_level, >> "PortInfo dump:\n" >> - "\t\t\t\tport number.............0x%X\n" >> + "\t\t\t\tport number.............%u\n" >> "\t\t\t\tnode_guid...............0x%016" PRIx64 "\n" >> "\t\t\t\tport_guid...............0x%016" PRIx64 "\n" >> "\t\t\t\tm_key...................0x%016" PRIx64 "\n" >> @@ -790,7 +790,7 @@ osm_dump_port_info(IN osm_log_t * const p_log, >> "\t\t\t\tcapability_mask.........0x%X\n" >> "\t\t\t\tdiag_code...............0x%X\n" >> "\t\t\t\tm_key_lease_period......0x%X\n" >> - "\t\t\t\tlocal_port_num..........0x%X\n" >> + "\t\t\t\tlocal_port_num..........%u\n" >> "\t\t\t\tlink_width_enabled......0x%X\n" >> "\t\t\t\tlink_width_supported....0x%X\n" >> "\t\t\t\tlink_width_active.......0x%X\n" >> @@ -879,7 +879,7 @@ osm_dump_portinfo_record(IN osm_log_t * const p_log, >> "\t\t\t\tcapability_mask.........0x%X\n" >> "\t\t\t\tdiag_code...............0x%X\n" >> "\t\t\t\tm_key_lease_period......0x%X\n" >> - "\t\t\t\tlocal_port_num..........0x%X\n" >> + "\t\t\t\tlocal_port_num..........%u\n" >> "\t\t\t\tlink_width_enabled......0x%X\n" >> "\t\t\t\tlink_width_supported....0x%X\n" >> "\t\t\t\tlink_width_active.......0x%X\n" >> @@ -994,14 +994,14 @@ osm_dump_node_info(IN osm_log_t * const p_log, >> "\t\t\t\tbase_version............0x%X\n" >> "\t\t\t\tclass_version...........0x%X\n" >> "\t\t\t\tnode_type...............%s\n" >> - "\t\t\t\tnum_ports...............0x%X\n" >> + "\t\t\t\tnum_ports...............%u\n" >> "\t\t\t\tsys_guid................0x%016" PRIx64 "\n" >> "\t\t\t\tnode_guid...............0x%016" PRIx64 "\n" >> "\t\t\t\tport_guid...............0x%016" PRIx64 "\n" >> "\t\t\t\tpartition_cap...........0x%X\n" >> "\t\t\t\tdevice_id...............0x%X\n" >> "\t\t\t\trevision................0x%X\n" >> - "\t\t\t\tport_num................0x%X\n" >> + "\t\t\t\tport_num................%u\n" >> "\t\t\t\tvendor_id...............0x%X\n", >> p_ni->base_version, >> p_ni->class_version, >> @@ 
-1041,14 +1041,14 @@ osm_dump_node_record(IN osm_log_t * const p_log, >> "\t\t\t\tbase_version............0x%X\n" >> "\t\t\t\tclass_version...........0x%X\n" >> "\t\t\t\tnode_type...............%s\n" >> - "\t\t\t\tnum_ports...............0x%X\n" >> + "\t\t\t\tnum_ports...............%u\n" >> "\t\t\t\tsys_guid................0x%016" PRIx64 "\n" >> "\t\t\t\tnode_guid...............0x%016" PRIx64 "\n" >> "\t\t\t\tport_guid...............0x%016" PRIx64 "\n" >> "\t\t\t\tpartition_cap...........0x%X\n" >> "\t\t\t\tdevice_id...............0x%X\n" >> "\t\t\t\trevision................0x%X\n" >> - "\t\t\t\tport_num................0x%X\n" >> + "\t\t\t\tport_num................%u\n" >> "\t\t\t\tvendor_id...............0x%X\n" >> "\t\t\t\tNodeDescription\n" >> "\t\t\t\t%s\n", >> @@ -1489,8 +1489,8 @@ osm_dump_link_record(IN osm_log_t * const p_log, >> osm_log(p_log, log_level, >> "Link Record dump:\n" >> "\t\t\t\tfrom_lid................%u\n" >> - "\t\t\t\tfrom_port_num...........0x%X\n" >> - "\t\t\t\tto_port_num.............0x%X\n" >> + "\t\t\t\tfrom_port_num...........%u\n" >> + "\t\t\t\tto_port_num.............%u\n" >> "\t\t\t\tto_lid..................%u\n", >> cl_ntoh16(p_lr->from_lid), >> p_lr->from_port_num, >> @@ -1512,9 +1512,9 @@ osm_dump_switch_info(IN osm_log_t * const p_log, >> "\t\t\t\trand_cap................0x%X\n" >> "\t\t\t\tmcast_cap...............0x%X\n" >> "\t\t\t\tlin_top.................0x%X\n" >> - "\t\t\t\tdef_port................0x%X\n" >> - "\t\t\t\tdef_mcast_pri_port......0x%X\n" >> - "\t\t\t\tdef_mcast_not_port......0x%X\n" >> + "\t\t\t\tdef_port................%u\n" >> + "\t\t\t\tdef_mcast_pri_port......%u\n" >> + "\t\t\t\tdef_mcast_not_port......%u\n" >> "\t\t\t\tlife_state..............0x%X\n" >> "\t\t\t\tlids_per_port...........%u\n" >> "\t\t\t\tpartition_enf_cap.......0x%X\n" >> @@ -1549,9 +1549,9 @@ osm_dump_switch_info_record(IN osm_log_t * const p_log, >> "\t\t\t\trand_cap................0x%X\n" >> "\t\t\t\tmcast_cap...............0x%X\n" >> 
"\t\t\t\tlin_top.................0x%X\n" >> - "\t\t\t\tdef_port................0x%X\n" >> - "\t\t\t\tdef_mcast_pri_port......0x%X\n" >> - "\t\t\t\tdef_mcast_not_port......0x%X\n" >> + "\t\t\t\tdef_port................%u\n" >> + "\t\t\t\tdef_mcast_pri_port......%u\n" >> + "\t\t\t\tdef_mcast_not_port......%u\n" >> "\t\t\t\tlife_state..............0x%X\n" >> "\t\t\t\tlids_per_port...........%u\n" >> "\t\t\t\tpartition_enf_cap.......0x%X\n" >> @@ -1593,7 +1593,7 @@ osm_dump_pkey_block(IN osm_log_t * const p_log, >> "P_Key table dump:\n" >> "\t\t\tport_guid...........0x%016" PRIx64 "\n" >> "\t\t\tblock_num...........0x%X\n" >> - "\t\t\tport_num............0x%X\n\tP_Key Table: %s\n", >> + "\t\t\tport_num............%u\n\tP_Key Table: %s\n", >> cl_ntoh64(port_guid), block_num, port_num, buf_line); >> } >> } >> @@ -1621,8 +1621,8 @@ osm_dump_slvl_map_table(IN osm_log_t * const p_log, >> osm_log(p_log, log_level, >> "SLtoVL dump:\n" >> "\t\t\tport_guid............0x%016" PRIx64 "\n" >> - "\t\t\tin_port_num..........0x%X\n" >> - "\t\t\tout_port_num.........0x%X\n\tSL: | %s\n\tVL: | %s\n", >> + "\t\t\tin_port_num..........%u\n" >> + "\t\t\tout_port_num.........%u\n\tSL: | %s\n\tVL: | %s\n", >> cl_ntoh64(port_guid), >> in_port_num, out_port_num, buf_line1, buf_line2); >> } >> @@ -1651,7 +1651,7 @@ osm_dump_vl_arb_table(IN osm_log_t * const p_log, >> osm_log(p_log, log_level, >> "VLArb dump:\n" "\t\t\tport_guid...........0x%016" >> PRIx64 "\n" "\t\t\tblock_num...........0x%X\n" >> - "\t\t\tport_num............0x%X\n\tVL : | %s\n\tWEIGHT:| %s\n", >> + "\t\t\tport_num............%u\n\tVL : | %s\n\tWEIGHT:| %s\n", >> cl_ntoh64(port_guid), block_num, port_num, buf_line1, >> buf_line2); >> } >> diff --git a/opensm/opensm/osm_link_mgr.c b/opensm/opensm/osm_link_mgr.c >> index 452d151..98870cb 100644 >> --- a/opensm/opensm/osm_link_mgr.c >> +++ b/opensm/opensm/osm_link_mgr.c >> @@ -377,7 +377,7 @@ __osm_link_mgr_process_node(osm_sm_t * sm, >> if (link_state != IB_LINK_NO_CHANGE && >> 
link_state <= current_state) >> OSM_LOG(sm->p_log, OSM_LOG_DEBUG, >> - "Physical port 0x%X already %s. Skipping\n", >> + "Physical port %u already %s. Skipping\n", >> p_physp->port_num, >> ib_get_port_state_str(current_state)); >> else if (__osm_link_mgr_set_physp_pi(sm, p_physp, link_state)) >> diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c >> index fc8533d..c4cd632 100644 >> --- a/opensm/opensm/osm_mcast_mgr.c >> +++ b/opensm/opensm/osm_mcast_mgr.c >> @@ -628,7 +628,7 @@ static osm_mtree_node_t *__osm_mcast_mgr_branch(osm_sm_t * sm, >> */ >> if (depth > 1) { >> OSM_LOG(sm->p_log, OSM_LOG_DEBUG, >> - "Adding upstream port 0x%X\n", upstream_port); >> + "Adding upstream port %u\n", upstream_port); >> >> CL_ASSERT(upstream_port); >> osm_mcast_tbl_set(p_tbl, mlid_ho, upstream_port); >> @@ -662,7 +662,7 @@ static osm_mtree_node_t *__osm_mcast_mgr_branch(osm_sm_t * sm, >> continue; /* No routes down this port. */ >> >> OSM_LOG(sm->p_log, OSM_LOG_DEBUG, >> - "Routing %zu destinations via switch port 0x%X\n", >> + "Routing %zu destinations via switch port %u\n", >> count, i); >> >> /* >> @@ -716,7 +716,7 @@ static osm_mtree_node_t *__osm_mcast_mgr_branch(osm_sm_t * sm, >> >> OSM_LOG(sm->p_log, OSM_LOG_DEBUG, >> "Found leaf for port 0x%016" PRIx64 >> - " on switch port 0x%X\n", >> + " on switch port %u\n", >> cl_ntoh64(osm_port_get_guid (p_wobj->p_port)), >> i); >> >> diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c >> index 86710d1..a37ce0a 100644 >> --- a/opensm/opensm/osm_node_info_rcv.c >> +++ b/opensm/opensm/osm_node_info_rcv.c >> @@ -217,7 +217,7 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm, >> port_num != 0 && cl_qmap_count(&sm->p_subn->sw_guid_tbl) == 0) { >> OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, >> "Duplicate GUID found by link from a port to itself:" >> - "node 0x%" PRIx64 ", port number 0x%X\n", >> + "node 0x%" PRIx64 ", port number %u\n", >> cl_ntoh64(osm_node_get_node_guid(p_node)), port_num); >> 
p_physp = osm_node_get_physp_ptr(p_node, port_num); >> osm_dump_dr_path(sm->p_log, >> @@ -235,8 +235,8 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm, >> >> OSM_LOG(sm->p_log, OSM_LOG_DEBUG, >> "Creating new link between:\n\t\t\t\tnode 0x%" PRIx64 >> - ", port number 0x%X and\n\t\t\t\tnode 0x%" PRIx64 >> - ", port number 0x%X\n", >> + ", port number %u and\n\t\t\t\tnode 0x%" PRIx64 >> + ", port number %u\n", >> cl_ntoh64(osm_node_get_node_guid(p_node)), port_num, >> cl_ntoh64(p_ni_context->node_guid), p_ni_context->port_num); >> >> diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c >> index 8d292ab..a168f89 100644 >> --- a/opensm/opensm/osm_perfmgr.c >> +++ b/opensm/opensm/osm_perfmgr.c >> @@ -219,7 +219,7 @@ osm_perfmgr_mad_send_err_callback(void *bind_context, osm_madw_t * p_madw) >> p_mon_node = (__monitored_node_t *) p_node; >> >> OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C02: %s (0x%" PRIx64 >> - ") port %d\n", p_mon_node->name, p_mon_node->guid, port); >> + ") port %u\n", p_mon_node->name, p_mon_node->guid, port); >> >> if (pm->subn->opt.perfmgr_redir && p_madw->status == IB_TIMEOUT) { >> /* First, find the node in the monitored map */ >> @@ -228,8 +228,8 @@ osm_perfmgr_mad_send_err_callback(void *bind_context, osm_madw_t * p_madw) >> if (port > p_mon_node->redir_tbl_size) { >> cl_plock_release(pm->lock); >> OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C16: " >> - "Invalid port num %d for %s (GUID 0x%016" >> - PRIx64 ") num ports %d\n", port, p_mon_node->name, >> + "Invalid port num %u for %s (GUID 0x%016" >> + PRIx64 ") num ports %u\n", port, p_mon_node->name, >> p_mon_node->guid, p_mon_node->redir_tbl_size); >> goto Exit; >> } >> diff --git a/opensm/opensm/osm_pkey_rcv.c b/opensm/opensm/osm_pkey_rcv.c >> index 9e09806..d845c20 100644 >> --- a/opensm/opensm/osm_pkey_rcv.c >> +++ b/opensm/opensm/osm_pkey_rcv.c >> @@ -127,7 +127,7 @@ void osm_pkey_rcv_process(IN void *context, IN void *data) >> */ >> if (!p_physp) { >> OSM_LOG(sm->p_log, OSM_LOG_ERROR, 
"ERR 4807: " >> - "Got invalid port number 0x%X\n", port_num); >> + "Got invalid port number %u\n", port_num); >> goto Exit; >> } >> >> diff --git a/opensm/opensm/osm_port.c b/opensm/opensm/osm_port.c >> index 54a5fd1..48a12f9 100644 >> --- a/opensm/opensm/osm_port.c >> +++ b/opensm/opensm/osm_port.c >> @@ -322,7 +322,7 @@ osm_physp_calc_link_mtu(IN osm_log_t * p_log, IN const osm_physp_t * p_physp) >> ib_port_info_get_mtu_cap(&p_remote_physp->port_info); >> >> OSM_LOG(p_log, OSM_LOG_DEBUG, >> - "Remote port 0x%016" PRIx64 " port = 0x%X : " >> + "Remote port 0x%016" PRIx64 " port = %u : " >> "MTU = %u. This Port MTU: %u\n", >> cl_ntoh64(osm_physp_get_port_guid(p_remote_physp)), >> osm_physp_get_port_num(p_remote_physp), >> @@ -334,8 +334,8 @@ osm_physp_calc_link_mtu(IN osm_log_t * p_log, IN const osm_physp_t * p_physp) >> >> OSM_LOG(p_log, OSM_LOG_VERBOSE, >> "MTU mismatch between ports." >> - "\n\t\t\t\tPort 0x%016" PRIx64 ", port 0x%X" >> - " and port 0x%016" PRIx64 ", port 0x%X." >> + "\n\t\t\t\tPort 0x%016" PRIx64 ", port %u" >> + " and port 0x%016" PRIx64 ", port %u." 
>> "\n\t\t\t\tUsing lower MTU of %u\n", >> cl_ntoh64(osm_physp_get_port_guid(p_physp)), >> osm_physp_get_port_num(p_physp), >> @@ -769,7 +769,7 @@ osm_physp_set_pkey_tbl(IN osm_log_t * p_log, >> if (block_num >= max_blocks) { >> OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4108: " >> "Got illegal set for block number:%u " >> - "For GUID: %" PRIx64 " port number:0x%X\n", >> + "For GUID: %" PRIx64 " port number:%u\n", >> block_num, >> cl_ntoh64(p_physp->p_node->node_info.node_guid), >> p_physp->port_num); >> diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c >> index a820069..73afd8e 100644 >> --- a/opensm/opensm/osm_port_info_rcv.c >> +++ b/opensm/opensm/osm_port_info_rcv.c >> @@ -235,9 +235,9 @@ __osm_pi_rcv_process_switch_port(IN osm_sm_t * sm, >> >> OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, >> "Unlinking local node 0x%" PRIx64 >> - ", port 0x%X" >> + ", port %u" >> "\n\t\t\t\tand remote node 0x%" PRIx64 >> - ", port 0x%X\n", >> + ", port %u\n", >> cl_ntoh64(osm_node_get_node_guid >> (p_node)), port_num, >> cl_ntoh64(osm_node_get_node_guid >> @@ -292,13 +292,13 @@ __osm_pi_rcv_process_switch_port(IN osm_sm_t * sm, >> ib_get_err_str(status)); >> } else >> OSM_LOG(sm->p_log, OSM_LOG_DEBUG, >> - "Skipping SMP responder port 0x%X\n", >> + "Skipping SMP responder port %u\n", >> p_pi->local_port_num); >> break; >> >> default: >> OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0F03: " >> - "Unknown link state = %u, port = 0x%X\n", >> + "Unknown link state = %u, port = %u\n", >> ib_port_info_get_port_state(p_pi), >> p_pi->local_port_num); >> break; >> @@ -488,7 +488,7 @@ osm_pi_rcv_process_set(IN osm_sm_t * sm, IN osm_node_t * const p_node, >> >> OSM_LOG(sm->p_log, OSM_LOG_DEBUG, >> "Received logical SetResp() for GUID 0x%" PRIx64 >> - ", port num 0x%X" >> + ", port num %u" >> "\n\t\t\t\tfor parent node GUID 0x%" PRIx64 >> " TID 0x%" PRIx64 "\n", >> cl_ntoh64(port_guid), port_num, >> @@ -598,7 +598,7 @@ void osm_pi_rcv_process(IN void *context, IN void *data) >> 
most likely due to a subnet sweep in progress.
>> */
>> OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
>> - "Discovered port num 0x%X with GUID 0x%" PRIx64
>> + "Discovered port num %u with GUID 0x%" PRIx64
>> " for parent node GUID 0x%" PRIx64
>> ", TID 0x%" PRIx64 "\n",
>> port_num, cl_ntoh64(port_guid),
>> @@ -613,7 +613,7 @@ void osm_pi_rcv_process(IN void *context, IN void *data)
>> */
>> if (!p_physp) {
>> OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
>> - "Initializing port number 0x%X\n", port_num);
>> + "Initializing port number %u\n", port_num);
>> p_physp = &p_node->physp_table[port_num];
>> osm_physp_init(p_physp,
>> port_guid,
>> diff --git a/opensm/opensm/osm_sa_link_record.c b/opensm/opensm/osm_sa_link_record.c
>> index 9b1ad90..8e74297 100644
>> --- a/opensm/opensm/osm_sa_link_record.c
>> +++ b/opensm/opensm/osm_sa_link_record.c
>> @@ -194,8 +194,8 @@ __osm_lr_rcv_get_physp_link(IN osm_sa_t * sa,
>> goto Exit;
>>
>> OSM_LOG(sa->p_log, OSM_LOG_DEBUG, "Acquiring link record\n"
>> - "\t\t\t\tsrc port 0x%" PRIx64 " (port 0x%X)"
>> - ", dest port 0x%" PRIx64 " (port 0x%X)\n",
>> + "\t\t\t\tsrc port 0x%" PRIx64 " (port %u)"
>> + ", dest port 0x%" PRIx64 " (port %u)\n",
>> cl_ntoh64(osm_physp_get_port_guid(p_src_physp)), src_port_num,
>> cl_ntoh64(osm_physp_get_port_guid(p_dest_physp)),
>> dest_port_num);
>> diff --git a/opensm/opensm/osm_sa_pkey_record.c b/opensm/opensm/osm_sa_pkey_record.c
>> index bba01b4..b694c8c 100644
>> --- a/opensm/opensm/osm_sa_pkey_record.c
>> +++ b/opensm/opensm/osm_sa_pkey_record.c
>> @@ -93,7 +93,7 @@ __osm_sa_pkey_create(IN osm_sa_t * sa,
>>
>> OSM_LOG(sa->p_log, OSM_LOG_DEBUG,
>> "New P_Key table for: port 0x%016" PRIx64
>> - ", lid %u, port 0x%X Block:%u\n",
>> + ", lid %u, port %u Block:%u\n",
>> cl_ntoh64(osm_physp_get_port_guid(p_physp)),
>> cl_ntoh16(lid), osm_physp_get_port_num(p_physp), block);
>>
>> diff --git a/opensm/opensm/osm_sa_portinfo_record.c b/opensm/opensm/osm_sa_portinfo_record.c
>> index b263ce8..2ac2611 100644
>> ---
a/opensm/opensm/osm_sa_portinfo_record.c
>> +++ b/opensm/opensm/osm_sa_portinfo_record.c
>> @@ -94,7 +94,7 @@ __osm_pir_rcv_new_pir(IN osm_sa_t * sa,
>>
>> OSM_LOG(sa->p_log, OSM_LOG_DEBUG,
>> "New PortInfoRecord: port 0x%016" PRIx64
>> - ", lid %u, port 0x%X\n",
>> + ", lid %u, port %u\n",
>> cl_ntoh64(osm_physp_get_port_guid(p_physp)),
>> cl_ntoh16(lid), osm_physp_get_port_num(p_physp));
>>
>> diff --git a/opensm/opensm/osm_sa_slvl_record.c b/opensm/opensm/osm_sa_slvl_record.c
>> index 46b66a6..886cb9b 100644
>> --- a/opensm/opensm/osm_sa_slvl_record.c
>> +++ b/opensm/opensm/osm_sa_slvl_record.c
>> @@ -100,7 +100,7 @@ __osm_sa_slvl_create(IN osm_sa_t * sa,
>>
>> OSM_LOG(sa->p_log, OSM_LOG_DEBUG,
>> "New SLtoVL Map for: OUT port 0x%016" PRIx64
>> - ", lid 0x%X, port 0x%X to In Port:%u\n",
>> + ", lid 0x%X, port %u to In Port:%u\n",
>> cl_ntoh64(osm_physp_get_port_guid(p_physp)),
>> cl_ntoh16(lid), osm_physp_get_port_num(p_physp), in_port_idx);
>>
>> diff --git a/opensm/opensm/osm_sa_vlarb_record.c b/opensm/opensm/osm_sa_vlarb_record.c
>> index e2aafc1..ae59b13 100644
>> --- a/opensm/opensm/osm_sa_vlarb_record.c
>> +++ b/opensm/opensm/osm_sa_vlarb_record.c
>> @@ -100,7 +100,7 @@ __osm_sa_vl_arb_create(IN osm_sa_t * sa,
>>
>> OSM_LOG(sa->p_log, OSM_LOG_DEBUG,
>> "New VLArbitration for: port 0x%016" PRIx64
>> - ", lid %u, port 0x%X Block:%u\n",
>> + ", lid %u, port %u Block:%u\n",
>> cl_ntoh64(osm_physp_get_port_guid(p_physp)),
>> cl_ntoh16(lid), osm_physp_get_port_num(p_physp), block);
>>
>> diff --git a/opensm/opensm/osm_slvl_map_rcv.c b/opensm/opensm/osm_slvl_map_rcv.c
>> index 000c2a6..e7f1cd2 100644
>> --- a/opensm/opensm/osm_slvl_map_rcv.c
>> +++ b/opensm/opensm/osm_slvl_map_rcv.c
>> @@ -135,7 +135,7 @@ void osm_slvl_rcv_process(IN void *context, IN void *p_data)
>> */
>> if (!p_physp) {
>> OSM_LOG(sm->p_log, OSM_LOG_ERROR,
>> - "Got invalid port number 0x%X\n", out_port_num);
>> + "Got invalid port number %u\n", out_port_num);
>> goto Exit;
>> }
>>
>> diff
--git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
>> index 0213915..b7c2c77 100644
>> --- a/opensm/opensm/osm_trap_rcv.c
>> +++ b/opensm/opensm/osm_trap_rcv.c
>> @@ -129,7 +129,7 @@ osm_trap_rcv_aging_tracker_callback(IN uint64_t key,
>> p_physp = get_physp_by_lid_and_num(sm, lid, port_num);
>> if (!p_physp)
>> OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
>> - "Cannot find port num:0x%X with lid:%u\n",
>> + "Cannot find port num:%u with lid:%u\n",
>> port_num, lid);
>> /* make sure the physp is still valid */
>> /* If the health port was false - set it to true */
>> diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
>> index bde0c29..be8e724 100644
>> --- a/opensm/opensm/osm_ucast_mgr.c
>> +++ b/opensm/opensm/osm_ucast_mgr.c
>> @@ -152,7 +152,7 @@ __osm_ucast_mgr_process_neighbor(IN osm_ucast_mgr_t * const p_mgr,
>>
>> OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
>> "Node 0x%" PRIx64 ", remote node 0x%" PRIx64
>> - ", port 0x%X, remote port 0x%X\n",
>> + ", port %u, remote port %u\n",
>> cl_ntoh64(osm_node_get_node_guid(p_this_sw->p_node)),
>> cl_ntoh64(osm_node_get_node_guid(p_remote_sw->p_node)),
>> port_num, remote_port_num);
>> @@ -273,7 +273,7 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr,
>> osm_physp_t *p = osm_node_get_physp_ptr(p_sw->p_node, port);
>>
>> OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
>> - "Routing LID %u to port 0x%X"
>> + "Routing LID %u to port %u"
>> " for switch 0x%" PRIx64 "\n",
>> lid_ho, port, cl_ntoh64(node_guid));
>>
>> diff --git a/opensm/opensm/osm_vl_arb_rcv.c b/opensm/opensm/osm_vl_arb_rcv.c
>> index 725cc3b..674c2d6 100644
>> --- a/opensm/opensm/osm_vl_arb_rcv.c
>> +++ b/opensm/opensm/osm_vl_arb_rcv.c
>> @@ -132,7 +132,7 @@ void osm_vla_rcv_process(IN void *context, IN void *data)
>> */
>> if (!p_physp) {
>> OSM_LOG(sm->p_log, OSM_LOG_ERROR,
>> - "Got invalid port number 0x%X\n", port_num);
>> + "Got invalid port number %u\n", port_num);
>> goto Exit;
>> }
>>
>
>
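For readers skimming the patch above, here is a minimal sketch of what the format change does to a log line. The helper names `format_port_hex` and `format_port_dec` are hypothetical and are not OpenSM functions; only the `0x%X` and `%u` conversions are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helpers for illustration only; they are not OpenSM functions. */

/* Old style: port number printed in hex, as in the removed "-" lines. */
static int format_port_hex(char *buf, size_t len, uint8_t port_num)
{
	assert(buf != NULL && len > 0);
	/* uint8_t is promoted in varargs; cast so the argument matches %X */
	return snprintf(buf, len, "port 0x%X", (unsigned)port_num);
}

/* New style: port number printed in decimal, as in the added "+" lines. */
static int format_port_dec(char *buf, size_t len, uint8_t port_num)
{
	assert(buf != NULL && len > 0);
	/* same cast, this time matching %u */
	return snprintf(buf, len, "port %u", (unsigned)port_num);
}
```

For port 18 the first helper writes `port 0x12` and the second writes `port 18`, which is the form most diagnostic tools and switch labels use for port numbers.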
_______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From hal.rosenstock at gmail.com Wed Oct 8 04:04:58 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 8 Oct 2008 07:04:58 -0400
Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer overflow
In-Reply-To: <20081008012149.GK7563@sashak.voltaire.com>
References: <20081008012149.GK7563@sashak.voltaire.com>
Message-ID:

Sasha,

On Tue, Oct 7, 2008 at 9:21 PM, Sasha Khapyorsky wrote:
>
> Lash first overflows its buffer and then check for the size (based on
> number VLs used). Fix the check order.
>
> Signed-off-by: Sasha Khapyorsky
> ---
> opensm/opensm/osm_ucast_lash.c | 13 ++++++-------
> 1 files changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c
> index ce3982f..03cfc1f 100644
> --- a/opensm/opensm/osm_ucast_lash.c
> +++ b/opensm/opensm/osm_ucast_lash.c
> @@ -979,6 +979,12 @@ static int lash_core(lash_t * p_lash)
> switches[dest_switch]->routing_table[i].lane = v_lane;
>
> if (cycle_found == 1 || cycle_found2 == 1) {
> + if (lanes_needed + 1 > p_lash->vl_min) {
> + lanes_needed++;
> + goto Error_Not_Enough_Lanes;
> + } else
> + lanes_needed++;
> +
> generate_cdg_for_sp(p_lash, i, dest_switch, v_lane);
> generate_cdg_for_sp(p_lash, dest_switch, i, v_lane);
>
> @@ -987,13 +993,6 @@ static int lash_core(lash_t * p_lash)
> set_temp_depend_to_permanent_for_sp(p_lash, dest_switch, i,
> v_lane);
>
> - if (lanes_needed + 1 > p_lash->vl_min) {
> - lanes_needed++;
> - goto Error_Not_Enough_Lanes;
> - } else
> - lanes_needed++;

Minor simplification as it seems like this could just be:

if (++lanes_needed > p_lash->vl_min)
goto Error_Not_Enough_Lanes;

-- Hal

> -
> - // goto error exit with message
> p_lash->num_mst_in_lane[v_lane]++;
>
p_lash->num_mst_in_lane[v_lane]++;
> }
> --
> 1.6.0.1.196.g01914
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From hal.rosenstock at gmail.com Wed Oct 8 04:08:14 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 8 Oct 2008 07:08:14 -0400
Subject: [ofa-general] ***SPAM*** ibdm network topology format
In-Reply-To: <20081008003239.GI7563@sashak.voltaire.com>
References: <20080930121252.GA7396@sashak.voltaire.com> <829ded920810010207r475d82abu269d47cd3baddb3f@mail.gmail.com> <20081001203813.GL7396@sashak.voltaire.com> <20081002022430.GQ7396@sashak.voltaire.com> <20081002170033.GI25831@sashak.voltaire.com> <20081008003239.GI7563@sashak.voltaire.com>
Message-ID:

Sasha,

On Tue, Oct 7, 2008 at 8:32 PM, Sasha Khapyorsky wrote:
> Hi Hal,
>
> On 08:26 Mon 06 Oct , Hal Rosenstock wrote:
>> >
>> > Me too. See below.
>> >
>> >>> > Somehow it works without ibsim - so I suspect user_mad handles it.
>> >>> >
>> >>> > (Hal, could you clarify?)
>> >>>
>> >>> The kernel (user_mad/mad) does not change the requested registrations
>> >>> but I'm not sure I understand the question you are asking to be
>> >>> clarified. Is that what you're asking ?
>> >>
>> >> ibis works somehow with real stack. It registers 0x1 class only and
>> >> uses direct routing SMPs. Do you have any idea about why
>> >> (osm_vendor_idumad and/or libibumad don't help)?
>> >
>> > libibumad umad_register does not do anything that would affect this
>> > either. I can only conclude there must be something in ibutils that
>> > fixes this if it does work with the real stack. It shouldn't be too
>> > hard to track down where that registration for class 0x81 comes from.
>>
>> Are you sure this is the only registration and not DR class too ?
>
> I'm not sure I understood the question.
But I was about registration
> (or more accurate not registration) of class 0x81 by ibutils and by any
> lower layer up to kernel.

The question was whether you are sure that class 0x81 is not being
registered with ibdiagnet ? Was this actually verified at some level
rather than code inspection of ibutils ? Just to be sure...

>> That's the first thing to confirm or maybe you've already confirmed
>> this and it wasn't clear to me in what you wrote. If so, I have a
>> theory about what could be occurring. It may be the case that it is an
>> effect of the kernel MAD layer in that a MAD agent can send any class
>> and when using request/response it matches on transaction ID which
>> contains the MAD agent. Unsolicited messages on that other class
>> wouldn't get through though. I just ran a simple test of this and that
>> appears to be the case.
>
> This could explain the phenomena. And then it seems that similar
> mechanism should be implemented in umad2sim.

Yes.

-- Hal

> Sasha
>

From chu11 at llnl.gov Wed Oct 8 03:26:29 2008
From: chu11 at llnl.gov (Al Chu)
Date: Wed, 08 Oct 2008 06:26:29 -0400
Subject: [ofa-general] [infiniband-diags] add --loop_ports option to perfquery
In-Reply-To:
References: <1223420045.1197.117.camel@cardanus.llnl.gov>
Message-ID: <1223461589.8503.19.camel@whatsup>

Hey Hal,

On Wed, 2008-10-08 at 07:03 -0400, Hal Rosenstock wrote:
> Al,
>
> On Tue, Oct 7, 2008 at 6:54 PM, Al Chu wrote:
> > Hey Sasha,
> >
> > We have a switch here that does not report the AllPortSelect flag as a
> > capability. It's pretty annoying typing each port on the switch or
> > always having to script around this one oddball switch we have. So I
> > added an option --loop_ports for perfquery. If you want to do something
> > to all the ports on the CA/Switch, but AllPortSelect isn't available, it
> > loops through all the available ports instead.
>
> Why not add simulated AllPortSelect for multiple ports rather than add
> another perfquery option for this ?

I did try that, and it did seem to work for the switches we had. But
when I read the IB spec, it said something to the effect that if a
system doesn't support AllPortSelect, setting the PortSelect field to
0xFF was undefined behavior.

> > There was already a workaround in the tool for a CA that did not support
> > the AllPortSelect flag. I get the feeling the workaround may have been
> > for a specific hardware, so I kept the workaround in there.
>
> > Al
> >
> > --
> > Albert Chu
> > chu11 at llnl.gov
> > Computer Scientist
> > High Performance Systems Division
> > Lawrence Livermore National Laboratory
> >
> > _______________________________________________
> > general mailing list
> > general at lists.openfabrics.org
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >
> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> >
>
> There are also 2 for loops which are not correct for some switches:
> for (i = 1; i <= num_ports; i++)

I guess I've never seen a switch that doesn't go from 1 to num_ports.
Is there something else I need to handle?

Al

> -- Hal
>
--
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

From hal.rosenstock at gmail.com Wed Oct 8 06:44:45 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 8 Oct 2008 09:44:45 -0400
Subject: [ofa-general] [infiniband-diags] add --loop_ports option to perfquery
In-Reply-To: <1223461589.8503.19.camel@whatsup>
References: <1223420045.1197.117.camel@cardanus.llnl.gov> <1223461589.8503.19.camel@whatsup>
Message-ID:

Hi Al,

On Wed, Oct 8, 2008 at 6:26 AM, Al Chu wrote:
> Hey Hal,
>
> On Wed, 2008-10-08 at 07:03 -0400, Hal Rosenstock wrote:
>> Al,
>>
>> On Tue, Oct 7, 2008 at 6:54 PM, Al Chu wrote:
>> > Hey Sasha,
>> >
>> > We have a switch here that does not report the AllPortSelect flag as a
>> > capability.
It's pretty annoying typing each port on the switch or
>> > always having to script around this one oddball switch we have. So I
>> > added an option --loop_ports for perfquery. If you want to do something
>> > to all the ports on the CA/Switch, but AllPortSelect isn't available, it
>> > loops through all the available ports instead.
>>
>> Why not add simulated AllPortSelect for multiple ports rather than add
>> another perfquery option for this ?
>
> I did try that, and it did seem to work for the switches we had. But
> when I read the IB spec, it said something to the effect that if a
> system doesn't support AllPortSelect, setting the PortSelect field to
> 0xFF was undefined behavior.

I was suggesting that the emulation support (when AllPortSelect is not
supported) be enhanced for multiple ports and work on both CAs and all
switches. The one difference is one response for AllPortSelect
(whether emulated or not) v. many responses for port loop.

>> > There was already a workaround in the tool for a CA that did not support
>> > the AllPortSelect flag. I get the feeling the workaround may have been
>> > for a specific hardware, so I kept the workaround in there.
>>
>> > Al
>> >
>> > --
>> > Albert Chu
>> > chu11 at llnl.gov
>> > Computer Scientist
>> > High Performance Systems Division
>> > Lawrence Livermore National Laboratory
>> >
>> > _______________________________________________
>> > general mailing list
>> > general at lists.openfabrics.org
>> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>> >
>> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>> >
>>
>> There are also 2 for loops which are not correct for some switches:
>> for (i = 1; i <= num_ports; i++)
>
> I guess I've never seen a switch that doesn't go from 1 to num_ports.
> Is there something else I need to handle?

Yes, per the spec, enhanced SP0 supports PortCounters. All your
switches likely support AllPortSelect so it's not an issue there.

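A minimal sketch of the kind of loop the enhanced SP0 remark points at, assuming the caller already knows whether the switch has an enhanced switch port 0 (the IB spec reports this in the EnhancedPort0 bit of SwitchInfo). The names `loop_ports`, `port_query_fn` and `record_port` are hypothetical, chosen for illustration; they are not part of perfquery or libibmad.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callback type: query one port, return 0 on success. */
typedef int (*port_query_fn)(int port, void *ctx);

/*
 * Sketch only: visit every port that can report PortCounters.
 * Port 0 is included only when the node has an enhanced switch port 0;
 * otherwise the loop starts at port 1, as in "for (i = 1; i <= num_ports; i++)".
 * Returns the number of ports queried, or -1 on the first failed query.
 */
static int loop_ports(int num_ports, bool enhanced_sp0,
		      port_query_fn query, void *ctx)
{
	int first = enhanced_sp0 ? 0 : 1;
	int queried = 0;
	int i;

	for (i = first; i <= num_ports; i++) {
		if (query(i, ctx) != 0)
			return -1;
		queried++;
	}
	return queried;
}

/* Example callback: records each visited port as a bit in a 64-bit mask. */
static int record_port(int port, void *ctx)
{
	uint64_t *mask = ctx;

	if (port < 0 || port > 63)
		return -1;
	*mask |= (uint64_t)1 << port;
	return 0;
}
```

With `num_ports = 4`, the loop visits ports 1 to 4 on an ordinary switch and ports 0 to 4 when `enhanced_sp0` is true. Unlike a single AllPortSelect query, this produces one response per port.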
-- Hal

> Al
>
>> -- Hal
>>
> --
> Albert Chu
> chu11 at llnl.gov
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory
>
>

From dledford at redhat.com Wed Oct 8 08:12:47 2008
From: dledford at redhat.com (Doug Ledford)
Date: Wed, 08 Oct 2008 11:12:47 -0400
Subject: [ofa-general] OFED Roll
In-Reply-To: <9A1DE9E267DB43339B1589572062D0AF@inspiron9100>
References: <48E4F93A.8040309@ibt.unam.mx> <9A1DE9E267DB43339B1589572062D0AF@inspiron9100>
Message-ID: <1223478767.11102.234.camel@firewall.xsintricity.com>

On Mon, 2008-10-06 at 11:57 -0400, publications wrote:
> Am I correct that the Cisco OFED Roll installs Infiniband but not Infiniband
> over IP? Does it just use RDMA as a transport?

You have your acronym reversed in your mind. There is no such thing as
Infiniband over IP, it's IP over IB (IPoIB).

> The OFED download from Openfabrics installs IB over IP and I prefer not to
> use it since the latencies are double that of RDMA and the throughput is
> about one half of RDMA.

The presence of IPoIB does not slow down native IB communications.
You're right that it's half the improvement of actual RDMA, but that's
because it's only there to allow IP based apps to run over IB unchanged.
If you write your app to use RDMA, then it uses RDMA and gets the full
benefit of RDMA regardless of any other apps using IPoIB.

> So, another question. Suppose I have already installed OFED 1.3.1 with IB
> over IP. How do I configure my system (including the ib0.conf and other
> conf files) to use RDMA rather than IB over IP?

See above. You have nothing to do. It's properly configured as it is.

> Thanks for your help.
> > Jim > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part URL: From dledford at redhat.com Wed Oct 8 08:56:19 2008 From: dledford at redhat.com (Doug Ledford) Date: Wed, 08 Oct 2008 11:56:19 -0400 Subject: [ofa-general] [Fwd: [PATCH] ib: release locks in the proper order] Message-ID: <1223481379.11102.241.camel@firewall.xsintricity.com> Forwarding a patch written by one of our real time kernel guys. Description of the issue is with the patch. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- An embedded message was scrubbed... From: Steven Rostedt Subject: ib: release locks in the proper order Date: Wed, 08 Oct 2008 11:55:52 -0400 Size: 1732 URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part URL: From mike.marty at gmail.com Wed Oct 8 09:07:28 2008 From: mike.marty at gmail.com (Mike Marty) Date: Wed, 8 Oct 2008 11:07:28 -0500 Subject: [ofa-general] ***SPAM*** manually patching/installing ofed kernel Message-ID: <229af89c0810080907k145a8fcibbf8b488b9b0d6a5@mail.gmail.com> I am running a Ubuntu-based system. Trying to use the install script, invoked with "sudo perl install.pl", fails with the errors pasted below. 
I previously compiled and installed a vanilla 2.6.26 kernel.

Is there any documentation on manually integrating the ofed kernel
modifications? I installed the ofa_kernel-1.4 SRPM and see that the
ofa_kernel-1.4 directory in /usr/src/rpm/SOURCES contains a mix of
files, patches for older kernels, and a Makefile. Presumably I can't
just copy these into my kernel source and go.

Eventually I will need to integrate the ofa_kernel stuff with a custom
Linux kernel tree, but for now, I am just trying to use a vanilla
kernel.

Thank you,
Mike

mikemarty at mikemarty-msn:/usr/src/OFED-1.4-rc2$ sudo perl install.pl

OFED Distribution Software Installation Menu

1) Basic (OFED modules and basic user level libraries)
2) HPC (OFED modules and libraries, MPI and diagnostic tools)
3) All packages (all of Basic, HPC)
4) Customize
Q) Exit

Select Option [1-4]:1
error: open of /usr/src/OFED-1.4-rc2/RPMS/file failed: No such file or directory
error: open of /usr/src/OFED-1.4-rc2/RPMS/file failed: No such file or directory
error: open of Ubuntu failed: No such file or directory
error: open of Ubuntu failed: No such file or directory
error: open of is failed: No such file or directory
error: open of is failed: No such file or directory
error: open of not failed: No such file or directory
error: open of not failed: No such file or directory
error: open of owned failed: No such file or directory
error: open of owned failed: No such file or directory
error: open of by failed: No such file or directory
error: open of by failed: No such file or directory
error: open of any failed: No such file or directory
error: open of any failed: No such file or directory
gcc rpm is required to build libibverbs
glibc-devel rpm is required to build libibverbs
libstdc++ rpm is required to build libibverbs
mikemarty at mikemarty-msn:/usr/src/OFED-1.4-rc2$ ls
BUILD_ID LICENSE README.txt RPMS SOURCES SRPMS docs install.pl ofed.conf uninstall.sh
mikemarty at mikemarty-msn:/usr/src/OFED-1.4-rc2$
-------------- next
part --------------
An HTML attachment was scrubbed...
URL:

From Thomas.Talpey at netapp.com Wed Oct 8 10:01:12 2008
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Wed, 08 Oct 2008 13:01:12 -0400
Subject: [ofa-general] Fwd: [PATCH 00/15] RPC/RDMA patchset for next merge window
Message-ID:

FYI, and comments to linux-nfs at vger.kernel.org if any. BTW, linux-nfs
is also available at http://news.gmane.org/gmane.linux.nfs .

Tom.

> ---------- Forwarded Message ----------
>From: Tom Talpey
>Subject: [PATCH 00/15] RPC/RDMA patchset for next merge window
>To: linux-nfs at vger.kernel.org
>Date: Wed, 08 Oct 2008 11:46:53 -0400
>User-Agent: StGIT/0.14.2
>Sender: linux-nfs-owner at vger.kernel.org
>List-ID:
>
>The following series updates the RPC/RDMA (NFS/RDMA) client to
>support the new rdma "fastreg" memory registration mode, which
>fixes operation on the Chelsio cxgb3 adapter and strengthens
>the safety of others.
>
>Additionally, it fixes many smaller issues in the code improving
>its robustness and performance. Except for supporting large (>32KB)
>rpc's, it addresses all known issues in the client.
>
>It's my hope this patchset can be queued for the upcoming merge
>window. It has been extensively tested with both IB and iWARP
>adapters under Connectathon and heavy parallel load.
>
>This patchset applies to the current nfs-2.6 git;
> (4330ed8ed4da360ac1ca14b0fddff4c05b10de16)
>
>---
>
>Tom Talpey (14):
> RPC/RDMA: optionally emit useful transport info upon connect/disconnect.
> RPC/RDMA: reformat a debug printk to keep lines together.
> RPC/RDMA: harden connection logic against missing/late rdma_cm upcalls.
> RPC/RDMA: correct a 5 second pause on reconnecting to an idle server.
> RPC/RDMA: fix connect/reconnect resource leak.
> RPC/RDMA: return a consistent error to mount, when connect fails.
> RPC/RDMA: adhere to protocol for unpadded client trailing write chunks.
> RPC/RDMA: avoid an oops due to disconnect racing with async upcalls.
> RPC/RDMA: maintain the RPC task bytes-sent statistic. > RPC/RDMA: suppress retransmit on RPC/RDMA clients. > RPC/RDMA: support FRMR client memory registration. > RPC/RDMA: check selected memory registration mode at runtime. > RPC/RDMA: add data types and new FRMR memory registration enum. > RPC/RDMA: refactor the inline memory registration code. > >Tom Tucker (1): > RPC/RDMA: fix connection IRD/ORD setting > > > net/sunrpc/xprtrdma/rpc_rdma.c | 30 +- > net/sunrpc/xprtrdma/transport.c | 39 +- > net/sunrpc/xprtrdma/verbs.c | 737 +++++++++++++++++++++++++++------------ > net/sunrpc/xprtrdma/xprt_rdma.h | 12 + > 4 files changed, 570 insertions(+), 248 deletions(-) > >-- > >Tom. >-- >To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >the body of a message to majordomo at vger.kernel.org >More majordomo info at http://vger.kernel.org/majordomo-info.html > ---------- End of Forwarded Message ---------- From cameron at harr.org Wed Oct 8 10:13:49 2008 From: cameron at harr.org (Cameron Harr) Date: Wed, 08 Oct 2008 11:13:49 -0600 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48EBE6B6.4060804@mellanox.com> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> Message-ID: <48ECEA4D.7080504@harr.org> Vu Pham wrote: > Cameron Harr wrote: >> >> One thing that makes results hard to interpret is that they vary >> enormously. I've been doing more testing with 3 physical LUNs >> (instead of two) on the target, srpt_thread=0, and changing between >> scst_thread=[1,2,3]. 
With scst_thread=1, I'm fairly low (50K IOPs),
>> while at 2 and three threads, the results are higher, though in all
>> cases, the context switches are low, often less than 1:1.
>>
>
> Can you test again with srpt_thread=0,1 and scst_threads=1,2,3 in
> NULLIO mode (with 1,2,3 export NULLIO luns)

srpt_thread=0:
scst_t: |     1     |      2      |      3      |
--------|-----------|-------------|-------------|
1 LUN*  |    54K    |   54K-75K   |   54K-75K   |
2 LUNs* | 120K-200K | 150K-200K** | 120K-180K** |
3 LUNs* | 170K-195K |  160K-195K  | 130K-170K** |

srpt_thread=1:
scst_t: |     1     |      2      |      3      |
--------|-----------|-------------|-------------|
1 LUN*  |    74K    |     54K     |     55K     |
2 LUNs* | 140K-190K |  130K-200K  |  150K-220K  |
3 LUNs* | 170K-195K |  170K-195K  |  175K-195K  |

* a FIO (benchmark) process was run for each LUN, so when there were 3
LUNs, there were three FIO processes running simultaneously.
** Sometimes the benchmark "zombied" (process doing no work, but process
can't be killed) after running a certain amount of time. However, it
wasn't repeatable in a reliable way, so I mark that this particular run
has zombied before.

- Note 1: There were a number of outliers (often between 98K and 230K),
but I tried to capture where the bulk of the activity happened. It's
still somewhat of a rough guess though. Where the range is large, it
usually means the results were just really scattered.

Summary: It's hard to draw a good summary due to the variation of
results. I would say the runs with srpt_thread=1 tended to have fewer
outliers at the beginning, but as time went on, they scattered as well.
Running with 2 or 3 threads almost seems to be a toss-up.

>>
>> Also a little disconcerting is that my average request size on the
>> target has gotten larger. I'm always writing 512B packets, and when I
>> run on one initiator, the average reqsz is around 600-800B. When I
>> add an initiator, the average reqsz basically doubles and is now
>> around 1200 - 1600B.
I'm specifying direct IO in the test and scst is
>> configured as blockio (and thus direct IO), but it appears something
>> is cached at some point and seems to be coalesced when another
>> initiator is involved. Does this seem odd or normal? This shows true
>> whether the initiators are writing to different partitions on the
>> same LUN or the same LUN with no partitions.
>
> What io scheduler are you running on local storage? Since you are
> using blockio you should play around with io scheduler's tuned
> parameters (for example deadline scheduler: front_merges,
> write_starved,...) Please see ~/Documentation/block/*.txt

I'm using CFQ. Months ago, I tried different schedulers with their
default options and saw basically no difference. I can try some of that
again; however I don't believe I can tune the schedulers because my back
end doesn't give me a "queue" directory in /sys/block//

-Cameron

From cameron at harr.org Wed Oct 8 10:27:16 2008
From: cameron at harr.org (Cameron Harr)
Date: Wed, 08 Oct 2008 11:27:16 -0600
Subject: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <48ECEA4D.7080504@harr.org>
References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org>
Message-ID: <48ECED74.9090908@harr.org>

Cameron Harr wrote:
> srpt_thread=1:
> scst_t: |     1     |      2      |      3      |
> --------|-----------|-------------|-------------|
> 1 LUN*  |    74K    |     54K     |     55K     |

This row should probably be a range of 54K-74K, however all results
would be very tight around 54K or 74K for a while and then shift to the
other number, with literally no outliers.

From michael.heinz at qlogic.com Wed Oct 8 10:45:02 2008 From: michael.heinz at qlogic.com (Mike Heinz) Date: Wed, 8 Oct 2008 12:45:02 -0500 Subject: [ofa-general] Allowing end-users to query for fabric information In-Reply-To: References: Message-ID: Well, all that I want the tool to do is collect information; I don't want users to be able to modify the SM settings - but you have to admit, it could be useful to someone trying to tune their fabric to be able (for example) get a report on the length of the paths between any two nodes. So, the critical issue is that some SMs don't implement the M-Key properly? -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania -----Original Message----- From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] Sent: Monday, October 06, 2008 12:00 PM To: Mike Heinz Cc: general at lists.openfabrics.org Subject: Re: [ofa-general] Allowing end-users to query for fabric information On Mon, Oct 6, 2008 at 11:27 AM, Mike Heinz wrote: > Well, > > I guess that's my point - I'd like to be able to create tools for > non-root users that would collect interesting information about the > fabric. As far as I know, this should be a safe operation, because the > SA should be protected by the m-key - but it seems that the policy in > OFED is that this is not a safe operation and access must be tightly > controlled. Do you mean SM or SA ? Subverting the SM is not a good idea. The SM is the central point for setting up SM attributes. Policy needs to be instilled through the SM. There are some SA attributes which are somewhat dangerous too as they are essentially writable as well from an end node. Furthermore, most fabrics do not utilize MKey protection so the second level is not there yet and only the most primitive form of this is available within some SMs. 
> While it's a trivial task to patch OFED to give non-root users access > to the /dev/infiniband/umad* devices, I certainly don't want to > provide tools to my users that create security holes in the fabric. IMO this would do that although I would phrase it slightly differently. -- Hal > -- > Michael Heinz > Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania > > -----Original Message----- > From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > Sent: Monday, October 06, 2008 11:16 AM > To: Mike Heinz > Cc: Roland Dreier; general at lists.openfabrics.org > Subject: Re: [ofa-general] Allowing end-users to query for fabric > information > > Mike, > > On Mon, Oct 6, 2008 at 11:09 AM, Mike Heinz > wrote: >> Roland, >> >> I've been thinking about this some more and I have to say I'm still a >> bit confused. Are you saying that any root user on any node of the >> fabric can change the routing tables? Isn't the ability to access and >> alter subnet information controlled via the management key? > > There are two levels to this. First you must be able to send the MAD > and once that can happen the receiving SMA performs the usual MKey > checks which depend on the protection level assuming it is an SM class > MAD like the one to change the routing tables. > > -- Hal > >> >> >> -- >> Michael Heinz >> Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania >> >> -----Original Message----- >> From: general-bounces at lists.openfabrics.org >> [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Mike >> Heinz >> Sent: Monday, September 22, 2008 3:19 PM >> To: Roland Dreier >> Cc: general at lists.openfabrics.org >> Subject: RE: [ofa-general] Allowing end-users to query for fabric >> information >> >> Thanks for the explanation. 
>>
>>
>> --
>> Michael Heinz
>> Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania
>>
>> -----Original Message-----
>> From: Roland Dreier [mailto:rdreier at cisco.com]
>> Sent: Monday, September 22, 2008 3:18 PM
>> To: Mike Heinz
>> Cc: general at lists.openfabrics.org
>> Subject: Re: [ofa-general] Allowing end-users to query for fabric
>> information
>>
>> > What was the reason for making this design choice? While I could
>> > certainly provide boot scripts to change the permissions to
>> > /dev/infiniband/umad*, I'd rather understand why the decision was
>> > made to restrict access.
>>
>> because /dev/infiniband/umadX allows full unfiltered access to
>> send/receive any MADs. Including changing routing tables, bringing
>> ports down, etc. Not stuff that unprivileged users should be able to
>> do.
>>
>> It would make sense to have a higher-level interface that only allows
>> safe queries without side effects, but that's quite a bit more work
>> than just changing permissions on device nodes.
>>
>> - R.
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit
>> http://openib.org/mailman/listinfo/openib-general
>>
>

From rdreier at cisco.com Wed Oct 8 11:26:39 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 08 Oct 2008 11:26:39 -0700
Subject: [ofa-general] Allowing end-users to query for fabric information
In-Reply-To: (Mike Heinz's message of "Wed, 8 Oct 2008 12:45:02 -0500")
References:
Message-ID:

> Well, all that I want the tool to do is collect information; I don't
> want users to be able to modify the SM settings - but you have to admit,
> it could be useful to someone trying to tune their fabric to be able
> (for example) get a report on the length of the paths between any two
> nodes.

I can think of several ways to implement this that do not require making
the umad device files accessible to all users. This really is not
complicated stuff, and I'm not sure why you're so fixated on giving raw
access to MADs to all users. You could:

 - implement a kernel-level service that exposes a high-level set of
   safe queries, which can be made available to all users.

 - you could install your tools as SUID binaries that allow ordinary
   users to run them with the elevated privileges required for MAD
   access; of course this requires some care to be taken in how you
   implement the query tools, so that they don't allow arbitrary
   privilege escalation due to security bugs.

 - you could create a new group, eg "ibadmin" and have the umad files
   owned by this group with permissions 0660. Then users that need
   access to this tool could be added to the ibadmin group.
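
The group-based option can be sketched roughly as follows. This is an illustration only, written in Python and assuming it runs as root after the group already exists; the function name restrict_to_group is invented for the example, and only the /dev/infiniband/umad* path, the "ibadmin" group name and the 0660 mode come from the suggestion above.

```python
import glob
import os


def restrict_to_group(gid, pattern="/dev/infiniband/umad*", mode=0o660):
    """Give group `gid` ownership of every node matching `pattern` and set
    its permission bits to `mode`. Returns the paths that were changed."""
    changed = []
    for path in sorted(glob.glob(pattern)):
        os.chown(path, -1, gid)  # -1 keeps the current owner (normally root)
        os.chmod(path, mode)
        changed.append(path)
    return changed
```

A typical call would be restrict_to_group(grp.getgrnam("ibadmin").gr_gid), using the standard grp module to resolve the group name. Device nodes are usually recreated at boot or module load, so a persistent setup would more likely be done through the distribution's device manager; that part is not shown here.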
> So, the critical issue is that some SMs don't implement the M-Key > properly? Or administrators have not enabled M_Keys with their SM config. Or they don't want to open the possibility of a user DOS-ing the SM (via a flood of MADs sent through the raw access file) and waiting for the M_Key timeout to take over the fabric. Or... The way I look at the permissions of the umad files is that it is just common sense to restrict unfiltered access to sending and receiving MADs. This is analogous to restrictions on binding to ports below 1024 or sniffing packets that traditionally exist. - R. From michael.heinz at qlogic.com Wed Oct 8 11:49:06 2008 From: michael.heinz at qlogic.com (Mike Heinz) Date: Wed, 8 Oct 2008 13:49:06 -0500 Subject: [ofa-general] Allowing end-users to query for fabric information In-Reply-To: References: Message-ID: I'm aware there are several approaches to solving the problem. My "fixation" is in making sure I understand why there is an apparent security hole in the fabric management. See, my "root" problem here is that y'all are telling me that if a user gains root access to a single node on the fabric, they can use that node to undermine the entire fabric. -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania -----Original Message----- From: Roland Dreier [mailto:rdreier at cisco.com] Sent: Wednesday, October 08, 2008 2:27 PM To: Mike Heinz Cc: Hal Rosenstock; general at lists.openfabrics.org Subject: Re: [ofa-general] Allowing end-users to query for fabric information > Well, all that I want the tool to do is collect information; I don't > want users to be able to modify the SM settings - but you have to admit, > it could be useful to someone trying to tune their fabric to be able > (for example) get a report on the length of the paths between any two > nodes. I can think of several ways to implement this that do not require making the umad device files accessible to all users. 
This really is not complicated stuff, and I'm not sure why you're so fixated on giving raw access to MADs to all users. You could: - implement a kernel-level service that exposes a high-level set of safe queries, which can be made available to all users. - you could install your tools as SUID binaries that allow ordinary users to run them with the elevated privileged required for MAD access; of course this requires some care to be taken in how you implement the query tools, so that they don't allow arbitrary privilege escalation due to security bugs. - you could create a new group, eg "ibadmin" and have the umad files owned by this group with permissions 0660. Then users that need access to this tool could be added to the ibadmin group. > So, the critical issue is that some SMs don't implement the M-Key > properly? Or administrators have not enabled M_Keys with their SM config. Or they don't want to open the possibility of a user DOS-ing the SM (via a flood of MADs sent through the raw access file) and waiting for the M_Key timeout to take over the fabric. Or... The way I look at the permissions of the umad files is that it is just common sense to restrict unfiltered access to sending and receiving MADs. This is analogous to restrictions on binding to ports below 1024 or sniffing packets that traditionally exist. - R. From rdreier at cisco.com Wed Oct 8 11:49:48 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Oct 2008 11:49:48 -0700 Subject: [ofa-general] [Fwd: [PATCH] ib: release locks in the proper order] In-Reply-To: <1223481379.11102.241.camel@firewall.xsintricity.com> (Doug Ledford's message of "Wed, 08 Oct 2008 11:56:19 -0400") References: <1223481379.11102.241.camel@firewall.xsintricity.com> Message-ID: > Forwarding a patch written by one of our real time kernel guys. Is there some reason why sending the patch himself is too hard? > RT is very sensitive to the order locks are taken and released > wrt read write locks. 
We must do > > lock(a); > lock(b); > lock(c); > > [...] > > unlock(c); > unlock(b); > unlock(a); > > otherwise bad things can happen. Maybe I'm being dense but what bad things are fixed by this patch? I can't even see a theoretical issue with the code as is. This change looks very much like fiddling for no good reason -- has a real problem been seen with this code? - R. From rdreier at cisco.com Wed Oct 8 11:51:38 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Oct 2008 11:51:38 -0700 Subject: [ofa-general] Allowing end-users to query for fabric information In-Reply-To: (Mike Heinz's message of "Wed, 8 Oct 2008 13:49:06 -0500") References: Message-ID: > See, my "root" problem here is that y'all are telling me that if a user > gains root access to a single node on the fabric, they can use that node > to undermine the entire fabric. Why is that surprising? - R. From michael.heinz at qlogic.com Wed Oct 8 11:55:25 2008 From: michael.heinz at qlogic.com (Mike Heinz) Date: Wed, 8 Oct 2008 13:55:25 -0500 Subject: [ofa-general] Allowing end-users to query for fabric information In-Reply-To: References: Message-ID: It's comparable to saying that a single machine on the company net can subvert DNS. I know, technically that's true - but people also spend a great deal of effort hardening DNS against attack. -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania -----Original Message----- From: Roland Dreier [mailto:rdreier at cisco.com] Sent: Wednesday, October 08, 2008 2:52 PM To: Mike Heinz Cc: Hal Rosenstock; general at lists.openfabrics.org Subject: Re: [ofa-general] Allowing end-users to query for fabric information > See, my "root" problem here is that y'all are telling me that if a user > gains root access to a single node on the fabric, they can use that node > to undermine the entire fabric. Why is that surprising? - R. 
From rdreier at cisco.com Wed Oct 8 13:07:14 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Oct 2008 13:07:14 -0700 Subject: [ofa-general] Allowing end-users to query for fabric information In-Reply-To: (Mike Heinz's message of "Wed, 8 Oct 2008 13:55:25 -0500") References: Message-ID: > It's comparable to saying that a single machine on the company net can > subvert DNS. Just think about all the things a malicious host can do on an IB fabric. For example, a malicious SMA could send an unending stream of traps to the SM, or consume huge SM resources by faking an ever-changing virtual topology, or just report a GID that collides with another port on the fabric. And I'm sure there are other things you can think of if you try to get really nasty. - R. From jgunthorpe at obsidianresearch.com Wed Oct 8 13:44:36 2008 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Wed, 8 Oct 2008 14:44:36 -0600 Subject: [ofa-general] Allowing end-users to query for fabric information In-Reply-To: References: Message-ID: <20081008204436.GB26851@obsidianresearch.com> On Wed, Oct 08, 2008 at 01:07:14PM -0700, Roland Dreier wrote: > > It's comparable to saying that a single machine on the company net can > > subvert DNS. > > Just think about all the things a malicious host can do on an IB fabric. > For example, a malicious SMA could send an unending stream of traps to > the SM, or consume huge SM resources by faking an ever-changing virtual > topology, or just report a GID that collides with another port on the > fabric. And I'm sure there are other things you can think of if you try > to get really nasty. Right, it is quite similar to the problems with ethernet spanning tree protocol, linux prevents unprivileged processed from sending spanning tree packets too. 
I expect as IB matures we will get the same kinds of protections we see in ethernet, namely switch ports marked as untrusted have many restrictions placed on them, like single CA only, no outgoing SMPs except to the SM, etc. GMPs face a similar problem, except a little worse, any process can create a UD QP and send a GMP to QP1 on another node. You can mess with performance management, multicast registrations, service registrations, etc like this. Jason From rdreier at cisco.com Wed Oct 8 14:43:54 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Oct 2008 14:43:54 -0700 Subject: [ofa-general] Re: [PATCH] RDMA/nes: correct error_module bit mask In-Reply-To: <200810080054.m980sYXi029769@velma.neteffect.com> (Chien Tung's message of "Tue, 7 Oct 2008 19:54:34 -0500") References: <200810080054.m980sYXi029769@velma.neteffect.com> Message-ID: thanks, applied From srostedt at redhat.com Wed Oct 8 11:59:14 2008 From: srostedt at redhat.com (Steven Rostedt) Date: Wed, 08 Oct 2008 14:59:14 -0400 Subject: [ofa-general] [Fwd: [PATCH] ib: release locks in the proper order] In-Reply-To: References: <1223481379.11102.241.camel@firewall.xsintricity.com> Message-ID: <48ED0302.4010704@redhat.com> Roland Dreier wrote: > > Forwarding a patch written by one of our real time kernel guys. > > Is there some reason why sending the patch himself is too hard? > > > RT is very sensitive to the order locks are taken and released > > wrt read write locks. We must do > > > > lock(a); > > lock(b); > > lock(c); > > > > [...] > > > > unlock(c); > > unlock(b); > > unlock(a); > > > > otherwise bad things can happen. > > Maybe I'm being dense but what bad things are fixed by this patch? I > can't even see a theoretical issue with the code as is. This change > looks very much like fiddling for no good reason -- has a real problem > been seen with this code? > No problem upstream (or mainline for that matter). 
And with my latest email back and forth with Linus, it may be best to
just drop it from going upstream.

The problem arose with RT. RT converts spin_locks and rwlocks as well as
rwsems into priority inheritance mutexes. With rwlocks and rwsems it
becomes a bit more complex, since they can have multiple owners. To
accomplish this, the tasks have an array field of all reader locks
(rwlocks or sems) that they hold. But the unlock expected the last taken
lock to be released, to keep the array clean (just decrement the
length).

I have just finished testing a patch that allows for this array to have
holes in it (and unnested unlocking order), but until it is in, we need
this patch for RT.

-- Steve

From cameron at harr.org Wed Oct 8 15:30:33 2008
From: cameron at harr.org (Cameron Harr)
Date: Wed, 08 Oct 2008 16:30:33 -0600
Subject: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <48ECEA4D.7080504@harr.org>
References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org>
Message-ID: <48ED3489.4030905@harr.org>

Cameron Harr wrote:
>>>
>>> Also a little disconcerting is that my average request size on the
>>> target has gotten larger. I'm always writing 512B packets, and when
>>> I run on one initiator, the average reqsz is around 600-800B. When I
>>> add an initiator, the average reqsz basically doubles and is now
>>> around 1200 - 1600B. I'm specifying direct IO in the test and scst
>>> is configured as blockio (and thus direct IO), but it appears
>>> something is cached at some point and seems to be coalesced when
>>> another initiator is involved. Does this seem odd or normal?
This >>> shows true whether the initiators are writing to different >>> partitions on the same LUN or the same LUN with no partitions. I've been doing some testing trying to determine why my average req sz is bloated beyond the 512B packets I'm sending. It appears to me to be caused by heavy utilization of the middleware: SRPT or SCST. As I add processes on an initiator, the ave req sz goes up, and really jumps when I have more than 2 processes (running on 1 or 2 initiators) or if I'm writing to the same target LUN. My hunch is that the calculation of the ave req sz over a 1s interval is skewed due to some requests having to wait for either the IB layer or the SCST layer. Thinking that perhaps the srpt_thread was a cause, I turned off threading there, but that caused the packet sizing to be much more wild - never dropping to 512B and growing to as much as 4KB. Using the default deadline scheduler as opposed to the default cfq scheduler didn't seem to make a difference. Cameron From chu11 at llnl.gov Wed Oct 8 15:45:18 2008 From: chu11 at llnl.gov (Al Chu) Date: Wed, 08 Oct 2008 15:45:18 -0700 Subject: [ofa-general] [infiniband-diags] add --loop_ports option to perfquery In-Reply-To: References: <1223420045.1197.117.camel@cardanus.llnl.gov> <1223461589.8503.19.camel@whatsup> Message-ID: <1223505919.1197.140.camel@cardanus.llnl.gov> Hey Hal, On Wed, 2008-10-08 at 09:44 -0400, Hal Rosenstock wrote: > Hi Al, > > On Wed, Oct 8, 2008 at 6:26 AM, Al Chu wrote: > > Hey Hal, > > > > On Wed, 2008-10-08 at 07:03 -0400, Hal Rosenstock wrote: > >> Al, > >> > >> On Tue, Oct 7, 2008 at 6:54 PM, Al Chu wrote: > >> > Hey Sasha, > >> > > >> > We have a switch here that does not report the AllPortSelect flag as a > >> > capability. It's pretty annoying typing each port on the switch or > >> > always having to script around this one oddball switch we have. So I > >> > added an option --loop_ports for perfquery. 
If you want to do something
> >> > to all the ports on the CA/Switch, but AllPortSelect isn't available, it
> >> > loops through all the available ports instead.
> >>
> >> Why not add simulated AllPortSelect for multiple ports rather than add
> >> another perfquery option for this ?
> >
> > I did try that, and it did seem to work for the switches we had. But
> > when I read the IB spec, it said something to the effect that if a
> > system doesn't support AllPortSelect, setting the PortSelect field to
> > 0xFF was undefined behavior.
>
> I was suggesting that the emulation support (when AllPortSelect is not
> supported) be enhanced for multiple ports and work on both CAs and all
> switches. The one difference is one response for AllPortSelect
> (whether emulated or not) v. many responses for port loop.

Oh. I thought you were referring to the workaround "simulation" that
was in the original code. But you're referring to aggregating the
data/output to make it look like AllPortSelect was supported. I'll put
this on the TODO.

>
> >> > There was already a workaround in the tool for a CA that did not support
> >> > the AllPortSelect flag. I get the feeling the workaround may have been
> >> > for a specific hardware, so I kept the workaround in there.
> >>
> >> > Al
> >> >
> >> > --
> >> > Albert Chu
> >> > chu11 at llnl.gov
> >> > Computer Scientist
> >> > High Performance Systems Division
> >> > Lawrence Livermore National Laboratory
> >> >
> >>
> >> There are also 2 for loops which are not correct for some switches:
> >> for (i = 1; i <= num_ports; i++)
> >
> > I guess I've never seen a switch that doesn't go from 1 to num_ports.
> > Is there something else I need to handle?
> > Yes, per the spec, enhanced SP0 supports PortCounters. All your > switches likely support AllPortSelect so it's not an issue there. Ok I see now. Wasn't aware of it. I'll get a patch together. Thanks, Al > -- Hal > > > Al > > > >> -- Hal > >> > > -- > > Albert Chu > > chu11 at llnl.gov > > Computer Scientist > > High Performance Systems Division > > Lawrence Livermore National Laboratory > > > > > -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From chu11 at llnl.gov Wed Oct 8 17:40:50 2008 From: chu11 at llnl.gov (Al Chu) Date: Wed, 08 Oct 2008 17:40:50 -0700 Subject: [ofa-general] [infiniband-diags] specify -l(loop_ports) in ibclearerrors and ibclearcounters Message-ID: <1223512850.1197.154.camel@cardanus.llnl.gov> Hey Sasha, Specifies the -l option in these respective scripts when they call perfquery to clear counters/errors. This is the core of why I implemented the -l option in perfquery. These two tools were failing in a chunk of our switches. Note that these scripts specify -R (reset only) for perfquery, so there aren't any issues with multiple perfquery outputs that may need to be parsed multiple times/differently. Al -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-specify-loop_ports-when-resetting-counters-errors.patch Type: text/x-patch Size: 1399 bytes Desc: not available URL: From chu11 at llnl.gov Wed Oct 8 17:40:52 2008 From: chu11 at llnl.gov (Al Chu) Date: Wed, 08 Oct 2008 17:40:52 -0700 Subject: [ofa-general] [infiniband-diags] [trivial] fix comments in perfquery Message-ID: <1223512852.1197.156.camel@cardanus.llnl.gov> Hey Sasha, Dumb patch. I realized the comments weren't in the right place. 
Al

--
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0003-fix-perfquery-comment.patch
Type: text/x-patch
Size: 1220 bytes
Desc: not available
URL:

From chu11 at llnl.gov Wed Oct 8 17:40:51 2008
From: chu11 at llnl.gov (Al Chu)
Date: Wed, 08 Oct 2008 17:40:51 -0700
Subject: [ofa-general] [infiniband-diags] support ehanced port 0 with --loop-ports in perfquery
Message-ID: <1223512851.1197.155.camel@cardanus.llnl.gov>

Hey Sasha,

Fixes the enhanced port 0 issue Hal referred to in the previous thread.

Al

--
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-support-ehanced-port-0-with-loop_ports.patch
Type: text/x-patch
Size: 2047 bytes
Desc: not available
URL:

From hal.rosenstock at gmail.com Wed Oct 8 18:54:19 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 8 Oct 2008 21:54:19 -0400
Subject: ***SPAM*** Re: [ofa-general] [infiniband-diags] specify -l(loop_ports) in ibclearerrors and ibclearcounters
In-Reply-To: <1223512850.1197.154.camel@cardanus.llnl.gov>
References: <1223512850.1197.154.camel@cardanus.llnl.gov>
Message-ID:

Hi Al,

On Wed, Oct 8, 2008 at 8:40 PM, Al Chu wrote:
> Hey Sasha,
>
> Specifies the -l option in these respective scripts when they call
> perfquery to clear counters/errors. This is the core of why I
> implemented the -l option in perfquery. These two tools were failing in
> a chunk of our switches.
>
> Note that these scripts specify -R (reset only) for perfquery, so there
> aren't any issues with multiple perfquery outputs that may need to be
> parsed multiple times/differently.

But this takes many more resets to work, right? Why not have loop ports
(for reset) do this only if all ports is not supported?
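
The fallback being asked about, together with the enhanced port 0 case from the previous thread, could look something like the sketch below. It is written in Python purely for illustration and is not the perfquery code; the function name and its boolean parameters are assumptions standing in for whatever capability checks the real tool performs. Only the 0xFF PortSelect value and the 1..num_ports loop come from the discussion.

```python
ALL_PORT_SELECT = 0xFF  # PortSelect value that addresses all ports at once


def ports_to_reset(num_ports, supports_all_port_select, enhanced_sp0=False):
    """Return the PortSelect values to send a counter reset to.

    One request is enough when the device reports AllPortSelect. Otherwise
    fall back to one request per port, starting at port 0 when the switch
    has an enhanced switch port 0 and at port 1 when it does not.
    """
    if supports_all_port_select:
        return [ALL_PORT_SELECT]
    first_port = 0 if enhanced_sp0 else 1
    return list(range(first_port, num_ports + 1))
```

With this shape a 24-port switch that supports AllPortSelect gets a single reset, and only the oddball switches pay for one reset per port.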
-- Hal > > Al > > -- > Albert Chu > chu11 at llnl.gov > Computer Scientist > High Performance Systems Division > Lawrence Livermore National Laboratory > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From rdreier at cisco.com Wed Oct 8 20:09:24 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Oct 2008 20:09:24 -0700 Subject: [ofa-general] Re: [PATCH] IB/mlx4: Set RLKEY bit in QP context In-Reply-To: <48EB8433.1020108@mellanox.co.il> (Vladimir Sokolovsky's message of "Tue, 07 Oct 2008 17:45:55 +0200") References: <48EB8433.1020108@mellanox.co.il> Message-ID: thanks, applied. From rdreier at cisco.com Wed Oct 8 20:13:15 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Oct 2008 20:13:15 -0700 Subject: [ofa-general] [Fwd: [PATCH] ib: release locks in the proper order] In-Reply-To: <48ED0302.4010704@redhat.com> (Steven Rostedt's message of "Wed, 08 Oct 2008 14:59:14 -0400") References: <1223481379.11102.241.camel@firewall.xsintricity.com> <48ED0302.4010704@redhat.com> Message-ID: > The problem arised with RT. RT converts spin_locks and rwlocks as > well as rwsems into priority inheritance mutexes. With rwlocks and > rwsems it becomes a bit more complex, since they can have multiple > owners. To accomplish this, the tasks have an array field of all > reader locks (rwlocks or sems) that they hold. But the unlock expected > the last taken lock to be released, to keep the array clean (just > decrement the length). I see -- basically this is an internal implementation quirk of how RT handles rwsems. Sort of like the way spin_lock_irqsave() and spin_unlock_irqrestore() used to have to be in the same function, with a local flags variable, because of strange details of the sparc architecture. 
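
The quirk can be pictured with a small toy model. This is not the RT kernel implementation, just a Python sketch of the idea as described in this thread; the class and method names are invented, and the strict variant raises an error only to make visible the point where the real array would presumably become inconsistent.

```python
class TaskReaderLocks:
    """Toy model of a per-task array of held reader locks."""

    def __init__(self):
        self.held = []

    def lock(self, name):
        self.held.append(name)

    def unlock_lifo_only(self, name):
        """Original scheme: releasing just shortens the array by one, so
        only the most recently taken lock may be released."""
        if not self.held or self.held[-1] != name:
            raise RuntimeError(f"{name} is not the last lock taken")
        self.held.pop()

    def unlock_any_order(self, name):
        """Relaxed scheme: an out-of-order release leaves a hole, and
        trailing holes are trimmed so the array shrinks again."""
        for i in range(len(self.held) - 1, -1, -1):
            if self.held[i] == name:
                self.held[i] = None
                while self.held and self.held[-1] is None:
                    self.held.pop()
                return
        raise RuntimeError(f"{name} is not held")
```

Taking a, b, c and releasing c, b, a works under either scheme; releasing a first only works under the second one, which is why the unlock order in the driver mattered for RT.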
I'm actually OK with applying this patch because of that implementation quirk, assuming that it helps with the RT tree, and assuming someone is actually using this codepath with the RT patch applied and getting bitten by this in practice. But I'm even more OK with dropping the patch if you're going to fix this RT quirk anyway ;) So let me know what you think. From srostedt at redhat.com Wed Oct 8 20:20:16 2008 From: srostedt at redhat.com (Steven Rostedt) Date: Wed, 08 Oct 2008 23:20:16 -0400 Subject: [ofa-general] [Fwd: [PATCH] ib: release locks in the proper order] In-Reply-To: References: <1223481379.11102.241.camel@firewall.xsintricity.com> <48ED0302.4010704@redhat.com> Message-ID: <48ED7870.6090401@redhat.com> Roland Dreier wrote: > > The problem arised with RT. RT converts spin_locks and rwlocks as > > well as rwsems into priority inheritance mutexes. With rwlocks and > > rwsems it becomes a bit more complex, since they can have multiple > > owners. To accomplish this, the tasks have an array field of all > > reader locks (rwlocks or sems) that they hold. But the unlock expected > > the last taken lock to be released, to keep the array clean (just > > decrement the length). > > I see -- basically this is an internal implementation quirk of how RT > handles rwsems. Sort of like the way spin_lock_irqsave() and > spin_unlock_irqrestore() used to have to be in the same function, with a > local flags variable, because of strange details of the sparc architecture. > > I'm actually OK with applying this patch because of that implementation > quirk, assuming that it helps with the RT tree, and assuming someone is > actually using this codepath with the RT patch applied and getting > bitten by this in practice. But I'm even more OK with dropping the > patch if you're going to fix this RT quirk anyway ;) > We found this quirk from someone reporting hitting it in the ib driver ;-) but ... > So let me know what you think. > I just fixed the quirk. You can drop the patch. 
Thanks, -- Steve From locore64 at alkyltechnology.com Thu Oct 9 02:14:13 2008 From: locore64 at alkyltechnology.com (Toru Nishimura) Date: Thu, 9 Oct 2008 18:14:13 +0900 Subject: [ofa-general] ***SPAM*** PPC64 autobuild machine Message-ID: Hi, OpenFabrics guys, I'm gathering info about PPC and OFED combination. My understanding is that OFED autobuild system is generating PPC64 binaries, in possibly self (native) build way. What type of PPC64 machine is used for that? Machine model name and configuration are wanted. Thanks in advance. Toru Nishimura / ALKYL Technology From vlad at lists.openfabrics.org Thu Oct 9 03:23:46 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 9 Oct 2008 03:23:46 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20081009-0200 daily build status Message-ID: <20081009102346.D0BF3E60E3B@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on
x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: Build failed on ppc64 with linux-2.6.24 Log: /home/vlad/tmp/ofa_1_4_kernel-20081009-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c: In function 'ehca_poll_eqs': /home/vlad/tmp/ofa_1_4_kernel-20081009-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:942: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type /home/vlad/tmp/ofa_1_4_kernel-20081009-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:946: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081009-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081009-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081009-0200_linux-2.6.24_ppc64_check/drivers/infiniband] Error 2 
make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081009-0200_linux-2.6.24_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.24' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From hal.rosenstock at gmail.com Thu Oct 9 05:37:51 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 9 Oct 2008 08:37:51 -0400 Subject: [ofa-general] [infiniband-diags] add --loop_ports option to perfquery In-Reply-To: <1223505919.1197.140.camel@cardanus.llnl.gov> References: <1223420045.1197.117.camel@cardanus.llnl.gov> <1223461589.8503.19.camel@whatsup> <1223505919.1197.140.camel@cardanus.llnl.gov> Message-ID: Hi again Al, On Wed, Oct 8, 2008 at 6:45 PM, Al Chu wrote: > Hey Hal, > > On Wed, 2008-10-08 at 09:44 -0400, Hal Rosenstock wrote: >> Hi Al, >> >> On Wed, Oct 8, 2008 at 6:26 AM, Al Chu wrote: >> > Hey Hal, >> > >> > On Wed, 2008-10-08 at 07:03 -0400, Hal Rosenstock wrote: >> >> Al, >> >> >> >> On Tue, Oct 7, 2008 at 6:54 PM, Al Chu wrote: >> >> > Hey Sasha, >> >> > >> >> > We have a switch here that does not report the AllPortSelect flag as a >> >> > capability. It's pretty annoying typing each port on the switch or >> >> > always having to script around this one oddball switch we have. So I >> >> > added an option --loop_ports for perfquery. If you want to do something >> >> > to all the ports on the CA/Switch, but AllPortSelect isn't available, it >> >> > loops through all the available ports instead. >> >> >> >> Why not add simulated AllPortSelect for multiple ports rather than add >> >> another perquery option for this ? >> > >> > I did try that, and it did seem to work for the switches we had. But >> > when I read the IB spec, it said something to the affect that if a >> > system doesn't support AllPortSelect, setting the PortSelect field to >> > 0xFF was undefined behavior. 
>> >> I was suggesting that the emulation support (when AllPortSelect is not >> supported) be enhanced for multiple ports and work on both CAs and all >> switches. The one difference is one response for AllPortSelect >> (whether emulated or not) v. many responses for port loop. > > Oh. I thought you were referring the the workaround "simulation" that > was in the original code. But you're referring to aggregating the > data/output make it look like AllPortSelect was supported. I'll put > this on the TODO. So it seems that the reason for adding an additional option for this is that the lack of this support ? Are there any other uses ? -- Hal >> >> >> > There was already a workaround in the tool for a CA that did not support >> >> > the AllPortSelect flag. I get the feeling the workaround may have been >> >> > for a specific hardware, so I kept the workaround in there. >> >> >> >> > Al >> >> > >> >> > -- >> >> > Albert Chu >> >> > chu11 at llnl.gov >> >> > Computer Scientist >> >> > High Performance Systems Division >> >> > Lawrence Livermore National Laboratory >> >> > >> >> > _______________________________________________ >> >> > general mailing list >> >> > general at lists.openfabrics.org >> >> > http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> > >> >> > To unsubscribe, please visit http:// openib.org/mailman/listinfo/openib-general >> >> > >> >> >> >> There are also 2 for loops which are not correct for some switches: >> >> for (i = 1; i <= num_ports; i++) >> > >> > I guess I've never seen a switch that doesn't go from 1 to num_ports. >> > Is there something else I need to handle? >> >> Yes, per the spec, enhanced SP0 supports PortCounters. All your >> switches likely support AllPortSelect so it's not an issue there. > > Ok I see now. Wasn't aware of it. I'll get a patch together. 
> > Thanks, > Al > >> -- Hal >> >> > Al >> > >> >> -- Hal >> >> >> > -- >> > Albert Chu >> > chu11 at llnl.gov >> > Computer Scientist >> > High Performance Systems Division >> > Lawrence Livermore National Laboratory >> > >> > >> > -- > Albert Chu > chu11 at llnl.gov > Computer Scientist > High Performance Systems Division > Lawrence Livermore National Laboratory > > From chu11 at llnl.gov Thu Oct 9 09:17:39 2008 From: chu11 at llnl.gov (Al Chu) Date: Thu, 09 Oct 2008 09:17:39 -0700 Subject: [ofa-general] [infiniband-diags] specify -l(loop_ports) in ibclearerrors and ibclearcounters In-Reply-To: References: <1223512850.1197.154.camel@cardanus.llnl.gov> Message-ID: <1223569059.1197.163.camel@cardanus.llnl.gov> Hey Hal, On Wed, 2008-10-08 at 21:54 -0400, Hal Rosenstock wrote: > Hi Al, > > On Wed, Oct 8, 2008 at 8:40 PM, Al Chu wrote: > > Hey Sasha, > > > > Specifies the -l option in these respective scripts when they call > > perfquery to clear counters/errors. This is the core of why I > > implemented the -l option in perfquery. These two tools were failing in > > a chunk of our switches. > > > > Note that these scripts specify -R (reset only) for perfquery, so there > > aren't any issues with multiple perfquery outputs that may need to be > > parsed multiple times/differently. > > But this takes many more resets to work, right ? Why not have loop > ports (for reset) only do this is all ports is not supported ? Do you mean remove the --loop_ports option and loop the ports by default even if AllPortSelect isn't supported? I suppose that'd be fine. The reason I added an option was I just didn't want to change default behavior. But if everyone is happy just making it happen by default, I'm ok with that as well. 
Al > -- Hal > > > > > Al > > > > -- > > Albert Chu > > chu11 at llnl.gov > > Computer Scientist > > High Performance Systems Division > > Lawrence Livermore National Laboratory > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http:// openib.org/mailman/listinfo/openib-general > > > -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From chu11 at llnl.gov Thu Oct 9 09:25:34 2008 From: chu11 at llnl.gov (Al Chu) Date: Thu, 09 Oct 2008 09:25:34 -0700 Subject: [ofa-general] [infiniband-diags] add --loop_ports option to perfquery In-Reply-To: References: <1223420045.1197.117.camel@cardanus.llnl.gov> <1223461589.8503.19.camel@whatsup> <1223505919.1197.140.camel@cardanus.llnl.gov> Message-ID: <1223569534.1197.170.camel@cardanus.llnl.gov> Hey Hal, On Thu, 2008-10-09 at 08:37 -0400, Hal Rosenstock wrote: > Hi again Al, > > On Wed, Oct 8, 2008 at 6:45 PM, Al Chu wrote: > > Hey Hal, > > > > On Wed, 2008-10-08 at 09:44 -0400, Hal Rosenstock wrote: > >> Hi Al, > >> > >> On Wed, Oct 8, 2008 at 6:26 AM, Al Chu wrote: > >> > Hey Hal, > >> > > >> > On Wed, 2008-10-08 at 07:03 -0400, Hal Rosenstock wrote: > >> >> Al, > >> >> > >> >> On Tue, Oct 7, 2008 at 6:54 PM, Al Chu wrote: > >> >> > Hey Sasha, > >> >> > > >> >> > We have a switch here that does not report the AllPortSelect flag as a > >> >> > capability. It's pretty annoying typing each port on the switch or > >> >> > always having to script around this one oddball switch we have. So I > >> >> > added an option --loop_ports for perfquery. If you want to do something > >> >> > to all the ports on the CA/Switch, but AllPortSelect isn't available, it > >> >> > loops through all the available ports instead. 
> >> >> > >> >> Why not add simulated AllPortSelect for multiple ports rather than add > >> >> another perquery option for this ? > >> > > >> > I did try that, and it did seem to work for the switches we had. But > >> > when I read the IB spec, it said something to the affect that if a > >> > system doesn't support AllPortSelect, setting the PortSelect field to > >> > 0xFF was undefined behavior. > >> > >> I was suggesting that the emulation support (when AllPortSelect is not > >> supported) be enhanced for multiple ports and work on both CAs and all > >> switches. The one difference is one response for AllPortSelect > >> (whether emulated or not) v. many responses for port loop. > > > > Oh. I thought you were referring the the workaround "simulation" that > > was in the original code. But you're referring to aggregating the > > data/output make it look like AllPortSelect was supported. I'll put > > this on the TODO. > > So it seems that the reason for adding an additional option for this > is that the lack of this support ? Are there any other uses ? I made the --loop_ports option b/c I just didn't want to change the default behavior perfquery. But we could easily make it an automatic if the AllPortsSelect flag isn't supported. Thinking about it, there is a bit of a subtlety in the command line options and expectations. If a user inputs a port of '255', to me this means that the user wants to do an AllPortSelect and we should error out if AllPortSelect isn't supported. If a user inputs the '-a' option, it suggests that they want perfquery to operate on every port, suggesting that we could automatically loop if they don't input --loop_ports. Al > -- Hal > > >> > >> >> > There was already a workaround in the tool for a CA that did not support > >> >> > the AllPortSelect flag. I get the feeling the workaround may have been > >> >> > for a specific hardware, so I kept the workaround in there. 
> >> >> > >> >> > Al > >> >> > > >> >> > -- > >> >> > Albert Chu > >> >> > chu11 at llnl.gov > >> >> > Computer Scientist > >> >> > High Performance Systems Division > >> >> > Lawrence Livermore National Laboratory > >> >> > > >> >> > _______________________________________________ > >> >> > general mailing list > >> >> > general at lists.openfabrics.org > >> >> > http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> >> > > >> >> > To unsubscribe, please visit http:// openib.org/mailman/listinfo/openib-general > >> >> > > >> >> > >> >> There are also 2 for loops which are not correct for some switches: > >> >> for (i = 1; i <= num_ports; i++) > >> > > >> > I guess I've never seen a switch that doesn't go from 1 to num_ports. > >> > Is there something else I need to handle? > >> > >> Yes, per the spec, enhanced SP0 supports PortCounters. All your > >> switches likely support AllPortSelect so it's not an issue there. > > > > Ok I see now. Wasn't aware of it. I'll get a patch together. 
> > > > Thanks, > > Al > > > >> -- Hal > >> > >> > Al > >> > > >> >> -- Hal > >> >> > >> > -- > >> > Albert Chu > >> > chu11 at llnl.gov > >> > Computer Scientist > >> > High Performance Systems Division > >> > Lawrence Livermore National Laboratory > >> > > >> > > >> > > -- > > Albert Chu > > chu11 at llnl.gov > > Computer Scientist > > High Performance Systems Division > > Lawrence Livermore National Laboratory > > > > > -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From sashak at voltaire.com Thu Oct 9 09:30:00 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 9 Oct 2008 18:30:00 +0200 Subject: [ofa-general] ***SPAM*** ibdm network topology format In-Reply-To: References: <20080930121252.GA7396@sashak.voltaire.com> <829ded920810010207r475d82abu269d47cd3baddb3f@mail.gmail.com> <20081001203813.GL7396@sashak.voltaire.com> <20081002022430.GQ7396@sashak.voltaire.com> <20081002170033.GI25831@sashak.voltaire.com> <20081008003239.GI7563@sashak.voltaire.com> Message-ID: <20081009163000.GB4912@sashak.voltaire.com> Hi Hal, On 07:08 Wed 08 Oct , Hal Rosenstock wrote: > > The question was whether you are sure that class 0x81 is not being > registered with ibdiagnet ? Was this actually verified at some level > rather than code inspection of ibutils ? Just to be sure... Yes of course, umad2sim catches all application's ioctl()s... Sasha > >> That's the first thing to confirm or maybe you've already confirmed > >> this and it wasn't clear to me in what you wrote. If so, I have a > >> theory about what could be occuring. It may be the case that it is an > >> effect of the kernel MAD layer in that a MAD agent can send any class > >> and when using request/response it matches on transaction ID which > >> contains the MAD agent. Unsolicited messages on that other class > >> wouldn't get through though. I just ran a simple test of this and that > >> appears to be the case. 
> > > > This could explain the phenomena. And then it seems that similar > > mechanism should be implemented in umad2sim. > > Yes. > > -- Hal > > > Sasha > > From chu11 at llnl.gov Thu Oct 9 09:31:41 2008 From: chu11 at llnl.gov (Al Chu) Date: Thu, 09 Oct 2008 09:31:41 -0700 Subject: [ofa-general] [infiniband-diags] specify -l(loop_ports) in ibclearerrors and ibclearcounters In-Reply-To: References: <1223512850.1197.154.camel@cardanus.llnl.gov> Message-ID: <1223569901.1197.175.camel@cardanus.llnl.gov> Hey Hal, On Wed, 2008-10-08 at 21:54 -0400, Hal Rosenstock wrote: > Hi Al, > > On Wed, Oct 8, 2008 at 8:40 PM, Al Chu wrote: > > Hey Sasha, > > > > Specifies the -l option in these respective scripts when they call > > perfquery to clear counters/errors. This is the core of why I > > implemented the -l option in perfquery. These two tools were failing in > > a chunk of our switches. > > > > Note that these scripts specify -R (reset only) for perfquery, so there > > aren't any issues with multiple perfquery outputs that may need to be > > parsed multiple times/differently. > > But this takes many more resets to work, right ? Why not have loop > ports (for reset) only do this is all ports is not supported ? I just re-read this sentence, now I get what you're asking. I made -- loop_ports only loop if AllPortSelect isn't supported. So if the switches already support AllPortSelect, the number of resets should be the same. 
Al > > -- Hal > > > > > Al > > > > -- > > Albert Chu > > chu11 at llnl.gov > > Computer Scientist > > High Performance Systems Division > > Lawrence Livermore National Laboratory > > > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit http:// openib.org/mailman/listinfo/openib-general > > > -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From sashak at voltaire.com Thu Oct 9 09:39:33 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 9 Oct 2008 18:39:33 +0200 Subject: [ofa-general] [PATCH 0/6] opensm: Unicast Routing Cache In-Reply-To: <48EA9ABA.6010509@dev.mellanox.co.il> References: <48E96928.8030200@dev.mellanox.co.il> <48EA9ABA.6010509@dev.mellanox.co.il> Message-ID: <20081009163933.GC4912@sashak.voltaire.com> Hi Yevgeny, On 01:09 Tue 07 Oct , Yevgeny Kliteynik wrote: > > Actually, I was thinking about something else: > Currently we have switch LFT implemented as osm_fwd_tbl_t. > I can remove the unnecessary complexity of the osm_fwd_tbl_t by replacing > it with a simple uint8_t array (same as LFT buffer). Then by simple > comparison I will check whether the recently calculated LFT > matches the switch's LFT, and if there is a match, then lft_buf > can be freed. In this case only the switches that have LFT different > from the recently calculated LFT will have both tables, which would be > rare and temporary - on the next heavy sweep the LFTs would match, and > lft_buf would be freed. > Effectively, it won't have memory penalty. > It can be done in a separate patch. Agree about separate patch. And would be really nice to have it in OFED 1.4 days. >> Are you sure all the memory allocation failures are handled properly >> within the routing cache code ? 
What I mean is that NULL is returned >> and does this always result in a caching not used/routing recalculated >> ? Also, in that case, should some log message be indicated rather than >> hiding this ? > > I will check it. I think Hal means the unchecked malloc() in the __cache_add_port() function. >> Nit: doc/current-routing.txt should also be updated for this feature. > > OK, separate patch. Agreed, it is needed too. Sasha From sashak at voltaire.com Thu Oct 9 09:44:22 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 9 Oct 2008 18:44:22 +0200 Subject: [ofa-general] [PATCH 0/6] opensm: Unicast Routing Cache In-Reply-To: References: <48E96928.8030200@dev.mellanox.co.il> <48EA9ABA.6010509@dev.mellanox.co.il> Message-ID: <20081009164422.GD4912@sashak.voltaire.com> On 09:22 Tue 07 Oct , Hal Rosenstock wrote: > > > > Actually, I was thinking about something else: > > Currently we have switch LFT implemented as osm_fwd_tbl_t. > > I can remove the unnecessary complexity of the osm_fwd_tbl_t by replacing > > it with a simple uint8_t array (same as LFT buffer). Then by simple > > comparison I will check whether the recently calculated LFT > > matches the switch's LFT, and if there is a match, then lft_buf > > can be freed. In this case only the switches that have LFT different > > from the recently calculated LFT will have both tables, which would be > > rare and temporary - on the next heavy sweep the LFTs would match, and > > lft_buf would be freed. > > Can the forwarding tables be removed ? How would paths be > calculated/walked end to end on an SA PathRecord/MultiPathRecord query > ? Would that then require query of the LFTs in the switches ? No. As far as I understand, the whole idea is to keep the LFT images in a raw buffer (as they really are), similar to lft_buf, instead of in osm_fwd_tbl_t. IMO this simplifies the code in general and makes the described optimization possible. 
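
The comparison Yevgeny describes in the quoted text could be sketched roughly as follows. This is an illustration only, under stated assumptions: `sketch_switch_t`, `sketch_lft_sync()` and `SKETCH_LFT_SIZE` are hypothetical names made up for this sketch and are not OpenSM structures or functions. Only the idea comes from the thread: keep the switch LFT as a raw uint8_t array, compare it with the newly calculated buffer, and free lft_buf when the two match.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_LFT_SIZE 64 /* hypothetical table size, for illustration only */

/* Hypothetical stand-in for a switch object; NOT the real osm_switch_t. */
typedef struct sketch_switch {
	uint8_t lft[SKETCH_LFT_SIZE];	/* LFT image as currently set on the switch */
	uint8_t *lft_buf;		/* newly calculated LFT, or NULL if none */
} sketch_switch_t;

/*
 * Compare the newly calculated LFT with the switch's current LFT.
 * Returns  0 if they match (lft_buf is freed and set to NULL),
 *          1 if they differ (lft_buf is kept so it can still be written),
 *         -1 if there is no calculated table to compare.
 */
static int sketch_lft_sync(sketch_switch_t *sw)
{
	assert(sw != NULL);

	if (!sw->lft_buf)
		return -1;

	if (memcmp(sw->lft, sw->lft_buf, SKETCH_LFT_SIZE) != 0)
		return 1;

	/* Tables match: the second copy is no longer needed. */
	free(sw->lft_buf);
	sw->lft_buf = NULL;
	return 0;
}
```

With this shape, only switches whose tables differ keep both copies, which matches the "rare and temporary" memory cost described in the quoted text.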
Sasha From weiny2 at llnl.gov Thu Oct 9 09:57:10 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 9 Oct 2008 09:57:10 -0700 Subject: [ofa-general] [infiniband-diags] specify -l(loop_ports) in ibclearerrors and ibclearcounters In-Reply-To: <1223569059.1197.163.camel@cardanus.llnl.gov> References: <1223512850.1197.154.camel@cardanus.llnl.gov> <1223569059.1197.163.camel@cardanus.llnl.gov> Message-ID: <20081009095710.63a2a2a0.weiny2@llnl.gov> On Thu, 09 Oct 2008 09:17:39 -0700 Al Chu wrote: > Hey Hal, > > On Wed, 2008-10-08 at 21:54 -0400, Hal Rosenstock wrote: > > Hi Al, > > > > On Wed, Oct 8, 2008 at 8:40 PM, Al Chu wrote: > > > > > > Note that these scripts specify -R (reset only) for perfquery, so there > > > aren't any issues with multiple perfquery outputs that may need to be > > > parsed multiple times/differently. > > > > But this takes many more resets to work, right ? Why not have loop > > ports (for reset) only do this is all ports is not supported ? > > Do you mean remove the --loop_ports option and loop the ports by default > even if AllPortSelect isn't supported? > > I suppose that'd be fine. The reason I added an option was I just > didn't want to change default behavior. But if everyone is happy just > making it happen by default, I'm ok with that as well. > I vote yes. Have the port loop happen "under the covers" if the user specifies '-a'; this includes summing the values, just as AllPortSelect would in the switch. If the user specifies "0xff", I think that would be a good way to request that specific functionality and (as per Al's other email) return an error if AllPortSelect is not supported. 
Ira From sashak at voltaire.com Thu Oct 9 10:10:00 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 9 Oct 2008 19:10:00 +0200 Subject: [ofa-general] Re: [PATCH 5/6] opensm/Unicast Routing Cache: integrate cache into opensm In-Reply-To: <48E969FE.7050507@dev.mellanox.co.il> References: <48E969FE.7050507@dev.mellanox.co.il> Message-ID: <20081009171000.GE4912@sashak.voltaire.com> Hi Yevgeny, On 03:29 Mon 06 Oct , Yevgeny Kliteynik wrote: [snip...] > @@ -818,27 +826,37 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * const p_mgr) > /* > If there are no switches in the subnet, we are done. > */ > - if (cl_qmap_count(p_sw_guid_tbl) == 0 || > - ucast_mgr_setup_all_switches(p_mgr->p_subn) < 0) > + if (cl_qmap_count(p_sw_guid_tbl) == 0) > goto Exit; > > p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_NONE; > - while (p_routing_eng) { > - if (!ucast_mgr_route(p_routing_eng, p_osm)) > - break; > - p_routing_eng = p_routing_eng->next; > - } > + if (p_mgr->p_subn->opt.use_ucast_cache && > + osm_ucast_cache_is_valid(p_mgr->p_cache)) { > + OSM_LOG(p_mgr->p_log, OSM_LOG_INFO, > + "Configuring switch tables using cached routing\n"); > + osm_ucast_cache_apply(p_mgr->p_cache); > > - if (p_osm->routing_engine_used == OSM_ROUTING_ENGINE_TYPE_NONE) { > - /* If configured routing algorithm failed, use default MinHop */ > - osm_ucast_mgr_build_lid_matrices(p_mgr); > - ucast_mgr_build_lfts(p_mgr); > - p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_MINHOP; > - } > + } else { I think this will break some routing engines (such as LASH) and logging because p_osm->routing_engine_used is left as OSM_ROUTING_ENGINE_TYPE_NONE. 
And since we have cache validation calls in do_sweep() anyway: + if (sm->p_subn->opt.use_ucast_cache) + osm_ucast_cache_validate(sm->ucast_mgr.p_cache); Isn't it would be better to not touch osm_ucast_mgr_process() at all and instead to replace lines above by: if (!sm->p_subn->opt.use_ucast_cache || osm_ucast_cache_process(sm->ucast_mgr.p_cache)) This also saves couple of public calls in ucast cache. I think then the patch could look like below. Agreed? Sasha >From a8db7582785d9103c74be1cd9224a8825ab0963a Mon Sep 17 00:00:00 2001 From: Yevgeny Kliteynik Date: Mon, 6 Oct 2008 03:29:34 +0200 Subject: [PATCH] opensm/Unicast Routing Cache: integrate cache into opensm Integrating unicast cache into the discovery and ucast manager. Signed-off-by: Yevgeny Kliteynik Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_ucast_cache.h | 68 ++---------------------------- opensm/include/opensm/osm_ucast_mgr.h | 6 +++ opensm/opensm/osm_drop_mgr.c | 13 +++++- opensm/opensm/osm_node_info_rcv.c | 9 ++++- opensm/opensm/osm_port_info_rcv.c | 9 ++++- opensm/opensm/osm_state_mgr.c | 12 +++++- opensm/opensm/osm_ucast_cache.c | 42 ++++++++----------- opensm/opensm/osm_ucast_mgr.c | 11 +++++ 8 files changed, 77 insertions(+), 93 deletions(-) diff --git a/opensm/include/opensm/osm_ucast_cache.h b/opensm/include/opensm/osm_ucast_cache.h index 2dc1c4e..7f01876 100644 --- a/opensm/include/opensm/osm_ucast_cache.h +++ b/opensm/include/opensm/osm_ucast_cache.h @@ -198,38 +198,6 @@ osm_ucast_cache_invalidate(osm_ucast_cache_t * p_cache); * Unicast Cache object *********/ -/****f* OpenSM: Unicast Cache/osm_ucast_cache_validate -* NAME -* osm_ucast_cache_validate -* -* DESCRIPTION -* The osm_ucast_cache_validate function checks -* whether or not the cached routing can be applied -* to the current subnet switches. -* -* SYNOPSIS -*/ -void -osm_ucast_cache_validate(osm_ucast_cache_t * p_cache); -/* -* PARAMETERS -* p_cache -* [in] Pointer to the object to check. 
-* -* RETURN VALUE -* This function does not return any value. -* -* NOTES -* This function checks the current subnet and the -* cached links, and decides whether or not there -* is a need to re-run unicast routing engine. -* If the cached routing can't be applied to the -* current subnet switches as is, cache is invalidated. -* -* SEE ALSO -* Unicast Cache object -*********/ - /****f* OpenSM: Unicast Cache/osm_ucast_cache_mark_valid * NAME * osm_ucast_cache_mark_valid @@ -256,31 +224,6 @@ osm_ucast_cache_mark_valid(osm_ucast_cache_t * p_cache); * Unicast Cache object *********/ -/****f* OpenSM: Unicast Cache/osm_ucast_cache_is_valid -* NAME -* osm_ucast_cache_is_valid -* -* DESCRIPTION -* Check whether the unicast cache is valid. -* -* SYNOPSIS -*/ -boolean_t -osm_ucast_cache_is_valid(osm_ucast_cache_t * p_cache); -/* -* PARAMETERS -* p_cache -* [in] Pointer to the object to check. -* -* RETURN VALUE -* TRUE if the cache is valid, FALSE otherwise. -* -* NOTES -* -* SEE ALSO -* Unicast Cache object -*********/ - /****f* OpenSM: Unicast Cache/osm_ucast_cache_check_new_link * NAME * osm_ucast_cache_check_new_link @@ -406,25 +349,24 @@ osm_ucast_cache_add_node(osm_ucast_cache_t * p_cache, * Unicast Cache object *********/ -/****f* OpenSM: Unicast Cache/osm_ucast_cache_apply +/****f* OpenSM: Unicast Cache/osm_ucast_cache_process * NAME -* osm_ucast_cache_apply +* osm_ucast_cache_process * * DESCRIPTION -* The osm_ucast_cache_apply function writes the +* The osm_ucast_cache_process function writes the * cached unicast routing on the subnet switches. * * SYNOPSIS */ -void -osm_ucast_cache_apply(osm_ucast_cache_t * p_cache); +int osm_ucast_cache_process(osm_ucast_cache_t * p_cache); /* * PARAMETERS * p_cache * [in] Pointer to the cache object to be used. * * RETURN VALUE -* This function does not return any value. +* This function returns zero on sucess and non-zero value otherwise. 
* * NOTES * Iterates through all the subnet switches and writes diff --git a/opensm/include/opensm/osm_ucast_mgr.h b/opensm/include/opensm/osm_ucast_mgr.h index 27e89e9..e4006bb 100644 --- a/opensm/include/opensm/osm_ucast_mgr.h +++ b/opensm/include/opensm/osm_ucast_mgr.h @@ -49,6 +49,7 @@ #include #include #include +#include #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { @@ -77,6 +78,7 @@ BEGIN_C_DECLS * *********/ struct osm_sm; +struct _osm_ucast_cache; /****s* OpenSM: Unicast Manager/osm_ucast_mgr_t * NAME * osm_ucast_mgr_t @@ -97,6 +99,7 @@ typedef struct osm_ucast_mgr { cl_qlist_t port_order_list; boolean_t is_dor; boolean_t some_hop_count_set; + struct _osm_ucast_cache *p_cache; } osm_ucast_mgr_t; /* * FIELDS @@ -128,6 +131,9 @@ typedef struct osm_ucast_mgr { * tables calculation iteration cycle, set to TRUE to indicate * that some hop count changes were done. * +* p_cache +* Pointer to the Unicast Cache object. +* * SEE ALSO * Unicast Manager object *********/ diff --git a/opensm/opensm/osm_drop_mgr.c b/opensm/opensm/osm_drop_mgr.c index 8c6e7fb..5fc46a9 100644 --- a/opensm/opensm/osm_drop_mgr.c +++ b/opensm/opensm/osm_drop_mgr.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. 
* @@ -61,6 +61,7 @@ #include #include #include +#include /********************************************************************** **********************************************************************/ @@ -134,6 +135,13 @@ static void drop_mgr_clean_physp(osm_sm_t * sm, IN osm_physp_t * p_physp) (p_remote_physp->p_node)), p_remote_physp->port_num); + if (sm->p_subn->opt.use_ucast_cache) + osm_ucast_cache_add_link(sm->ucast_mgr.p_cache, + p_physp->p_node, + p_physp->port_num, + p_remote_physp->p_node, + p_remote_physp->port_num); + osm_physp_unlink(p_physp, p_remote_physp); } @@ -308,6 +316,9 @@ __osm_drop_mgr_process_node(osm_sm_t * sm, IN osm_node_t * p_node) "Unreachable node 0x%016" PRIx64 "\n", cl_ntoh64(osm_node_get_node_guid(p_node))); + if (sm->p_subn->opt.use_ucast_cache) + osm_ucast_cache_add_node(sm->ucast_mgr.p_cache, p_node); + /* Delete all the logical and physical port objects associated with this node. diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c index a37ce0a..94903b7 100644 --- a/opensm/opensm/osm_node_info_rcv.c +++ b/opensm/opensm/osm_node_info_rcv.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* * This software is available to you under a choice of one of two @@ -59,6 +59,7 @@ #include #include #include +#include static void report_duplicated_guid(IN osm_sm_t * sm, @@ -240,6 +241,12 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm, cl_ntoh64(osm_node_get_node_guid(p_node)), port_num, cl_ntoh64(p_ni_context->node_guid), p_ni_context->port_num); + if (sm->p_subn->opt.use_ucast_cache) + osm_ucast_cache_check_new_link(sm->ucast_mgr.p_cache, + p_node, port_num, + p_neighbor_node, + p_ni_context->port_num); + osm_node_link(p_node, port_num, p_neighbor_node, p_ni_context->port_num); diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c index 73afd8e..d8d2021 100644 --- a/opensm/opensm/osm_port_info_rcv.c +++ b/opensm/opensm/osm_port_info_rcv.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -60,6 +60,7 @@ #include #include #include +#include /********************************************************************** **********************************************************************/ @@ -244,6 +245,12 @@ __osm_pi_rcv_process_switch_port(IN osm_sm_t * sm, (p_remote_node)), remote_port_num); + if (sm->p_subn->opt.use_ucast_cache) + osm_ucast_cache_add_link(sm->ucast_mgr.p_cache, + p_node, port_num, + p_remote_node, + remote_port_num); + osm_node_unlink(p_node, (uint8_t) port_num, p_remote_node, (uint8_t) remote_port_num); diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index b4eb87b..530a705 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. 
- * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -1075,6 +1075,10 @@ static void do_sweep(osm_sm_t * sm) /* Re-program the switches fully */ sm->p_subn->ignore_existing_lfts = TRUE; + /* we want to re-route, so cache should be invalidated */ + if (sm->p_subn->opt.use_ucast_cache) + osm_ucast_cache_invalidate(sm->ucast_mgr.p_cache); + osm_ucast_mgr_process(&sm->ucast_mgr); /* Reset flag */ @@ -1229,7 +1233,11 @@ _repeat_discovery: /* * Proceed with unicast forwarding table configuration. */ - osm_ucast_mgr_process(&sm->ucast_mgr); + + if (!sm->p_subn->opt.use_ucast_cache || + osm_ucast_cache_process(sm->ucast_mgr.p_cache)) + osm_ucast_mgr_process(&sm->ucast_mgr); + if (wait_for_pending_transactions(&sm->p_subn->p_osm->stats)) return; diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c index 3c32d35..57dc0a0 100644 --- a/opensm/opensm/osm_ucast_cache.c +++ b/opensm/opensm/osm_ucast_cache.c @@ -479,16 +479,6 @@ osm_ucast_cache_mark_valid(osm_ucast_cache_t * p_cache) /********************************************************************** **********************************************************************/ -boolean_t -osm_ucast_cache_is_valid(osm_ucast_cache_t * p_cache) -{ - CL_ASSERT(p_cache && p_cache->p_ucast_mgr); - return p_cache->valid; -} - -/********************************************************************** - **********************************************************************/ - void osm_ucast_cache_invalidate(osm_ucast_cache_t * p_cache) { @@ -520,8 +510,7 @@ Exit: /********************************************************************** **********************************************************************/ -void -osm_ucast_cache_validate(osm_ucast_cache_t * p_cache) +static void 
ucast_cache_validate(osm_ucast_cache_t * p_cache) { cache_switch_t * p_cache_sw; cache_switch_t * p_remote_cache_sw; @@ -1150,24 +1139,27 @@ Exit: /********************************************************************** **********************************************************************/ - -void -osm_ucast_cache_apply(osm_ucast_cache_t * p_cache) +int osm_ucast_cache_process(osm_ucast_cache_t * p_cache) { - osm_subn_t * p_subn = p_cache->p_ucast_mgr->p_subn; - osm_switch_t *p_sw; + cl_qmap_t *tbl = &p_cache->p_ucast_mgr->p_subn->sw_guid_tbl; + cl_map_item_t *item; + + if (!p_cache->p_ucast_mgr->p_subn->opt.use_ucast_cache) + return 1; + + ucast_cache_validate(p_cache); + if (!p_cache->valid) + return 1; - OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, - "Applying unicast cache\n"); - CL_ASSERT(p_cache && p_cache->p_ucast_mgr && p_cache->valid); + "Configuring switch tables using cached routing\n"); - for (p_sw = (osm_switch_t *) cl_qmap_head(&p_subn->sw_guid_tbl); - p_sw != (osm_switch_t *) cl_qmap_end(&p_subn->sw_guid_tbl); - p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item)) - osm_ucast_mgr_set_fwd_table(p_cache->p_ucast_mgr, p_sw); + for (item = cl_qmap_head(tbl); item != cl_qmap_end(tbl); + item = cl_qmap_next(item)) + osm_ucast_mgr_set_fwd_table(p_cache->p_ucast_mgr, + (osm_switch_t *)item); - OSM_LOG_EXIT(p_cache->p_ucast_mgr->p_log); + return 0; } /********************************************************************** diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index 2dc5dd4..34eddd0 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -73,6 +73,8 @@ void osm_ucast_mgr_destroy(IN osm_ucast_mgr_t * const p_mgr) CL_ASSERT(p_mgr); OSM_LOG_ENTER(p_mgr->p_log); + if (p_mgr->p_cache) + osm_ucast_cache_destroy(p_mgr->p_cache); OSM_LOG_EXIT(p_mgr->p_log); } @@ -92,6 +94,12 @@ osm_ucast_mgr_init(IN osm_ucast_mgr_t * const p_mgr, IN osm_sm_t * sm) 
 	p_mgr->p_subn = sm->p_subn;
 	p_mgr->p_lock = sm->p_lock;

+	if (sm->p_subn->opt.use_ucast_cache){
+		p_mgr->p_cache = osm_ucast_cache_construct(p_mgr);
+		if (!p_mgr->p_cache)
+			status = IB_INSUFFICIENT_MEMORY;
+	}
+
 	OSM_LOG_EXIT(p_mgr->p_log);
 	return (status);
 }
@@ -840,6 +848,9 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * const p_mgr)
 		"%s tables configured on all switches\n",
 		osm_routing_engine_type_str(p_osm->routing_engine_used));

+	if (p_mgr->p_subn->opt.use_ucast_cache)
+		osm_ucast_cache_mark_valid(p_mgr->p_cache);
+
 Exit:
 	CL_PLOCK_RELEASE(p_mgr->p_lock);
 	OSM_LOG_EXIT(p_mgr->p_log);
-- 
1.6.0.1.196.g01914

From sashak at voltaire.com Thu Oct 9 10:11:03 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 9 Oct 2008 19:11:03 +0200
Subject: [ofa-general] Re: [PATCH 0/6] opensm: Unicast Routing Cache
In-Reply-To: <48E96928.8030200@dev.mellanox.co.il>
References: <48E96928.8030200@dev.mellanox.co.il>
Message-ID: <20081009171103.GF4912@sashak.voltaire.com>

Hi Evgeny,

On 03:26 Mon 06 Oct , Yevgeny Kliteynik wrote:
>
> The patches are:
> - patch 1/6: move lft_buf from ucast_mgr to osm_switch
> - patch 2/6: Add "-A" or "--ucast_cache" option to opensm
> - patch 3/6: adding osm_ucast_cache.{c,h} files (this is
>   the cache implementation itself)
> - patch 4/6: adding new cache files to makefile
> - patch 5/6: integrating unicast cache into the discovery
>   and ucast manager
> - patch 6/6: man entry for cached routing

Patches 1,2,4,6 look fine to me. Will comment on others.

Sasha From hal.rosenstock at gmail.com Thu Oct 9 10:13:21 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 9 Oct 2008 13:13:21 -0400 Subject: ***SPAM*** Re: [ofa-general] [infiniband-diags] add --loop_ports option to perfquery In-Reply-To: <1223569534.1197.170.camel@cardanus.llnl.gov> References: <1223420045.1197.117.camel@cardanus.llnl.gov> <1223461589.8503.19.camel@whatsup> <1223505919.1197.140.camel@cardanus.llnl.gov> <1223569534.1197.170.camel@cardanus.llnl.gov> Message-ID: Hi Al, On Thu, Oct 9, 2008 at 12:25 PM, Al Chu wrote: > Hey Hal, > > On Thu, 2008-10-09 at 08:37 -0400, Hal Rosenstock wrote: >> Hi again Al, >> >> On Wed, Oct 8, 2008 at 6:45 PM, Al Chu wrote: >> > Hey Hal, >> > >> > On Wed, 2008-10-08 at 09:44 -0400, Hal Rosenstock wrote: >> >> Hi Al, >> >> >> >> On Wed, Oct 8, 2008 at 6:26 AM, Al Chu wrote: >> >> > Hey Hal, >> >> > >> >> > On Wed, 2008-10-08 at 07:03 -0400, Hal Rosenstock wrote: >> >> >> Al, >> >> >> >> >> >> On Tue, Oct 7, 2008 at 6:54 PM, Al Chu wrote: >> >> >> > Hey Sasha, >> >> >> > >> >> >> > We have a switch here that does not report the AllPortSelect flag as a >> >> >> > capability. It's pretty annoying typing each port on the switch or >> >> >> > always having to script around this one oddball switch we have. So I >> >> >> > added an option --loop_ports for perfquery. If you want to do something >> >> >> > to all the ports on the CA/Switch, but AllPortSelect isn't available, it >> >> >> > loops through all the available ports instead. >> >> >> >> >> >> Why not add simulated AllPortSelect for multiple ports rather than add >> >> >> another perquery option for this ? >> >> > >> >> > I did try that, and it did seem to work for the switches we had. But >> >> > when I read the IB spec, it said something to the affect that if a >> >> > system doesn't support AllPortSelect, setting the PortSelect field to >> >> > 0xFF was undefined behavior. 
>> >> >> >> I was suggesting that the emulation support (when AllPortSelect is not >> >> supported) be enhanced for multiple ports and work on both CAs and all >> >> switches. The one difference is one response for AllPortSelect >> >> (whether emulated or not) v. many responses for port loop. >> > >> > Oh. I thought you were referring the the workaround "simulation" that >> > was in the original code. But you're referring to aggregating the >> > data/output make it look like AllPortSelect was supported. I'll put >> > this on the TODO. >> >> So it seems that the reason for adding an additional option for this >> is that the lack of this support ? Are there any other uses ? > > I made the --loop_ports option b/c I just didn't want to change the > default behavior perfquery. But we could easily make it an automatic if > the AllPortsSelect flag isn't supported. > > Thinking about it, there is a bit of a subtlety in the command line > options and expectations. > > If a user inputs a port of '255', to me this means that the user wants > to do an AllPortSelect and we should error out if AllPortSelect isn't > supported. > > If a user inputs the '-a' option, it suggests that they want perfquery > to operate on every port, suggesting that we could automatically loop if > they don't input --loop_ports. Yes, there is some redundancy now with port 255 and -a option (both do the same thing) and they could be made subtly different as you indicate. In the case of query rather than reset, we are still left with the question of whether to return 1 aggregated response or 1 response/port. Do we need to support both ? AllPortsSelect has this as aggregated. I'm also not sure that the loop ports option is needed. IMO all that was needed was to loop on the ports when all ports is not supported by the PMA and aggregate the counters and then nothing else needs to change. 
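A minimal sketch of the "loop on the ports and aggregate the counters" idea discussed above, in C. The struct, its three fields, and the function names are hypothetical stand-ins, not perfquery or libibmad code; saturating the sums at the field width is also an assumption about how an emulated AllPortSelect response should behave.

```c
#include <stdint.h>

/* Hypothetical per-port snapshot; a real PortCounters record has more fields. */
struct port_counters {
	uint32_t xmit_data;
	uint32_t rcv_data;
	uint16_t symbol_errors;
};

/* Add two 32-bit counters, sticking at the maximum instead of wrapping. */
static uint32_t sat_add32(uint32_t a, uint32_t b)
{
	uint32_t s = a + b;
	return s < a ? UINT32_MAX : s;
}

/* Add two 16-bit counters, sticking at the maximum instead of wrapping. */
static uint16_t sat_add16(uint16_t a, uint16_t b)
{
	uint32_t s = (uint32_t)a + b;
	return s > UINT16_MAX ? UINT16_MAX : (uint16_t)s;
}

/*
 * Emulate one aggregated response: sum the per-port samples
 * samples[first..last] (inclusive) into a single record.
 * samples[] is indexed by port number and is assumed to have been
 * filled by one PortCounters query per port.
 */
static struct port_counters
aggregate_ports(const struct port_counters *samples,
		unsigned first, unsigned last)
{
	struct port_counters total = { 0, 0, 0 };
	unsigned i;

	for (i = first; i <= last; i++) {
		total.xmit_data = sat_add32(total.xmit_data, samples[i].xmit_data);
		total.rcv_data = sat_add32(total.rcv_data, samples[i].rcv_data);
		total.symbol_errors =
		    sat_add16(total.symbol_errors, samples[i].symbol_errors);
	}
	return total;
}
```

Passing first = 0 would include switch port 0, which matters for switches with an enhanced SP0.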
-- Hal > Al > >> -- Hal >> >> >> >> >> >> > There was already a workaround in the tool for a CA that did not support >> >> >> > the AllPortSelect flag. I get the feeling the workaround may have been >> >> >> > for a specific hardware, so I kept the workaround in there. >> >> >> >> >> >> > Al >> >> >> > >> >> >> > -- >> >> >> > Albert Chu >> >> >> > chu11 at llnl.gov >> >> >> > Computer Scientist >> >> >> > High Performance Systems Division >> >> >> > Lawrence Livermore National Laboratory >> >> >> > >> >> >> > _______________________________________________ >> >> >> > general mailing list >> >> >> > general at lists.openfabrics.org >> >> >> > http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> >> > >> >> >> > To unsubscribe, please visit http:// openib.org/mailman/listinfo/openib-general >> >> >> > >> >> >> >> >> >> There are also 2 for loops which are not correct for some switches: >> >> >> for (i = 1; i <= num_ports; i++) >> >> > >> >> > I guess I've never seen a switch that doesn't go from 1 to num_ports. >> >> > Is there something else I need to handle? >> >> >> >> Yes, per the spec, enhanced SP0 supports PortCounters. All your >> >> switches likely support AllPortSelect so it's not an issue there. >> > >> > Ok I see now. Wasn't aware of it. I'll get a patch together. 
>> > >> > Thanks, >> > Al >> > >> >> -- Hal >> >> >> >> > Al >> >> > >> >> >> -- Hal >> >> >> >> >> > -- >> >> > Albert Chu >> >> > chu11 at llnl.gov >> >> > Computer Scientist >> >> > High Performance Systems Division >> >> > Lawrence Livermore National Laboratory >> >> > >> >> > >> >> >> > -- >> > Albert Chu >> > chu11 at llnl.gov >> > Computer Scientist >> > High Performance Systems Division >> > Lawrence Livermore National Laboratory >> > >> > >> > -- > Albert Chu > chu11 at llnl.gov > Computer Scientist > High Performance Systems Division > Lawrence Livermore National Laboratory > > From hal.rosenstock at gmail.com Thu Oct 9 10:13:48 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 9 Oct 2008 13:13:48 -0400 Subject: [ofa-general] [infiniband-diags] specify -l(loop_ports) in ibclearerrors and ibclearcounters In-Reply-To: <20081009095710.63a2a2a0.weiny2@llnl.gov> References: <1223512850.1197.154.camel@cardanus.llnl.gov> <1223569059.1197.163.camel@cardanus.llnl.gov> <20081009095710.63a2a2a0.weiny2@llnl.gov> Message-ID: On Thu, Oct 9, 2008 at 12:57 PM, Ira Weiny wrote: > On Thu, 09 Oct 2008 09:17:39 -0700 > Al Chu wrote: > >> Hey Hal, >> >> On Wed, 2008-10-08 at 21:54 -0400, Hal Rosenstock wrote: >> > Hi Al, >> > >> > On Wed, Oct 8, 2008 at 8:40 PM, Al Chu wrote: >> > > >> > > Note that these scripts specify -R (reset only) for perfquery, so there >> > > aren't any issues with multiple perfquery outputs that may need to be >> > > parsed multiple times/differently. >> > >> > But this takes many more resets to work, right ? Why not have loop >> > ports (for reset) only do this is all ports is not supported ? >> >> Do you mean remove the --loop_ports option and loop the ports by default >> even if AllPortSelect isn't supported? >> >> I suppose that'd be fine. The reason I added an option was I just >> didn't want to change default behavior. But if everyone is happy just >> making it happen by default, I'm ok with that as well. >> > > I vote yes. 
Have the loop ports happen "under the covers" if the user > specifies '-a' this includes summing the values just like AllPortSelect would > in the switch. If the user specifies "0xff" I think that would be a good way > to request specific functionality and (as per Al's other email) return an error > if AllPortSelect is not supported. I think this neglects whether a single response to the user is intended, This would cause one response per port on a read. -- Hal > Ira > > From chu11 at llnl.gov Thu Oct 9 10:20:33 2008 From: chu11 at llnl.gov (Al Chu) Date: Thu, 09 Oct 2008 10:20:33 -0700 Subject: [ofa-general] [infiniband-diags] add --loop_ports option to perfquery In-Reply-To: References: <1223420045.1197.117.camel@cardanus.llnl.gov> <1223461589.8503.19.camel@whatsup> <1223505919.1197.140.camel@cardanus.llnl.gov> <1223569534.1197.170.camel@cardanus.llnl.gov> Message-ID: <1223572833.1197.181.camel@cardanus.llnl.gov> On Thu, 2008-10-09 at 13:13 -0400, Hal Rosenstock wrote: > Hi Al, > > On Thu, Oct 9, 2008 at 12:25 PM, Al Chu wrote: > > Hey Hal, > > > > On Thu, 2008-10-09 at 08:37 -0400, Hal Rosenstock wrote: > >> Hi again Al, > >> > >> On Wed, Oct 8, 2008 at 6:45 PM, Al Chu wrote: > >> > Hey Hal, > >> > > >> > On Wed, 2008-10-08 at 09:44 -0400, Hal Rosenstock wrote: > >> >> Hi Al, > >> >> > >> >> On Wed, Oct 8, 2008 at 6:26 AM, Al Chu wrote: > >> >> > Hey Hal, > >> >> > > >> >> > On Wed, 2008-10-08 at 07:03 -0400, Hal Rosenstock wrote: > >> >> >> Al, > >> >> >> > >> >> >> On Tue, Oct 7, 2008 at 6:54 PM, Al Chu wrote: > >> >> >> > Hey Sasha, > >> >> >> > > >> >> >> > We have a switch here that does not report the AllPortSelect flag as a > >> >> >> > capability. It's pretty annoying typing each port on the switch or > >> >> >> > always having to script around this one oddball switch we have. So I > >> >> >> > added an option --loop_ports for perfquery. 
If you want to do something > >> >> >> > to all the ports on the CA/Switch, but AllPortSelect isn't available, it > >> >> >> > loops through all the available ports instead. > >> >> >> > >> >> >> Why not add simulated AllPortSelect for multiple ports rather than add > >> >> >> another perquery option for this ? > >> >> > > >> >> > I did try that, and it did seem to work for the switches we had. But > >> >> > when I read the IB spec, it said something to the affect that if a > >> >> > system doesn't support AllPortSelect, setting the PortSelect field to > >> >> > 0xFF was undefined behavior. > >> >> > >> >> I was suggesting that the emulation support (when AllPortSelect is not > >> >> supported) be enhanced for multiple ports and work on both CAs and all > >> >> switches. The one difference is one response for AllPortSelect > >> >> (whether emulated or not) v. many responses for port loop. > >> > > >> > Oh. I thought you were referring the the workaround "simulation" that > >> > was in the original code. But you're referring to aggregating the > >> > data/output make it look like AllPortSelect was supported. I'll put > >> > this on the TODO. > >> > >> So it seems that the reason for adding an additional option for this > >> is that the lack of this support ? Are there any other uses ? > > > > I made the --loop_ports option b/c I just didn't want to change the > > default behavior perfquery. But we could easily make it an automatic if > > the AllPortsSelect flag isn't supported. > > > > Thinking about it, there is a bit of a subtlety in the command line > > options and expectations. > > > > If a user inputs a port of '255', to me this means that the user wants > > to do an AllPortSelect and we should error out if AllPortSelect isn't > > supported. > > > > If a user inputs the '-a' option, it suggests that they want perfquery > > to operate on every port, suggesting that we could automatically loop if > > they don't input --loop_ports. 
> > Yes, there is some redundancy now with port 255 and -a option (both do > the same thing) and they could be made subtly different as you > indicate. > > In the case of query rather than reset, we are still left with the > question of whether to return 1 aggregated response or 1 > response/port. Do we need to support both ? AllPortsSelect has this as > aggregated. I agree that we should aggregate it. One of the reasons I made -- loop_ports an option was b/c I didn't want to change default behavior. If we're going to change the default behavior and remove the option, then I need to aggregate it. > I'm also not sure that the loop ports option is needed. > > IMO all that was needed was to loop on the ports when all ports is not > supported by the PMA and aggregate the counters and then nothing else > needs to change. Should we remove the previous workaround in the code? if (allports == 1) pc[1] = ALL_PORTS; /* fake PortSelect */ Presumably we wouldn't need this anymore. But I'm not sure if this was a specific need for some specific hardware or something. Al > -- Hal > > > Al > > > >> -- Hal > >> > >> >> > >> >> >> > There was already a workaround in the tool for a CA that did not support > >> >> >> > the AllPortSelect flag. I get the feeling the workaround may have been > >> >> >> > for a specific hardware, so I kept the workaround in there. 
> >> >> >> > >> >> >> > Al > >> >> >> > > >> >> >> > -- > >> >> >> > Albert Chu > >> >> >> > chu11 at llnl.gov > >> >> >> > Computer Scientist > >> >> >> > High Performance Systems Division > >> >> >> > Lawrence Livermore National Laboratory > >> >> >> > > >> >> >> > _______________________________________________ > >> >> >> > general mailing list > >> >> >> > general at lists.openfabrics.org > >> >> >> > http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> >> >> > > >> >> >> > To unsubscribe, please visit http:// openib.org/mailman/listinfo/openib-general > >> >> >> > > >> >> >> > >> >> >> There are also 2 for loops which are not correct for some switches: > >> >> >> for (i = 1; i <= num_ports; i++) > >> >> > > >> >> > I guess I've never seen a switch that doesn't go from 1 to num_ports. > >> >> > Is there something else I need to handle? > >> >> > >> >> Yes, per the spec, enhanced SP0 supports PortCounters. All your > >> >> switches likely support AllPortSelect so it's not an issue there. > >> > > >> > Ok I see now. Wasn't aware of it. I'll get a patch together. 
> >> > > >> > Thanks, > >> > Al > >> > > >> >> -- Hal > >> >> > >> >> > Al > >> >> > > >> >> >> -- Hal > >> >> >> > >> >> > -- > >> >> > Albert Chu > >> >> > chu11 at llnl.gov > >> >> > Computer Scientist > >> >> > High Performance Systems Division > >> >> > Lawrence Livermore National Laboratory > >> >> > > >> >> > > >> >> > >> > -- > >> > Albert Chu > >> > chu11 at llnl.gov > >> > Computer Scientist > >> > High Performance Systems Division > >> > Lawrence Livermore National Laboratory > >> > > >> > > >> > > -- > > Albert Chu > > chu11 at llnl.gov > > Computer Scientist > > High Performance Systems Division > > Lawrence Livermore National Laboratory > > > > > -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From hal.rosenstock at gmail.com Thu Oct 9 10:23:34 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 9 Oct 2008 13:23:34 -0400 Subject: [ofa-general] [infiniband-diags] add --loop_ports option to perfquery In-Reply-To: <1223572833.1197.181.camel@cardanus.llnl.gov> References: <1223420045.1197.117.camel@cardanus.llnl.gov> <1223461589.8503.19.camel@whatsup> <1223505919.1197.140.camel@cardanus.llnl.gov> <1223569534.1197.170.camel@cardanus.llnl.gov> <1223572833.1197.181.camel@cardanus.llnl.gov> Message-ID: On Thu, Oct 9, 2008 at 1:20 PM, Al Chu wrote: > On Thu, 2008-10-09 at 13:13 -0400, Hal Rosenstock wrote: >> Hi Al, >> >> On Thu, Oct 9, 2008 at 12:25 PM, Al Chu wrote: >> > Hey Hal, >> > >> > On Thu, 2008-10-09 at 08:37 -0400, Hal Rosenstock wrote: >> >> Hi again Al, >> >> >> >> On Wed, Oct 8, 2008 at 6:45 PM, Al Chu wrote: >> >> > Hey Hal, >> >> > >> >> > On Wed, 2008-10-08 at 09:44 -0400, Hal Rosenstock wrote: >> >> >> Hi Al, >> >> >> >> >> >> On Wed, Oct 8, 2008 at 6:26 AM, Al Chu wrote: >> >> >> > Hey Hal, >> >> >> > >> >> >> > On Wed, 2008-10-08 at 07:03 -0400, Hal Rosenstock wrote: >> >> >> >> Al, >> >> >> >> >> >> >> >> On Tue, Oct 7, 2008 at 6:54 PM, Al Chu wrote: 
>> >> >> >> > Hey Sasha, >> >> >> >> > >> >> >> >> > We have a switch here that does not report the AllPortSelect flag as a >> >> >> >> > capability. It's pretty annoying typing each port on the switch or >> >> >> >> > always having to script around this one oddball switch we have. So I >> >> >> >> > added an option --loop_ports for perfquery. If you want to do something >> >> >> >> > to all the ports on the CA/Switch, but AllPortSelect isn't available, it >> >> >> >> > loops through all the available ports instead. >> >> >> >> >> >> >> >> Why not add simulated AllPortSelect for multiple ports rather than add >> >> >> >> another perquery option for this ? >> >> >> > >> >> >> > I did try that, and it did seem to work for the switches we had. But >> >> >> > when I read the IB spec, it said something to the affect that if a >> >> >> > system doesn't support AllPortSelect, setting the PortSelect field to >> >> >> > 0xFF was undefined behavior. >> >> >> >> >> >> I was suggesting that the emulation support (when AllPortSelect is not >> >> >> supported) be enhanced for multiple ports and work on both CAs and all >> >> >> switches. The one difference is one response for AllPortSelect >> >> >> (whether emulated or not) v. many responses for port loop. >> >> > >> >> > Oh. I thought you were referring the the workaround "simulation" that >> >> > was in the original code. But you're referring to aggregating the >> >> > data/output make it look like AllPortSelect was supported. I'll put >> >> > this on the TODO. >> >> >> >> So it seems that the reason for adding an additional option for this >> >> is that the lack of this support ? Are there any other uses ? >> > >> > I made the --loop_ports option b/c I just didn't want to change the >> > default behavior perfquery. But we could easily make it an automatic if >> > the AllPortsSelect flag isn't supported. >> > >> > Thinking about it, there is a bit of a subtlety in the command line >> > options and expectations. 
>> > >> > If a user inputs a port of '255', to me this means that the user wants >> > to do an AllPortSelect and we should error out if AllPortSelect isn't >> > supported. >> > >> > If a user inputs the '-a' option, it suggests that they want perfquery >> > to operate on every port, suggesting that we could automatically loop if >> > they don't input --loop_ports. >> >> Yes, there is some redundancy now with port 255 and -a option (both do >> the same thing) and they could be made subtly different as you >> indicate. >> >> In the case of query rather than reset, we are still left with the >> question of whether to return 1 aggregated response or 1 >> response/port. Do we need to support both ? AllPortsSelect has this as >> aggregated. > > I agree that we should aggregate it. One of the reasons I made -- > loop_ports an option was b/c I didn't want to change default behavior. > If we're going to change the default behavior and remove the option, > then I need to aggregate it. > >> I'm also not sure that the loop ports option is needed. >> >> IMO all that was needed was to loop on the ports when all ports is not >> supported by the PMA and aggregate the counters and then nothing else >> needs to change. > > Should we remove the previous workaround in the code? > > if (allports == 1) > pc[1] = ALL_PORTS; /* fake PortSelect */ > > Presumably we wouldn't need this anymore. But I'm not sure if this was > a specific need for some specific hardware or something. That was just an easy way to say all ports for those who didn't know 255 was a magic number. That's not for any specific hardware. The original assumption was that all PMAs supported AllPortSelect option but that turned out not to be the case so I started adding the emulation for a single port (which was the one case I was aware of). Clearly, there are other cases. 
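The option semantics proposed in this thread can be sketched as a small decision function. This is only an illustration of the proposal (port 255 is an explicit AllPortSelect request, -a means "every port, by whatever means"); the enum and function names are hypothetical and this is not the actual perfquery code.

```c
#include <stdbool.h>

#define ALL_PORTS 0xFF	/* PortSelect value that requests all ports */

/* Hypothetical outcomes of parsing the command line. */
enum query_plan {
	PLAN_SINGLE_PORT,	 /* query just the given port */
	PLAN_ALL_PORT_SELECT,	 /* one MAD with PortSelect = 0xFF */
	PLAN_LOOP_AND_AGGREGATE, /* one MAD per port, summed locally */
	PLAN_ERROR		 /* explicit 255 but the PMA cannot do it */
};

/*
 * port:               port number given by the user
 * all_flag:           true when -a was given
 * pma_supports_all:   true when the PMA reports the AllPortSelect capability
 *
 * An explicit port of 255 takes precedence over -a.
 */
static enum query_plan
plan_query(int port, bool all_flag, bool pma_supports_all)
{
	if (port == ALL_PORTS)
		return pma_supports_all ? PLAN_ALL_PORT_SELECT : PLAN_ERROR;
	if (all_flag)
		return pma_supports_all ? PLAN_ALL_PORT_SELECT
					: PLAN_LOOP_AND_AGGREGATE;
	return PLAN_SINGLE_PORT;
}
```

With this split, scripts that pass -a keep working on PMAs without AllPortSelect, while an explicit 255 still reports the missing capability.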
-- Hal > > Al > >> -- Hal >> >> > Al >> > >> >> -- Hal >> >> >> >> >> >> >> >> >> > There was already a workaround in the tool for a CA that did not support >> >> >> >> > the AllPortSelect flag. I get the feeling the workaround may have been >> >> >> >> > for a specific hardware, so I kept the workaround in there. >> >> >> >> >> >> >> >> > Al >> >> >> >> > >> >> >> >> > -- >> >> >> >> > Albert Chu >> >> >> >> > chu11 at llnl.gov >> >> >> >> > Computer Scientist >> >> >> >> > High Performance Systems Division >> >> >> >> > Lawrence Livermore National Laboratory >> >> >> >> > >> >> >> >> > _______________________________________________ >> >> >> >> > general mailing list >> >> >> >> > general at lists.openfabrics.org >> >> >> >> > http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> >> >> > >> >> >> >> > To unsubscribe, please visit http:// openib.org/mailman/listinfo/openib-general >> >> >> >> > >> >> >> >> >> >> >> >> There are also 2 for loops which are not correct for some switches: >> >> >> >> for (i = 1; i <= num_ports; i++) >> >> >> > >> >> >> > I guess I've never seen a switch that doesn't go from 1 to num_ports. >> >> >> > Is there something else I need to handle? >> >> >> >> >> >> Yes, per the spec, enhanced SP0 supports PortCounters. All your >> >> >> switches likely support AllPortSelect so it's not an issue there. >> >> > >> >> > Ok I see now. Wasn't aware of it. I'll get a patch together. 
>> >> > >> >> > Thanks, >> >> > Al >> >> > >> >> >> -- Hal >> >> >> >> >> >> > Al >> >> >> > >> >> >> >> -- Hal >> >> >> >> >> >> >> > -- >> >> >> > Albert Chu >> >> >> > chu11 at llnl.gov >> >> >> > Computer Scientist >> >> >> > High Performance Systems Division >> >> >> > Lawrence Livermore National Laboratory >> >> >> > >> >> >> > >> >> >> >> >> > -- >> >> > Albert Chu >> >> > chu11 at llnl.gov >> >> > Computer Scientist >> >> > High Performance Systems Division >> >> > Lawrence Livermore National Laboratory >> >> > >> >> > >> >> >> > -- >> > Albert Chu >> > chu11 at llnl.gov >> > Computer Scientist >> > High Performance Systems Division >> > Lawrence Livermore National Laboratory >> > >> > >> > -- > Albert Chu > chu11 at llnl.gov > Computer Scientist > High Performance Systems Division > Lawrence Livermore National Laboratory > > From sashak at voltaire.com Thu Oct 9 11:32:23 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 9 Oct 2008 20:32:23 +0200 Subject: [ofa-general] Re: [PATCH 3/6] opensm/Unicast Routing Cache: add osm_ucast_cache.{c,h} files In-Reply-To: <48E969A6.1000607@dev.mellanox.co.il> References: <48E969A6.1000607@dev.mellanox.co.il> Message-ID: <20081009183223.GG4912@sashak.voltaire.com> Hi Yevgeny, Comments are below. Also I made couple of incremental changes (which were obvious IMHO), will post later... On 03:28 Mon 06 Oct , Yevgeny Kliteynik wrote: > Implementation of the osm unicast routing cache. Would be nice to have more detailed comment here and also for the "integration" patch. 
>
> Signed-off-by: Yevgeny Kliteynik
> ---
>  opensm/include/opensm/osm_ucast_cache.h |  439 ++++++++++++
>  opensm/opensm/osm_ucast_cache.c         | 1176 +++++++++++++++++++++++++++++++
>  2 files changed, 1615 insertions(+), 0 deletions(-)
>  create mode 100644 opensm/include/opensm/osm_ucast_cache.h
>  create mode 100644 opensm/opensm/osm_ucast_cache.c
>
> diff --git a/opensm/include/opensm/osm_ucast_cache.h b/opensm/include/opensm/osm_ucast_cache.h
> new file mode 100644
> index 0000000..2dc1c4e
> --- /dev/null
> +++ b/opensm/include/opensm/osm_ucast_cache.h

[snip...]

> +/****s* OpenSM: Unicast Cache/osm_ucast_cache_t
> +* NAME
> +* osm_ucast_cache_t
> +*
> +* DESCRIPTION
> +* Unicast Cache structure.
> +*
> +* This object should be treated as opaque and should
> +* be manipulated only through the provided functions.
> +*
> +* SYNOPSIS
> +*/
> +typedef struct _osm_ucast_cache {

There are no _osm_* structure names in OpenSM. Please keep things
consistent. (Will post the patch)

> +	cl_qmap_t sw_tbl;
> +	boolean_t valid;
> +	struct osm_ucast_mgr * p_ucast_mgr;
> +} osm_ucast_cache_t;

The object itself is pretty small; actually it holds only the sw_tbl map
and the valid flag. Why not make it part of struct osm_ucast_mgr and
save a lot of code like p_cache->p_ucast_mgr->..., construct, etc.?

[snip...]

> diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c
> new file mode 100644
> index 0000000..2c2154a
> --- /dev/null
> +++ b/opensm/opensm/osm_ucast_cache.c

[snip...]

> +static cache_switch_t *
> +__cache_sw_new(uint16_t lid_ho)
> +{
> +	cache_switch_t * p_cache_sw =
> +		(cache_switch_t *)malloc(sizeof(cache_switch_t));
> +	if (!p_cache_sw)
> +		return NULL;
> +
> +	memset(p_cache_sw, 0, sizeof(cache_switch_t));
> +
> +	p_cache_sw->ports = (cache_port_t *)malloc(sizeof(cache_port_t));
> +	if (!p_cache_sw->ports) {
> +		free(p_cache_sw);
> +		return NULL;
> +	}

Is it really helpful to alloc only one port at init time and realloc later?
Normally the cache will not be huge, OTOH it saves some flow like
(port_num >= p_cache_sw->num_ports). Maybe I'm missing more complicated
cases?

> +
> +	/* port[0] fields represent this switch details - lid and type */
> +	p_cache_sw->ports[0].remote_lid_ho = lid_ho;
> +	p_cache_sw->ports[0].is_leaf = FALSE;
> +
> +	return p_cache_sw;
> +}

[snip...]

> +/**********************************************************************
> + **********************************************************************/
> +
> +static void
> +__cache_add_port(osm_ucast_cache_t * p_cache,
> +		 uint16_t lid_ho,
> +		 uint8_t port_num,
> +		 uint16_t remote_lid_ho,
> +		 boolean_t is_ca)
> +{
> +	cache_switch_t * p_cache_sw;
> +
> +	OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log);
> +
> +	if (!lid_ho || !remote_lid_ho || !port_num)
> +		goto Exit;
> +
> +	OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE,
> +		"Caching switch port: lid %u [port %u] -> lid %u (%s)\n",
> +		lid_ho, port_num, remote_lid_ho,
> +		(is_ca)? "CA/RTR" : "SW");
> +
> +	p_cache_sw = __cache_get_or_add_sw(p_cache, lid_ho);
> +	if (!p_cache_sw) {
> +		OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_ERROR,
> +			"ERR AD01: Out of memory - cache is invalid\n");
> +		osm_ucast_cache_invalidate(p_cache);
> +		goto Exit;
> +	}
> +
> +	if (port_num >= p_cache_sw->num_ports) {
> +		cache_port_t * ports = (cache_port_t *)
> +			malloc(sizeof(cache_port_t)*(port_num+1));

As Hal already noted, there is no malloc() result check. Also, here and
in other places: malloc() returns 'void *', so why the cast? IMO it is
confusing.

> +		memset(ports, 0, sizeof(cache_port_t)*(port_num+1));

[snip...]
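One way to address both review points (unchecked allocation, needless cast) is to grow the port array through a single checked helper. This is a sketch under assumptions: the two structs are simplified stand-ins for cache_port_t and cache_switch_t, the helper name is hypothetical, and it uses realloc() where the patch uses malloc().

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for the patch's cache_port_t / cache_switch_t. */
typedef struct cache_port {
	uint16_t remote_lid_ho;
	int is_leaf;
} cache_port_t;

typedef struct cache_switch {
	cache_port_t *ports;
	unsigned num_ports;
} cache_switch_t;

/*
 * Make sure sw->ports[port_num] exists.
 * Returns 0 on success, -1 if memory could not be allocated; on failure
 * the old array and num_ports are left untouched, so the caller can
 * invalidate the cache and carry on.
 */
static int cache_sw_grow_ports(cache_switch_t *sw, unsigned port_num)
{
	cache_port_t *ports;
	unsigned new_num = port_num + 1;

	if (port_num < sw->num_ports)
		return 0;	/* already large enough */

	/* no cast needed: realloc() returns void * */
	ports = realloc(sw->ports, sizeof(*ports) * new_num);
	if (!ports)
		return -1;

	/* zero only the newly added tail, existing entries are preserved */
	memset(ports + sw->num_ports, 0,
	       sizeof(*ports) * (new_num - sw->num_ports));

	sw->ports = ports;
	sw->num_ports = new_num;
	return 0;
}
```

A caller in the style of __cache_add_port() would log the out-of-memory error and invalidate the cache when the helper returns -1.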
> +static void
> +__cache_check_link_change(osm_ucast_cache_t * p_cache,
> +			  osm_physp_t * p_physp_1,
> +			  osm_physp_t * p_physp_2)
> +{
> +	OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log);
> +	CL_ASSERT(p_physp_1 && p_physp_2);
> +
> +	if (!p_cache->valid)
> +		goto Exit;
> +
> +	if (!p_physp_1->p_remote_physp && !p_physp_2->p_remote_physp)
> +		/* both ports were down - new link */
> +		goto Exit;
> +
> +	/* unicast cache cannot tolerate any link location change */
> +
> +	if ((p_physp_1->p_remote_physp &&
> +	     p_physp_1->p_remote_physp->p_remote_physp) ||
> +	    (p_physp_2->p_remote_physp &&
> +	     p_physp_2->p_remote_physp->p_remote_physp)) {

Will this handle a port moving during discovery? When duplicated guid
detection is passed, ports may have old "remotes".

> +		OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO,
> +			"Link location change discovered - cache is invalid\n");

Here and in other places in this file: OSM_LOG_INFO is pretty overused
(it is likely this file has more OSM_LOG_INFO than all other OpenSM
parts together). I think all OSM_LOG_VERBOSE should actually be replaced
by OSM_LOG_DEBUG and all OSM_LOG_INFO should become OSM_LOG_VERBOSE.

> +		osm_ucast_cache_invalidate(p_cache);
> +		goto Exit;
> +	}
> +Exit:
> +	OSM_LOG_EXIT(p_cache->p_ucast_mgr->p_log);
> +}

[snip...]
> +static void
> +__cache_restore_ucast_info(osm_ucast_cache_t * p_cache,
> +			   cache_switch_t * p_cache_sw,
> +			   osm_switch_t * p_sw)
> +{
> +	if (!p_cache->valid)
> +		return;
> +
> +	/* when seting unicast info, the cached port
> +	   should have all the required info */
> +	CL_ASSERT(p_cache_sw->max_lid_ho && p_cache_sw->lft &&
> +		  p_cache_sw->num_hops && p_cache_sw->hops);
> +
> +	p_sw->max_lid_ho = p_cache_sw->max_lid_ho;
> +
> +	if (p_sw->lft_buf)
> +		free(p_sw->lft_buf);
> +	p_sw->lft_buf = p_cache_sw->lft;
> +	p_cache_sw->lft = NULL;
> +
> +	p_sw->num_hops = p_cache_sw->num_hops;
> +	p_cache_sw->num_hops = 0;
> +	if (p_sw->hops)
> +		free(p_sw->hops);
> +	p_sw->hops = p_cache_sw->hops;
> +	p_cache_sw->hops = NULL;

This is nice :). sw->hops is an array of pointers which could be
allocated by the routing engine, so in the generic case we will need to
free all sub-buffers first. As far as I can see this function will be
used for freshly discovered switches only; if so it looks correct to me.
Just to be sure...

[snip...]

> +void
> +osm_ucast_cache_validate(osm_ucast_cache_t * p_cache)
> +{
> +	cache_switch_t * p_cache_sw;
> +	cache_switch_t * p_remote_cache_sw;
> +	unsigned port_num;
> +	unsigned max_ports;
> +	uint8_t remote_node_type;
> +	uint16_t lid_ho;
> +	uint16_t remote_lid_ho;
> +	osm_switch_t * p_sw;
> +	osm_switch_t * p_remote_sw;
> +	osm_node_t * p_node;
> +	osm_physp_t * p_physp;
> +	osm_physp_t * p_remote_physp;
> +	osm_port_t * p_remote_port;
> +	cl_qmap_t * p_node_guid_tbl;
> +
> +	OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log);
> +	if (!p_cache->valid)
> +		goto Exit;
> +
> +	/*
> +	 * Scan all the physical switch ports in the subnet.
> +	 * If the port need_update flag is on, check whether
> +	 * it's just some node/port reset or a cached topology
> +	 * change. Otherwise the cache is invalid.
> +	 */
> +	p_node_guid_tbl = &p_cache->p_ucast_mgr->p_subn->node_guid_tbl;

Then it should be: p_sw_tbl = &p_cache->p_ucast_mgr->p_subn->sw_guid_tbl...
Will send the patch...

> +	for (p_node = (osm_node_t *) cl_qmap_head(p_node_guid_tbl);
> +	     p_node != (osm_node_t *) cl_qmap_end(p_node_guid_tbl);
> +	     p_node = (osm_node_t *) cl_qmap_next(&p_node->map_item)) {
> +
> +		if (osm_node_get_type(p_node) != IB_NODE_TYPE_SWITCH)
> +			continue;
> +
> +		lid_ho = cl_ntoh16(osm_node_get_base_lid(p_node,0));
> +		p_cache_sw = __cache_get_sw(p_cache, lid_ho);
> +
> +		p_sw = p_node->sw;
> +		max_ports = osm_node_get_num_physp(p_node);
> +
> +		/* skip port 0 */
> +		for (port_num = 1; port_num < max_ports; port_num++) {
> +
> +			p_physp = osm_node_get_physp_ptr(p_node, port_num);
> +
> +			if (!p_physp || !p_physp->p_remote_physp ||
> +			    !osm_physp_link_exists(p_physp, p_physp->p_remote_physp))
> +				/* no valid link */
> +				continue;
> +
> +			/*
> +			 * While scanning all the physical ports in the subnet,
> +			 * mark corresponding leaf switches in the cache.
> +			 */
> +			if (p_cache_sw &&
> +			    !p_cache_sw->dropped &&
> +			    !__cache_sw_is_leaf(p_cache_sw) &&
> +			    p_physp->p_remote_physp->p_node &&
> +			    osm_node_get_type(
> +				p_physp->p_remote_physp->p_node) !=
> +					IB_NODE_TYPE_SWITCH)
> +				__cache_sw_set_leaf(p_cache_sw);
> +
> +			if (!p_physp->need_update)
> +				continue;
> +
> +			OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE,
> +				"Checking switch lid %u, port %u\n",
> +				lid_ho, port_num);
> +
> +			p_remote_physp = osm_physp_get_remote(p_physp);
> +			remote_node_type = osm_node_get_type(p_remote_physp->p_node);
> +
> +			if (remote_node_type == IB_NODE_TYPE_SWITCH)
> +				remote_lid_ho = cl_ntoh16(osm_node_get_base_lid(
> +					p_remote_physp->p_node, 0));
> +			else
> +				remote_lid_ho = cl_ntoh16(osm_node_get_base_lid(
> +					p_remote_physp->p_node,
> +					osm_physp_get_port_num(p_remote_physp)));
> +
> +			if (!p_cache_sw ||
> +			    port_num >= p_cache_sw->num_ports ||
> +			    !p_cache_sw->ports[port_num].remote_lid_ho) {
> +				/*
> +				 * There is some uncached change on the port.
> + * In general, the reasons might be as follows: > + * - switch reset > + * - port reset (or port down/up) > + * - quick connection location change > + * - new link (or new switch) > + * > + * First two reasons allow cache usage, while > + * the last two reasons should invalidate cache. > + * > + * In case of quick connection location change, > + * cache would have been invalidated by > + * osm_ucast_cache_check_new_link() function. > + * > + * In case of new link between two known nodes, > + * cache also would have been invalidated by > + * osm_ucast_cache_check_new_link() function. > + * > + * Another reason is cached link between two > + * known switches went back. In this case the > + * osm_ucast_cache_check_new_link() function would > + * clear both sides of the link from the cache > + * during the discovery process, so effectively > + * this would be equivalent to port reset. > + * > + * So three possible reasons remain: > + * - switch reset > + * - port reset (or port down/up) > + * - link of a new switch > + * > + * To validate cache, we need to check only the > + * third reason - link of a new node/switch: > + * - If this is the local switch that is new, > + * then it should have (p_sw->need_update == 2). > + * - If the remote node is switch and it's new, > + * then it also should have > + * (p_sw->need_update == 2). > + * - If the remote node is CA/RTR and it's new, > + * then its port should have is_new flag on. 
> + */ > + if (p_sw->need_update == 2) { > + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, > + "New switch found (lid %u) - " > + "cache is invalid\n", > + lid_ho); > + osm_ucast_cache_invalidate(p_cache); > + goto Exit; > + } > + > + if (remote_node_type == IB_NODE_TYPE_SWITCH) { > + > + p_remote_sw = p_remote_physp->p_node->sw; > + if (p_remote_sw->need_update == 2) { > + /* this could also be case of > + switch coming back with an > + additional link that it > + didn't have before */ > + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, > + "New switch/link found (lid %u) - " > + "cache is invalid\n", > + remote_lid_ho); > + osm_ucast_cache_invalidate(p_cache); > + goto Exit; > + } Maybe not related to cache directly, but anyway related to "fast" switch reset. When it happened we need to resend whole LFTs (recalculated or from cache) to this switch. Is it handled? Was it handled? [snip...] > +void > +osm_ucast_cache_check_new_link(osm_ucast_cache_t * p_cache, > + osm_node_t * p_node_1, > + uint8_t port_num_1, > + osm_node_t * p_node_2, > + uint8_t port_num_2) > +{ > + uint16_t lid_ho_1; > + uint16_t lid_ho_2; > + > + OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); > + > + if (!p_cache->valid) > + goto Exit; > + > + __cache_check_link_change(p_cache, > + osm_node_get_physp_ptr(p_node_1, port_num_1), > + osm_node_get_physp_ptr(p_node_2, port_num_2)); > + > + if (!p_cache->valid) > + goto Exit; > + > + if (osm_node_get_type(p_node_1) != IB_NODE_TYPE_SWITCH && > + osm_node_get_type(p_node_2) != IB_NODE_TYPE_SWITCH) { > + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, > + "Found CA/RTR-2-CA/RTR link - cache is invalid\n"); > + osm_ucast_cache_invalidate(p_cache); > + goto Exit; > + } Here and in other places, should we care about back-to-back connections? Maybe we need just to disable cache at all there is no switches in a fabric... [snip...] 
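The validation rules spelled out in the quoted comment above can be condensed into a small decision function. This is only a sketch under stated assumptions: `demo_port_change_t` and `demo_change_keeps_cache()` are hypothetical names, and the fields merely mirror the `need_update == 2` and `is_new` conditions described in that comment. It is not the code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one switch port that changed
 * without a matching entry in the cache. */
typedef struct demo_port_change {
	int local_sw_need_update;	/* 2 stands for "newly discovered switch" */
	bool remote_is_switch;		/* remote end of the link is a switch */
	int remote_sw_need_update;	/* meaningful only if remote_is_switch */
	bool remote_port_is_new;	/* meaningful only for CA/RTR remotes */
} demo_port_change_t;

/* Returns true if the cache may still be used after this change,
 * i.e. the change looks like a switch or port reset rather than a
 * link of a new node. */
static bool demo_change_keeps_cache(const demo_port_change_t *c)
{
	assert(c != NULL);

	/* the local switch itself is new */
	if (c->local_sw_need_update == 2)
		return false;

	/* the remote switch is new (or came back with a new link) */
	if (c->remote_is_switch)
		return c->remote_sw_need_update != 2;

	/* the remote CA/RTR port is new */
	return !c->remote_port_is_new;
}
```

With this reading, a plain port reset (no new switch, no new port) is the only case that leaves the cache usable.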
> +void
> +osm_ucast_cache_add_link(osm_ucast_cache_t * p_cache,
> + osm_node_t * p_node_1,
> + uint8_t port_num_1,
> + osm_node_t * p_node_2,
> + uint8_t port_num_2)

I looked at the places where this function is used and think it would be simpler to use a prototype like:

osm_ucast_cache_add_link(osm_ucast_cache_t * p_cache, osm_physp_t *physp1, osm_physp_t *physp2)

Will send the patch.

> +{
> + uint16_t lid_ho_1;
> + uint16_t lid_ho_2;
> +
> + OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log);
> +
> + if (!p_cache->valid)
> + goto Exit;
> +
> + if (osm_node_get_type(p_node_1) != IB_NODE_TYPE_SWITCH &&
> + osm_node_get_type(p_node_2) != IB_NODE_TYPE_SWITCH) {
> + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO,
> + "Dropping CA-2-CA link - cache invalid\n");
> + osm_ucast_cache_invalidate(p_cache);
> + goto Exit;
> + }

back-to-back again...

> +
> + if (((osm_node_get_type(p_node_1) == IB_NODE_TYPE_SWITCH) &&
> + (!osm_node_get_physp_ptr(p_node_1, 0) ||
> + !osm_physp_is_valid(osm_node_get_physp_ptr(p_node_1, 0)))) ||
> + ((osm_node_get_type(p_node_2) == IB_NODE_TYPE_SWITCH) &&
> + (!osm_node_get_physp_ptr(p_node_2, 0) ||
> + !osm_physp_is_valid(osm_node_get_physp_ptr(p_node_2, 0))))) {
> + /* we're caching a link when one of the nodes
> + has already been dropped and cached */

osm_node_get_physp_ptr() already checks port validity and returns NULL if the port is not valid. Patch...

[snip...]
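As a rough illustration of the last remark: when a lookup helper already returns NULL for an invalid port, the caller needs only the NULL test, and a second validity check on the returned pointer adds nothing. The types and functions below (`demo_node_t`, `demo_physp_t`, `demo_get_physp_ptr()`, `demo_port0_dropped()`) are simplified stand-ins invented for this sketch, not the OpenSM definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the physical port and node objects. */
typedef struct demo_physp {
	int valid;		/* non-zero once the port has been initialized */
	uint8_t port_num;
} demo_physp_t;

typedef struct demo_node {
	uint8_t num_ports;	/* number of entries used in ports[] */
	demo_physp_t ports[4];
} demo_node_t;

/* Returns the port only if it exists and is valid, NULL otherwise.
 * The validity check lives here, so callers do not repeat it. */
static demo_physp_t *demo_get_physp_ptr(demo_node_t *node, uint8_t port_num)
{
	assert(node != NULL);
	if (port_num >= node->num_ports)
		return NULL;
	if (!node->ports[port_num].valid)
		return NULL;
	return &node->ports[port_num];
}

/* Port 0 counts as dropped when the getter yields nothing. */
static int demo_port0_dropped(demo_node_t *node)
{
	return demo_get_physp_ptr(node, 0) == NULL;
}
```

The simplified condition in the follow-up patch has the same shape: one call to the getter per node and a single NULL comparison.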
> +void > +osm_ucast_cache_add_node(osm_ucast_cache_t * p_cache, > + osm_node_t * p_node) > +{ > + uint16_t lid_ho; > + uint8_t max_ports; > + uint8_t port_num; > + osm_physp_t * p_physp; > + osm_node_t * p_remote_node; > + cache_switch_t * p_cache_sw; > + > + OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); > + > + if (!p_cache->valid) > + goto Exit; > + > + if (osm_node_get_type(p_node) == IB_NODE_TYPE_SWITCH) { > + > + lid_ho = cl_ntoh16(osm_node_get_base_lid(p_node,0)); > + > + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE, > + "Caching dropped switch lid %u\n", lid_ho); > + > + if (!p_node->sw) { > + /* something is wrong - forget about cache */ > + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_ERROR, > + "ERR AD02: no switch info for node lid %u -" > + " clearing cache\n", lid_ho); > + osm_ucast_cache_invalidate(p_cache); > + goto Exit; > + } > + > + /* unlink (add to cache) all the ports of this switch */ > + max_ports = osm_node_get_num_physp(p_node); > + for (port_num = 1; port_num < max_ports; port_num++) { > + > + p_physp = osm_node_get_physp_ptr(p_node, port_num); > + if (!p_physp || !p_physp->p_node || > + !p_physp->p_remote_physp || > + !p_physp->p_remote_physp->p_node) Can p_physp->p_node be NULL? > + continue; > + > + osm_ucast_cache_add_link(p_cache, p_node, port_num, > + p_physp->p_remote_physp->p_node, > + p_physp->p_remote_physp->port_num); > + } > + > + /* > + * All the ports have been dropped (cached). > + * If one of the ports was connected to CA/RTR, > + * then the cached switch would be marked as leaf. > + * If it isn't, then the dropped switch isn't a leaf, > + * and cache can't handle it. 
> + */ > + > + p_cache_sw = __cache_get_sw(p_cache, lid_ho); > + CL_ASSERT(p_cache_sw); > + > + if (!__cache_sw_is_leaf(p_cache_sw)) { > + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, > + "Dropped non-leaf switch (lid %u) - " > + "cache is invalid\n", lid_ho); > + osm_ucast_cache_invalidate(p_cache); > + goto Exit; > + } > + > + p_cache_sw->dropped = TRUE; > + > + if (!p_node->sw->num_hops || !p_node->sw->hops) { > + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, > + "No LID matrices for switch lid %u - " > + "cache is invalid\n", lid_ho); > + osm_ucast_cache_invalidate(p_cache); > + goto Exit; > + } > + > + /* lid matrices */ > + > + p_cache_sw->num_hops = p_node->sw->num_hops; > + p_node->sw->num_hops = 0; > + p_cache_sw->hops = p_node->sw->hops; > + p_node->sw->hops = NULL; > + > + /* linear forwarding table */ > + > + p_cache_sw->lft = p_node->sw->lft_buf; > + p_node->sw->lft_buf = NULL; > + p_cache_sw->max_lid_ho = p_node->sw->max_lid_ho; > + } > + else { > + /* dropping CA/RTR: add to cache all the ports of this switch */ > + max_ports = osm_node_get_num_physp(p_node); > + for (port_num = 0; port_num < max_ports; port_num++) { Any reason to start from port 0 and not 1? > + > + p_physp = osm_node_get_physp_ptr(p_node, port_num); > + if (!p_physp || !p_physp->p_node || > + !p_physp->p_remote_physp || > + !p_physp->p_remote_physp->p_node) > + continue; > + > + p_remote_node = p_physp->p_remote_physp->p_node; > + if (osm_node_get_type(p_remote_node) != > + IB_NODE_TYPE_SWITCH) { > + /* CA/RTR to CA/RTR connection */ > + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, > + "Dropping CA/RTR to CA/RTR connection - " > + "cache is invalid\n"); > + osm_ucast_cache_invalidate(p_cache); > + goto Exit; > + } back-to-back connection? Will send the patches. 
Sasha From sashak at voltaire.com Thu Oct 9 11:34:02 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 9 Oct 2008 20:34:02 +0200 Subject: [ofa-general] [PATCH] opensm/ucast_cache: remove redundant osm_physp_is_valid() check In-Reply-To: <48E969A6.1000607@dev.mellanox.co.il> References: <48E969A6.1000607@dev.mellanox.co.il> Message-ID: <20081009183402.GH4912@sashak.voltaire.com> osm_node_get_physp_ptr() return non-NULL values only for valid physical ports - remove redundant checks. Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_ucast_cache.c | 10 ++++------ 1 files changed, 4 insertions(+), 6 deletions(-) diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c index 355716a..3c32d35 100644 --- a/opensm/opensm/osm_ucast_cache.c +++ b/opensm/opensm/osm_ucast_cache.c @@ -963,12 +963,10 @@ osm_ucast_cache_add_link(osm_ucast_cache_t * p_cache, goto Exit; } - if (((osm_node_get_type(p_node_1) == IB_NODE_TYPE_SWITCH) && - (!osm_node_get_physp_ptr(p_node_1, 0) || - !osm_physp_is_valid(osm_node_get_physp_ptr(p_node_1, 0)))) || - ((osm_node_get_type(p_node_2) == IB_NODE_TYPE_SWITCH) && - (!osm_node_get_physp_ptr(p_node_2, 0) || - !osm_physp_is_valid(osm_node_get_physp_ptr(p_node_2, 0))))) { + if ((osm_node_get_type(p_node_1) == IB_NODE_TYPE_SWITCH && + !osm_node_get_physp_ptr(p_node_1, 0)) || + (osm_node_get_type(p_node_2) == IB_NODE_TYPE_SWITCH && + !osm_node_get_physp_ptr(p_node_2, 0))) { /* we're caching a link when one of the nodes has already been dropped and cached */ OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE, -- 1.6.0.1.196.g01914 From sashak at voltaire.com Thu Oct 9 11:35:08 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 9 Oct 2008 20:35:08 +0200 Subject: [ofa-general] [PATCH] opensm/ucast_cache: rename struct _osm_ucast_cache to osm_ucast_cache In-Reply-To: <48E969A6.1000607@dev.mellanox.co.il> References: <48E969A6.1000607@dev.mellanox.co.il> Message-ID: 
<20081009183508.GI4912@sashak.voltaire.com> To be consistent with other names rename struct _osm_ucast_cache to osm_ucast_cache. Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_ucast_cache.h | 2 +- opensm/include/opensm/osm_ucast_mgr.h | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/opensm/include/opensm/osm_ucast_cache.h b/opensm/include/opensm/osm_ucast_cache.h index 7f01876..caffc83 100644 --- a/opensm/include/opensm/osm_ucast_cache.h +++ b/opensm/include/opensm/osm_ucast_cache.h @@ -90,7 +90,7 @@ struct osm_ucast_mgr; * * SYNOPSIS */ -typedef struct _osm_ucast_cache { +typedef struct osm_ucast_cache { cl_qmap_t sw_tbl; boolean_t valid; struct osm_ucast_mgr * p_ucast_mgr; diff --git a/opensm/include/opensm/osm_ucast_mgr.h b/opensm/include/opensm/osm_ucast_mgr.h index e4006bb..5575d20 100644 --- a/opensm/include/opensm/osm_ucast_mgr.h +++ b/opensm/include/opensm/osm_ucast_mgr.h @@ -78,7 +78,7 @@ BEGIN_C_DECLS * *********/ struct osm_sm; -struct _osm_ucast_cache; +struct osm_ucast_cache; /****s* OpenSM: Unicast Manager/osm_ucast_mgr_t * NAME * osm_ucast_mgr_t @@ -99,7 +99,7 @@ typedef struct osm_ucast_mgr { cl_qlist_t port_order_list; boolean_t is_dor; boolean_t some_hop_count_set; - struct _osm_ucast_cache *p_cache; + struct osm_ucast_cache *p_cache; } osm_ucast_mgr_t; /* * FIELDS -- 1.6.0.1.196.g01914 From sashak at voltaire.com Thu Oct 9 11:35:47 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 9 Oct 2008 20:35:47 +0200 Subject: [ofa-general] Re: [PATCH 3/6] opensm/Unicast Routing Cache: add osm_ucast_cache.{c,h} files In-Reply-To: <48E969A6.1000607@dev.mellanox.co.il> References: <48E969A6.1000607@dev.mellanox.co.il> Message-ID: <20081009183547.GJ4912@sashak.voltaire.com> >From 8f361ce2159c8a37a68a472abd495208b02a7279 Mon Sep 17 00:00:00 2001 From: Sasha Khapyorsky Date: Wed, 8 Oct 2008 15:42:40 +0200 Subject: [PATCH] opensm/ucast_cache: simplify osm_ucast_cache_add_link() prototype Simplify 
osm_ucast_cache_add_link() prototype - instead of pair of node, port number parameter list get just two pointers to physical ports. Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_ucast_cache.h | 19 +++--------- opensm/opensm/osm_drop_mgr.c | 5 +-- opensm/opensm/osm_port_info_rcv.c | 5 +-- opensm/opensm/osm_ucast_cache.c | 49 +++++++++++++----------------- 4 files changed, 29 insertions(+), 49 deletions(-) diff --git a/opensm/include/opensm/osm_ucast_cache.h b/opensm/include/opensm/osm_ucast_cache.h index caffc83..3ecd2d8 100644 --- a/opensm/include/opensm/osm_ucast_cache.h +++ b/opensm/include/opensm/osm_ucast_cache.h @@ -282,26 +282,17 @@ osm_ucast_cache_check_new_link(osm_ucast_cache_t * p_cache, */ void osm_ucast_cache_add_link(osm_ucast_cache_t * p_cache, - osm_node_t * p_node_1, - uint8_t port_num_1, - osm_node_t * p_node_2, - uint8_t port_num_2); + osm_physp_t *physp1, osm_physp_t *physp2); /* * PARAMETERS * p_cache * [in] Pointer to the cache object. * -* p_node_1 -* [in] Pointer to the first node of the link. -* -* port_num_1 -* [in] Port number on the first node of the link. +* physp1 +* [in] Pointer to the first physical port of the link. * -* p_node_2 -* [in] Pointer to the second node of the link. -* -* port_num_2 -* [in] Port number on the second node of the link. +* physp2 +* [in] Pointer to the second physical port of the link. * * RETURN VALUE * This function does not return any value. 
diff --git a/opensm/opensm/osm_drop_mgr.c b/opensm/opensm/osm_drop_mgr.c index 5fc46a9..eceb9a6 100644 --- a/opensm/opensm/osm_drop_mgr.c +++ b/opensm/opensm/osm_drop_mgr.c @@ -137,10 +137,7 @@ static void drop_mgr_clean_physp(osm_sm_t * sm, IN osm_physp_t * p_physp) if (sm->p_subn->opt.use_ucast_cache) osm_ucast_cache_add_link(sm->ucast_mgr.p_cache, - p_physp->p_node, - p_physp->port_num, - p_remote_physp->p_node, - p_remote_physp->port_num); + p_physp, p_remote_physp); osm_physp_unlink(p_physp, p_remote_physp); diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c index d8d2021..8004fa8 100644 --- a/opensm/opensm/osm_port_info_rcv.c +++ b/opensm/opensm/osm_port_info_rcv.c @@ -247,9 +247,8 @@ __osm_pi_rcv_process_switch_port(IN osm_sm_t * sm, if (sm->p_subn->opt.use_ucast_cache) osm_ucast_cache_add_link(sm->ucast_mgr.p_cache, - p_node, port_num, - p_remote_node, - remote_port_num); + p_physp, + p_remote_physp); osm_node_unlink(p_node, (uint8_t) port_num, p_remote_node, diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c index 57dc0a0..e5264cd 100644 --- a/opensm/opensm/osm_ucast_cache.c +++ b/opensm/opensm/osm_ucast_cache.c @@ -929,15 +929,11 @@ Exit: /********************************************************************** **********************************************************************/ -void -osm_ucast_cache_add_link(osm_ucast_cache_t * p_cache, - osm_node_t * p_node_1, - uint8_t port_num_1, - osm_node_t * p_node_2, - uint8_t port_num_2) +void osm_ucast_cache_add_link(osm_ucast_cache_t * p_cache, + osm_physp_t *physp1, osm_physp_t *physp2) { - uint16_t lid_ho_1; - uint16_t lid_ho_2; + osm_node_t *p_node_1 = physp1->p_node, *p_node_2 = physp2->p_node; + uint16_t lid_ho_1, lid_ho_2; OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); @@ -961,7 +957,7 @@ osm_ucast_cache_add_link(osm_ucast_cache_t * p_cache, OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE, "Port %u <-> port %u: port0 on one of the nodes" "has 
already been dropped and cached\n", - port_num_1, port_num_2); + physp1->port_num, physp2->port_num); goto Exit; } @@ -969,12 +965,11 @@ osm_ucast_cache_add_link(osm_ucast_cache_t * p_cache, simplicity, make sure that it's the first node. */ if (osm_node_get_type(p_node_1) != IB_NODE_TYPE_SWITCH) { - osm_node_t * tmp_node = p_node_1; - uint8_t tmp_port_num = port_num_1; - p_node_1 = p_node_2; - port_num_1 = port_num_2; - p_node_2 = tmp_node; - port_num_2 = tmp_port_num; + osm_physp_t *tmp = physp1; + physp1 = physp2; + physp2 = tmp; + p_node_1 = physp1->p_node; + p_node_2 = physp2->p_node; } if (!p_node_1->sw) { @@ -983,7 +978,7 @@ osm_ucast_cache_add_link(osm_ucast_cache_t * p_cache, goto Exit; } - lid_ho_1 = cl_ntoh16(osm_node_get_base_lid(p_node_1,0)); + lid_ho_1 = cl_ntoh16(osm_node_get_base_lid(p_node_1, 0)); if (osm_node_get_type(p_node_2) == IB_NODE_TYPE_SWITCH) { @@ -993,20 +988,19 @@ osm_ucast_cache_add_link(osm_ucast_cache_t * p_cache, goto Exit; } - lid_ho_2 = cl_ntoh16(osm_node_get_base_lid(p_node_2,0)); + lid_ho_2 = cl_ntoh16(osm_node_get_base_lid(p_node_2, 0)); /* lost switch-2-switch link - cache both sides */ - __cache_add_port(p_cache, lid_ho_1, port_num_1, + __cache_add_port(p_cache, lid_ho_1, physp1->port_num, lid_ho_2, FALSE); - __cache_add_port(p_cache, lid_ho_2, port_num_2, + __cache_add_port(p_cache, lid_ho_2, physp2->port_num, lid_ho_1, FALSE); } else { - lid_ho_2 = cl_ntoh16( - osm_node_get_base_lid(p_node_2, port_num_2)); + lid_ho_2 = cl_ntoh16(osm_physp_get_base_lid(physp2)); /* lost link to CA/RTR - cache only switch side */ - __cache_add_port(p_cache, lid_ho_1, port_num_1, + __cache_add_port(p_cache, lid_ho_1, physp1->port_num, lid_ho_2, TRUE); } @@ -1059,9 +1053,8 @@ osm_ucast_cache_add_node(osm_ucast_cache_t * p_cache, !p_physp->p_remote_physp->p_node) continue; - osm_ucast_cache_add_link(p_cache, p_node, port_num, - p_physp->p_remote_physp->p_node, - p_physp->p_remote_physp->port_num); + osm_ucast_cache_add_link(p_cache, p_physp, + 
p_physp->p_remote_physp); } /* @@ -1128,9 +1121,9 @@ osm_ucast_cache_add_node(osm_ucast_cache_t * p_cache, goto Exit; } - osm_ucast_cache_add_link(p_cache, p_remote_node, - p_physp->p_remote_physp->port_num, - p_node, port_num); + osm_ucast_cache_add_link(p_cache, + p_physp->p_remote_physp, + p_physp); } } Exit: -- 1.6.0.1.196.g01914 From sashak at voltaire.com Thu Oct 9 11:36:35 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 9 Oct 2008 20:36:35 +0200 Subject: [ofa-general] [PATCH] opensm/ucast_cache: loop over switches table to scan switches In-Reply-To: <48E969A6.1000607@dev.mellanox.co.il> References: <48E969A6.1000607@dev.mellanox.co.il> Message-ID: <20081009183635.GK4912@sashak.voltaire.com> Cache validator scans all switches by looping over all fabric nodes table. This patch improves this slightly by using a fabric switches table instead. Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_ucast_cache.c | 16 ++++++---------- 1 files changed, 6 insertions(+), 10 deletions(-) diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c index e5264cd..df8b7d2 100644 --- a/opensm/opensm/osm_ucast_cache.c +++ b/opensm/opensm/osm_ucast_cache.c @@ -525,7 +525,7 @@ static void ucast_cache_validate(osm_ucast_cache_t * p_cache) osm_physp_t * p_physp; osm_physp_t * p_remote_physp; osm_port_t * p_remote_port; - cl_qmap_t * p_node_guid_tbl; + cl_qmap_t * p_sw_tbl; OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); if (!p_cache->valid) @@ -537,18 +537,14 @@ static void ucast_cache_validate(osm_ucast_cache_t * p_cache) * it's just some node/port reset or a cached topology * change. Otherwise the cache is invalid. 
*/ - p_node_guid_tbl = &p_cache->p_ucast_mgr->p_subn->node_guid_tbl; - for (p_node = (osm_node_t *) cl_qmap_head(p_node_guid_tbl); - p_node != (osm_node_t *) cl_qmap_end(p_node_guid_tbl); - p_node = (osm_node_t *) cl_qmap_next(&p_node->map_item)) { - - if (osm_node_get_type(p_node) != IB_NODE_TYPE_SWITCH) - continue; + p_sw_tbl = &p_cache->p_ucast_mgr->p_subn->sw_guid_tbl; + for (p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl); + p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl); + p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item)) { + p_node = p_sw->p_node; lid_ho = cl_ntoh16(osm_node_get_base_lid(p_node,0)); p_cache_sw = __cache_get_sw(p_cache, lid_ho); - - p_sw = p_node->sw; max_ports = osm_node_get_num_physp(p_node); /* skip port 0 */ -- 1.6.0.1.196.g01914 From sashak at voltaire.com Thu Oct 9 12:01:29 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 9 Oct 2008 21:01:29 +0200 Subject: [ofa-general] Re: [PATCH][MINOR] infiniband-diags/ibsysstat.c: Fix a couple of latent bugs In-Reply-To: <48E1399D.6070606@obsidianresearch.com> References: <48E1399D.6070606@obsidianresearch.com> Message-ID: <20081009190129.GO4912@sashak.voltaire.com> On 14:25 Mon 29 Sep , Hal Rosenstock wrote: > Sasha, > > This patch is based on a code inspection of ibsysstat.c due to the buffer > overflow observed with more than 2 CPUs. It fixes a couple of latent bugs > although won't improve that particular issue. > > -- Hal > infiniband-diags/ibsysstat.c: Fix latent bugs related to build_cpuinfo > (exceeding MAX_CPUS and not being able to open /proc/cpuinfo) > Also, issue warning when cpuinfo is truncated due to lack of room > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From sashak at voltaire.com Thu Oct 9 12:04:38 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 9 Oct 2008 21:04:38 +0200 Subject: [ofa-general] Re: [PATCHv2][TRIVIAL] OpenSM: Display port number in decimal in log messages In-Reply-To: References: <48E67B71.8050508@obsidianresearch.com> <20081007235303.GG7563@sashak.voltaire.com> Message-ID: <20081009190438.GP4912@sashak.voltaire.com> Hi Hal, On 07:04 Wed 08 Oct , Hal Rosenstock wrote: > > v2 means a replacement patch not built on top of the previous one. Missed 'v2' :( . Anyway comment (as in v1 or modified) was needed. Sasha From sashak at voltaire.com Thu Oct 9 12:05:49 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 9 Oct 2008 21:05:49 +0200 Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer overflow In-Reply-To: References: <20081008012149.GK7563@sashak.voltaire.com> Message-ID: <20081009190549.GQ4912@sashak.voltaire.com> On 07:04 Wed 08 Oct , Hal Rosenstock wrote: > > Minor simplification as it seems like this could just be: > > if (++lanes_needed > p_lash->vl_min) > goto Error_Not_Enough_Lanes; Works for me. Thanks! Sasha From sashak at voltaire.com Thu Oct 9 12:39:07 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 9 Oct 2008 21:39:07 +0200 Subject: [ofa-general] [infiniband-diags] specify -l(loop_ports) in ibclearerrors and ibclearcounters In-Reply-To: <1223569901.1197.175.camel@cardanus.llnl.gov> References: <1223512850.1197.154.camel@cardanus.llnl.gov> <1223569901.1197.175.camel@cardanus.llnl.gov> Message-ID: <20081009193907.GR4912@sashak.voltaire.com> On 09:31 Thu 09 Oct , Al Chu wrote: > > I just re-read this sentence, now I get what you're asking. I made -- > loop_ports only loop if AllPortSelect isn't supported. So if the > switches already support AllPortSelect, the number of resets should be > the same. OTOH perfquery is diagnostic tool. 
Reset with AllPortSelect may not work as expected, or a user may have another reason to loop over ports regardless of AllPortSelect support.

Sasha

From sashak at voltaire.com Thu Oct 9 12:42:17 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 9 Oct 2008 21:42:17 +0200
Subject: [ofa-general] Re: [infiniband-diags] support ehanced port 0 with --loop-ports in perfquery
In-Reply-To: <1223512851.1197.155.camel@cardanus.llnl.gov>
References: <1223512851.1197.155.camel@cardanus.llnl.gov>
Message-ID: <20081009194217.GS4912@sashak.voltaire.com>

On 17:40 Wed 08 Oct , Al Chu wrote:
> Hey Sasha,
>
> Fixes the enhanced port 0 issue Hal referred to in the previous thread.
>
> Al
>
> --
> Albert Chu
> chu11 at llnl.gov
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory
> From cd7726543c2e0f6569a02806875ec1935d213c01 Mon Sep 17 00:00:00 2001
> From: Albert Chu
> Date: Wed, 8 Oct 2008 17:28:40 -0700
> Subject: [PATCH] support ehanced port 0 with --loop_ports
>
>
> Signed-off-by: Albert Chu

Applied. Thanks.

Sasha

From sashak at voltaire.com Thu Oct 9 12:44:07 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 9 Oct 2008 21:44:07 +0200
Subject: [ofa-general] Re: [infiniband-diags] [trivial] fix comments in perfquery
In-Reply-To: <1223512852.1197.156.camel@cardanus.llnl.gov>
References: <1223512852.1197.156.camel@cardanus.llnl.gov>
Message-ID: <20081009194407.GT4912@sashak.voltaire.com>

On 17:40 Wed 08 Oct , Al Chu wrote:
> Hey Sasha,
>
> Dumb patch. I realized the comments weren't in the right place.
>
> Al
>
> --
> Albert Chu
> chu11 at llnl.gov
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory
> From f09c9a379069c96b4281ff3144359c46b5c96f60 Mon Sep 17 00:00:00 2001
> From: Albert Chu
> Date: Wed, 8 Oct 2008 17:28:54 -0700
> Subject: [PATCH] fix perfquery comment
>
>
> Signed-off-by: Albert Chu

Applied. Thanks.

Sasha From chu11 at llnl.gov Thu Oct 9 13:32:14 2008 From: chu11 at llnl.gov (Al Chu) Date: Thu, 09 Oct 2008 13:32:14 -0700 Subject: [ofa-general] [infiniband-diags] specify -l(loop_ports) in ibclearerrors and ibclearcounters In-Reply-To: <20081009193907.GR4912@sashak.voltaire.com> References: <1223512850.1197.154.camel@cardanus.llnl.gov> <1223569901.1197.175.camel@cardanus.llnl.gov> <20081009193907.GR4912@sashak.voltaire.com> Message-ID: <1223584334.1197.186.camel@cardanus.llnl.gov> Hey Sasha, Hal, all, Now I'm thinking this: if user specifies port 255 - they are expecting AllPortSelect. So error out. if user specifies '-a' - Do whatever it takes to query/reset on all ports. If it means looping do that. Aggregate counters appropriately for a single output. if user specifies --loop-ports - loop through the ports one by one no matter what. This includes outputting port counters for each port. Sound like a good idea? Al On Thu, 2008-10-09 at 21:39 +0200, Sasha Khapyorsky wrote: > On 09:31 Thu 09 Oct , Al Chu wrote: > > > > I just re-read this sentence, now I get what you're asking. I made -- > > loop_ports only loop if AllPortSelect isn't supported. So if the > > switches already support AllPortSelect, the number of resets should be > > the same. > > OTOH perfquery is diagnostic tool. Reset with AllPortSelect may not work > as expected or an user may have another reason to loop over ports > regardless to AllPortSelect support. 
> > Sasha -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From sashak at voltaire.com Thu Oct 9 13:47:18 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 9 Oct 2008 22:47:18 +0200 Subject: [ofa-general] [infiniband-diags] specify -l(loop_ports) in ibclearerrors and ibclearcounters In-Reply-To: <1223584334.1197.186.camel@cardanus.llnl.gov> References: <1223512850.1197.154.camel@cardanus.llnl.gov> <1223569901.1197.175.camel@cardanus.llnl.gov> <20081009193907.GR4912@sashak.voltaire.com> <1223584334.1197.186.camel@cardanus.llnl.gov> Message-ID: <20081009204718.GU4912@sashak.voltaire.com> Hi Al, On 13:32 Thu 09 Oct , Al Chu wrote: > > if user specifies port 255 - they are expecting AllPortSelect. So error > out. > > if user specifies '-a' - Do whatever it takes to query/reset on all > ports. If it means looping do that. Aggregate counters appropriately > for a single output. > > if user specifies --loop-ports - loop through the ports one by one no > matter what. This includes outputting port counters for each port. > > Sound like a good idea? Yes. For me. Sasha From hal.rosenstock at gmail.com Thu Oct 9 13:52:06 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 9 Oct 2008 16:52:06 -0400 Subject: ***SPAM*** Re: [ofa-general] [infiniband-diags] specify -l(loop_ports) in ibclearerrors and ibclearcounters In-Reply-To: <1223584334.1197.186.camel@cardanus.llnl.gov> References: <1223512850.1197.154.camel@cardanus.llnl.gov> <1223569901.1197.175.camel@cardanus.llnl.gov> <20081009193907.GR4912@sashak.voltaire.com> <1223584334.1197.186.camel@cardanus.llnl.gov> Message-ID: On Thu, Oct 9, 2008 at 4:32 PM, Al Chu wrote: > Hey Sasha, Hal, all, > > Now I'm thinking this: > > if user specifies port 255 - they are expecting AllPortSelect. So error > out. > > if user specifies '-a' - Do whatever it takes to query/reset on all > ports. If it means looping do that. 
Aggregate counters appropriately > for a single output. > > if user specifies --loop-ports - loop through the ports one by one no > matter what. This includes outputting port counters for each port. > > Sound like a good idea? Sounds good to me too. -- Hal > Al > > On Thu, 2008-10-09 at 21:39 +0200, Sasha Khapyorsky wrote: >> On 09:31 Thu 09 Oct , Al Chu wrote: >> > >> > I just re-read this sentence, now I get what you're asking. I made -- >> > loop_ports only loop if AllPortSelect isn't supported. So if the >> > switches already support AllPortSelect, the number of resets should be >> > the same. >> >> OTOH perfquery is diagnostic tool. Reset with AllPortSelect may not work >> as expected or an user may have another reason to loop over ports >> regardless to AllPortSelect support. >> >> Sasha > -- > Albert Chu > chu11 at llnl.gov > Computer Scientist > High Performance Systems Division > Lawrence Livermore National Laboratory > > From weiny2 at llnl.gov Thu Oct 9 14:59:44 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 9 Oct 2008 14:59:44 -0700 Subject: [ofa-general] ***SPAM*** [PATCH] Add osm_config.h file Message-ID: <20081009145944.53e78835.weiny2@llnl.gov> >From 0b4e9b0b21a039051fc729568104e7e82a249f53 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Thu, 9 Oct 2008 14:44:13 -0700 Subject: [PATCH] Add osm_config.h file The defines in this file are required for plugin and third party tool compatibility. 
Signed-off-by: Ira Weiny --- opensm/configure.in | 7 +++- opensm/include/opensm/osm_config.h.in | 61 ++++++++++++++++++++++++++++++ opensm/include/opensm/osm_event_plugin.h | 1 + opensm/opensm/Makefile.am | 3 +- 4 files changed, 70 insertions(+), 2 deletions(-) create mode 100644 opensm/include/opensm/osm_config.h.in diff --git a/opensm/configure.in b/opensm/configure.in index 7da932b..680e6a0 100644 --- a/opensm/configure.in +++ b/opensm/configure.in @@ -4,12 +4,17 @@ AC_PREREQ(2.57) AC_INIT(opensm, 3.2.2, general at lists.openfabrics.org) AC_CONFIG_SRCDIR([opensm/osm_opensm.c]) AC_CONFIG_AUX_DIR(config) -AC_CONFIG_HEADERS(include/config.h) +AC_CONFIG_HEADERS(include/config.h include/opensm/osm_config.h) AM_INIT_AUTOMAKE AC_SUBST(RELEASE, ${RELEASE:-unknown}) AC_SUBST(TARBALL, ${TARBALL:-${PACKAGE}-${VERSION}.tar.gz}) +dnl NOTE: AC_DEFINE's and AC_DEFINE_UNQUOTED's which are used in header files +dnl MUST have a corresponding entry in include/opensm/osm_config.h.in to +dnl ensure plugin compatibility. +AC_DEFINE(_OSM_CONFIG_H_, 1, mark config.h inclusion) + dnl Defines the Language AC_LANG_C diff --git a/opensm/include/opensm/osm_config.h.in b/opensm/include/opensm/osm_config.h.in new file mode 100644 index 0000000..6781af7 --- /dev/null +++ b/opensm/include/opensm/osm_config.h.in @@ -0,0 +1,61 @@ +/* include/osm_config.h.in + * + * Defines various OpenSM configuration parameters to be used by various + * plugins and third party tools. + * + * NOTE: Defines used in header files MUST be included here to ensure plugin + * compatibility. 
+ */ + +#ifndef _OSM_CONFIG_H_ +#define _OSM_CONFIG_H_ + +/* Define as 1 if you want Dual Sided RMPP Support */ +#undef DUAL_SIDED_RMPP + +/* Define as 1 if you want to enable a console on a socket connection */ +#undef ENABLE_OSM_CONSOLE_SOCKET + +/* Define as 1 if you want to enable the event plugin */ +#undef ENABLE_OSM_DEFAULT_EVENT_PLUGIN + +/* Define as 1 if you want to enable the performance manager */ +#undef ENABLE_OSM_PERF_MGR + +/* Define as 1 if you want to enable the performance manager profiling code */ +#undef ENABLE_OSM_PERF_MGR_PROFILE + +/* Define a default node name map file */ +#undef HAVE_DEFAULT_NODENAME_MAP + +/* Define a default OpenSM config file */ +#undef HAVE_DEFAULT_OPENSM_CONFIG_FILE + +/* Define a Partition config file */ +#undef HAVE_DEFAULT_PARTITION_CONFIG_FILE + +/* Define a Prefix Routes config file */ +#undef HAVE_DEFAULT_PREFIX_ROUTES_FILE + +/* Define a QOS policy config file */ +#undef HAVE_DEFAULT_QOS_POLICY_FILE + +/* Define OpenSM config directory */ +#undef OPENSM_CONFIG_DIR + +/* Define as 1 for vapi vendor */ +#undef OSM_VENDOR_INTF_MTL + +/* Define as 1 for OpenIB vendor */ +#undef OSM_VENDOR_INTF_OPENIB + +/* Define as 1 for sim vendor */ +#undef OSM_VENDOR_INTF_SIM + +/* Define as 1 for ts vendor */ +#undef OSM_VENDOR_INTF_TS + +/* Define as 1 if you want Vendor RMPP Support */ +#undef VENDOR_RMPP_SUPPORT + +#endif /* _OSM_CONFIG_H_ */ diff --git a/opensm/include/opensm/osm_event_plugin.h b/opensm/include/opensm/osm_event_plugin.h index e44a78a..d70dfde 100644 --- a/opensm/include/opensm/osm_event_plugin.h +++ b/opensm/include/opensm/osm_event_plugin.h @@ -37,6 +37,7 @@ #include #include #include +#include #ifdef __cplusplus # define BEGIN_C_DECLS extern "C" { diff --git a/opensm/opensm/Makefile.am b/opensm/opensm/Makefile.am index e95a482..409c147 100644 --- a/opensm/opensm/Makefile.am +++ b/opensm/opensm/Makefile.am @@ -116,7 +116,8 @@ opensminclude_HEADERS = \ $(srcdir)/../include/opensm/osm_switch.h \ 
 $(srcdir)/../include/opensm/osm_ucast_mgr.h \ $(srcdir)/../include/opensm/osm_vl15intf.h \ - $(top_builddir)/include/opensm/osm_version.h + $(top_builddir)/include/opensm/osm_version.h \ + $(top_builddir)/include/opensm/osm_config.h BUILT_SOURCES = osm_version osm_qos_parser_y.h osm_version: -- 1.5.4.5

From mashirle at us.ibm.com Thu Oct 9 15:48:48 2008
From: mashirle at us.ibm.com (Shirley Ma)
Date: Thu, 09 Oct 2008 15:48:48 -0700
Subject: [ofa-general] Re: openSM for supporting IPv6 SNM MGIDs consolidation
In-Reply-To: <20081008032215.GR7563@sashak.voltaire.com>
References: <1223007639.8387.22.camel@IBM-29AB850785D.beaverton.ibm.com> <20081003153900.GC6566@sashak.voltaire.com> <1223049716.8387.24.camel@IBM-29AB850785D.beaverton.ibm.com> <20081003160448.GF6566@sashak.voltaire.com> <1223051358.8387.26.camel@IBM-29AB850785D.beaverton.ibm.com> <1223433892.24201.16.camel@IBM-29AB850785D.beaverton.ibm.com> <20081008032215.GR7563@sashak.voltaire.com>
Message-ID: <1223592528.4355.10.camel@IBM-29AB850785D.beaverton.ibm.com>

Hello Sasha,

On Wed, 2008-10-08 at 05:22 +0200, Sasha Khapyorsky wrote:
> > I didn't install it, I just peeked the source code, it looks like the
> > OFED-1.4 will have one IB MGC with ff10601b...01ff000000, the management
> > packages for July release is ff10601b...01ffXXXXXX (first node IPv6 SNM
> > 24 bit interface id). Is that right?
>
> Yes.

The customer has tested openSM-3.2.2 from the management link website, and it
works well for MGID consolidation.

They also tried OFED-1.4-RC2; the consolidation doesn't work, it still
has all the IPv6 MGID groups, and they didn't see this MGC with
ff10601b...01ff000000. Is there any known bug in the recent patch, or does
any other parameter need special configuration? If this is a bug, do we need
to open a bug for the fix to be in the next RC?

I have checked the other attributes, and they are the same: MTU, partition,
rate ... The tests were run on the same nodes.
Here are the results from two nodes with the OFED-1.4 openSM:

Oct 08 18:20:52 584507 [B11AEF80] 0x07 -> OpenSM 3.2.2
-------------------------------------------------
OpenSM 3.2.2
 Reading Cached Option File: /etc/opensm/opensm.conf
 Loading Cached Option:sweep_interval = 60
 Loading Cached Option:port_prof_ignore_file = (null)
 Loading Cached Option:routing_engine = (null)
 Loading Cached Option:lid_matrix_dump_file = (null)
 Loading Cached Option:root_guid_file = (null)
 Loading Cached Option:cn_guid_file = (null)
 Loading Cached Option:ids_guid_file = (null)
 Loading Cached Option:guid_routing_order_file = (null)
 Loading Cached Option:sa_db_file = (null)
 Loading Cached Option:sm_priority = 15
 Loading Cached Option:daemon = TRUE
 Loading Cached Option:event_plugin_name = (null)
 Loading Cached Option:node_name_map_name = (null)
 Loading Cached Option:consolidate_ipv6_snm_req = TRUE

MCMemberRecord group dump:
    MGID....................ff12:401b:ffff::ffff:ffff
    Mlid....................0xC000
    Mtu.....................0x84
    pkey....................0xFFFF
    Rate....................0x83
    SL......................0x0
MCMemberRecord group dump:
    MGID....................ff12:401b:ffff::1
    Mlid....................0xC001
    Mtu.....................0x84
    pkey....................0xFFFF
    Rate....................0x83
    SL......................0x0
MCMemberRecord group dump:
    MGID....................ff12:601b:ffff::1:ff21:d59d
    Mlid....................0xC002
    Mtu.....................0x84
    pkey....................0xFFFF
    Rate....................0x83
    SL......................0x0
MCMemberRecord group dump:
    MGID....................ff12:601b:ffff::1
    Mlid....................0xC003
    Mtu.....................0x84
    pkey....................0xFFFF
    Rate....................0x83
    SL......................0x0
MCMemberRecord group dump:
    MGID....................ff12:601b:ffff::1:ff23:1651
    Mlid....................0xC004
    Mtu.....................0x84
    pkey....................0xFFFF
    Rate....................0x83
    SL......................0x0

thanks
Shirley
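
[Reading aid for the dump above] The two solicited-node groups (MGIDs ending in ff21:d59d and ff23:1651) are the ones consolidation is meant to merge; the fix posted later in this thread describes the common group as having its lower 24 bits zeroed. The following is only a minimal sketch of that idea, assuming the IPoIB MGID layout from RFC 4391 (0xff, flags/scope, signature 0x601b, P_Key, 80-bit group ID). The type and the demo_* helpers are hypothetical and are not the OpenSM implementation; in particular, keeping the scope and P_Key bytes unchanged is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 16-byte GID in network byte order; hypothetical stand-in for ib_gid_t. */
typedef struct {
	uint8_t raw[16];
} demo_gid_t;

/*
 * Return 1 if the MGID looks like an IPoIB IPv6 solicited-node multicast
 * group (ff1s:601b:<pkey>::1:ffXX:XXXX), 0 otherwise.
 * Assumed layout (RFC 4391); this is not the check OpenSM performs.
 */
static int demo_is_ipv6_snm_mgid(const demo_gid_t *g)
{
	/* bytes 6..12 of the MGID: five zero bytes, then 0x01, then 0xff */
	static const uint8_t group_prefix[7] = { 0, 0, 0, 0, 0, 0x01, 0xff };

	return g->raw[0] == 0xff &&                      /* multicast GID */
	       g->raw[2] == 0x60 && g->raw[3] == 0x1b && /* IPv6 signature */
	       memcmp(&g->raw[6], group_prefix, sizeof(group_prefix)) == 0;
}

/*
 * Zero the low 24 bits of a solicited-node MGID so that all such MGIDs
 * map to one common group. Returns 1 if the MGID was rewritten, else 0.
 */
static int demo_consolidate_snm_mgid(demo_gid_t *g)
{
	assert(g != NULL);
	if (!demo_is_ipv6_snm_mgid(g))
		return 0;
	memset(&g->raw[13], 0, 3);
	return 1;
}
```

Under this sketch, both solicited-node MGIDs from the dump map to ff12:601b:ffff::1:ff00:0, while ff12:601b:ffff::1 and the IPv4 broadcast group ff12:401b:ffff::ffff:ffff are left untouched.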
From richard.frank at oracle.com Thu Oct 9 16:59:37 2008
From: richard.frank at oracle.com (Richard Frank)
Date: Thu, 09 Oct 2008 19:59:37 -0400
Subject: [ofa-general] Is there a tool that will display an IB switch name given it's GUID ?
Message-ID: <48EE9AE9.9010609@oracle.com>

Is there a way to procure a human-understandable identifier for the IB
switches - if a name cannot be procured.

From chu11 at llnl.gov Thu Oct 9 17:09:48 2008
From: chu11 at llnl.gov (Al Chu)
Date: Thu, 09 Oct 2008 17:09:48 -0700
Subject: [ofa-general] Is there a tool that will display an IB switch name given it's GUID ?
In-Reply-To: <48EE9AE9.9010609@oracle.com>
References: <48EE9AE9.9010609@oracle.com>
Message-ID: <1223597388.1197.192.camel@cardanus.llnl.gov>

Hi Richard,

I think the -U option on saquery is what you're looking for.

Al

On Thu, 2008-10-09 at 19:59 -0400, Richard Frank wrote:
> Is there a way to procure a human-understandable identifier for the IB
> switches - if a name cannot be procured.
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>
--
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

From rdreier at cisco.com Thu Oct 9 17:41:32 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 09 Oct 2008 17:41:32 -0700
Subject: [ofa-general] Re: [PATCH] RDMA/nes: Fix slab corruption
In-Reply-To: <200810032043.m93KhL9P013848@velma.neteffect.com> (Chien Tung's message of "Fri, 3 Oct 2008 15:43:21 -0500")
References: <200810032043.m93KhL9P013848@velma.neteffect.com>
Message-ID:

thanks, applied

From kliteyn at dev.mellanox.co.il Thu Oct 9 17:44:09 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Fri, 10 Oct 2008 02:44:09 +0200
Subject: [ofa-general] [PATCH 0/6] opensm: Unicast Routing
Cache In-Reply-To: <20081009164422.GD4912@sashak.voltaire.com> References: <48E96928.8030200@dev.mellanox.co.il> <48EA9ABA.6010509@dev.mellanox.co.il> <20081009164422.GD4912@sashak.voltaire.com> Message-ID: <48EEA559.1090806@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 09:22 Tue 07 Oct , Hal Rosenstock wrote: >>> Actually, I was thinking about something else: >>> Currently we have switch LFT implemented as osm_fwd_tbl_t. >>> I can remove the unnecessary complexity of the osm_fwd_tbl_t by replacing >>> it with a simple uint8_t array (same as LFT buffer). Then by simple >>> comparison I will check whether the recently calculated LFT >>> matches the switch's LFT, and if there is a match, then lft_buf >>> can be freed. In this case only the switches that have LFT different >>> from the recently calculated LFT will have both tables, which would be >>> rare and temporary - on the next heavy sweep the LFTs would match, and >>> lft_buf would be freed. >> Can the forwarding tables be removed ? How would paths be >> calculated/walked end to end on an SA PathRecord/MultiPathRecord query >> ? Would that then require query of the LFTs in the switches ? > > No. As far as I understand whole idea is to keep LFT images in raw buffer > (as they really are) similar to lft_buf instead of osm_fwd_tbl_t. Right, that is what I meant. -- Yevgeny > IMO this simplifies the code in general and makes described optimization > possible. 
> > Sasha > From kliteyn at dev.mellanox.co.il Thu Oct 9 17:45:03 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Fri, 10 Oct 2008 02:45:03 +0200 Subject: [ofa-general] [PATCH 0/6] opensm: Unicast Routing Cache In-Reply-To: <20081009163933.GC4912@sashak.voltaire.com> References: <48E96928.8030200@dev.mellanox.co.il> <48EA9ABA.6010509@dev.mellanox.co.il> <20081009163933.GC4912@sashak.voltaire.com> Message-ID: <48EEA58F.1050300@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > Hi Yevgeny, > > On 01:09 Tue 07 Oct , Yevgeny Kliteynik wrote: >> Actually, I was thinking about something else: >> Currently we have switch LFT implemented as osm_fwd_tbl_t. >> I can remove the unnecessary complexity of the osm_fwd_tbl_t by replacing >> it with a simple uint8_t array (same as LFT buffer). Then by simple >> comparison I will check whether the recently calculated LFT >> matches the switch's LFT, and if there is a match, then lft_buf >> can be freed. In this case only the switches that have LFT different >> from the recently calculated LFT will have both tables, which would be >> rare and temporary - on the next heavy sweep the LFTs would match, and >> lft_buf would be freed. >> Effectively, it won't have memory penalty. >> It can be done in a separate patch. > > Agree about separate patch. And would be really nice to have it in OFED > 1.4 days. > >>> Are you sure all the memory allocation failures are handled properly >>> within the routing cache code ? What I mean is that NULL is returned >>> and does this always result in a caching not used/routing recalculated >>> ? Also, in that case, should some log message be indicated rather than >>> hiding this ? >> I will check it. > > I think Hal is about non-checked malloc() in __cache_add_port() function. Will take care of it - Yevgeny >>> Nit: doc/current-routing.txt should also be updated for this feature. >> OK, separate patch. > > Agree, it is needed too. 
> > Sasha > From rdreier at cisco.com Thu Oct 9 17:48:36 2008 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 09 Oct 2008 17:48:36 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get the first batch of 2.6.28 merges -- pretty much all low-level hardware driver and IP-over-IB fixes, along with a few little things elsewhere, as the dirstat shows: 14.4% drivers/infiniband/hw/ehca/ 22.0% drivers/infiniband/hw/mthca/ 33.1% drivers/infiniband/hw/nes/ 26.9% drivers/infiniband/ulp/ipoib/ 3.4% drivers/infiniband/ Full list of patches below: Alexander Schmidt (1): IB/ehca: Generate flush status CQ entries Bob Sharp (4): RDMA/nes: Free NIC TX buffers when destroying NIC QP RDMA/nes: Enable MC/UC after changing MTU RDMA/nes: Correct MAX TSO frags value RDMA/nes: Fix routed RDMA connections Chien Tung (10): RDMA/nes: Add support for 4-port 1G HP blade card RDMA/nes: Module parameter permissions RDMA/nes: Add wqm_quanta module option RDMA/nes: Fix MDC setting RDMA/nes: Fill in firmware version for ethtool RDMA/nes: Correct tso_wqe_length RDMA/nes: Stop spurious MAC interrupts RDMA/nes: Limit critical error interrupts RDMA/nes: Correct error_module bit mask RDMA/nes: Fix slab corruption Faisal Latif (2): RDMA/nes: Make mini_cm_connect() static RDMA/nes: Handle AE bounds violation Hefty, Sean (1): IB/cm: Correctly free cm_device structure John Lacombe (1): RDMA/nes: Use ethtool timer value Jon Mason (1): RDMA/cxgb3: Set active_mtu in ib_port_attr Julia Lawall (1): IB: Drop code after return statement Michael Brooks (1): IB/mad: Don't discard BMA responses in kernel Ralph Campbell (1): IB/ipath: Fix SLID generation for RC/UC QPs when LMC > 0 Roland Dreier (4): IPoIB: Fix crash when path record fails after 
path flush IB/mthca: Use pci_request_regions() IPoIB: Use netif_tx_lock() and get rid of private tx_lock, LLTX Merge branches 'cma', 'cxgb3', 'ehca', 'ipath', 'ipoib', 'mad', 'misc', 'mlx4', 'mthca' and 'nes' into for-next Vadim Makhervaks (1): RDMA/nes: Enhanced PFT management scheme Vladimir Sokolovsky (1): IB/mlx4: Set RLKEY bit for kernel QPs Yannick Cote (1): IB/ipath: Fix hang on module unload drivers/infiniband/core/cm.c | 2 + drivers/infiniband/core/mad.c | 5 +- drivers/infiniband/hw/amso1100/c2_provider.c | 1 - drivers/infiniband/hw/cxgb3/iwch_provider.c | 9 +- drivers/infiniband/hw/ehca/ehca_classes.h | 14 ++- drivers/infiniband/hw/ehca/ehca_cq.c | 3 + drivers/infiniband/hw/ehca/ehca_iverbs.h | 2 + drivers/infiniband/hw/ehca/ehca_qp.c | 225 ++++++++++++++++++++++-- drivers/infiniband/hw/ehca/ehca_reqs.c | 211 +++++++++++++++++++--- drivers/infiniband/hw/ipath/ipath_rc.c | 3 +- drivers/infiniband/hw/ipath/ipath_ruc.c | 3 +- drivers/infiniband/hw/ipath/ipath_verbs.c | 7 + drivers/infiniband/hw/mlx4/qp.c | 3 + drivers/infiniband/hw/mthca/mthca_catas.c | 15 +-- drivers/infiniband/hw/mthca/mthca_eq.c | 51 +----- drivers/infiniband/hw/mthca/mthca_main.c | 59 +------ drivers/infiniband/hw/nes/nes.c | 95 +++++++++- drivers/infiniband/hw/nes/nes.h | 2 +- drivers/infiniband/hw/nes/nes_cm.c | 41 ++++- drivers/infiniband/hw/nes/nes_hw.c | 205 ++++++++++++++++++---- drivers/infiniband/hw/nes/nes_hw.h | 6 + drivers/infiniband/hw/nes/nes_nic.c | 122 +++++++++++--- drivers/infiniband/hw/nes/nes_verbs.c | 3 - drivers/infiniband/ulp/ipoib/ipoib.h | 8 +- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 88 ++++++---- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 30 +++- drivers/infiniband/ulp/ipoib/ipoib_main.c | 76 ++++----- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 31 ++-- 28 files changed, 961 insertions(+), 359 deletions(-) From kliteyn at dev.mellanox.co.il Thu Oct 9 18:07:04 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Fri, 10 Oct 2008 03:07:04 +0200 
Subject: [ofa-general] Re: [PATCH] opensm/ucast_cache: remove redundant osm_physp_is_valid() check In-Reply-To: <20081009183402.GH4912@sashak.voltaire.com> References: <48E969A6.1000607@dev.mellanox.co.il> <20081009183402.GH4912@sashak.voltaire.com> Message-ID: <48EEAAB8.5090608@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > osm_node_get_physp_ptr() return non-NULL values only for valid physical > ports - remove redundant checks. Looks fine, thanks. -- Yevgeny > Signed-off-by: Sasha Khapyorsky > --- > opensm/opensm/osm_ucast_cache.c | 10 ++++------ > 1 files changed, 4 insertions(+), 6 deletions(-) > > diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c > index 355716a..3c32d35 100644 > --- a/opensm/opensm/osm_ucast_cache.c > +++ b/opensm/opensm/osm_ucast_cache.c > @@ -963,12 +963,10 @@ osm_ucast_cache_add_link(osm_ucast_cache_t * p_cache, > goto Exit; > } > > - if (((osm_node_get_type(p_node_1) == IB_NODE_TYPE_SWITCH) && > - (!osm_node_get_physp_ptr(p_node_1, 0) || > - !osm_physp_is_valid(osm_node_get_physp_ptr(p_node_1, 0)))) || > - ((osm_node_get_type(p_node_2) == IB_NODE_TYPE_SWITCH) && > - (!osm_node_get_physp_ptr(p_node_2, 0) || > - !osm_physp_is_valid(osm_node_get_physp_ptr(p_node_2, 0))))) { > + if ((osm_node_get_type(p_node_1) == IB_NODE_TYPE_SWITCH && > + !osm_node_get_physp_ptr(p_node_1, 0)) || > + (osm_node_get_type(p_node_2) == IB_NODE_TYPE_SWITCH && > + !osm_node_get_physp_ptr(p_node_2, 0))) { > /* we're caching a link when one of the nodes > has already been dropped and cached */ > OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE, From kliteyn at dev.mellanox.co.il Thu Oct 9 18:08:49 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Fri, 10 Oct 2008 03:08:49 +0200 Subject: [ofa-general] Re: [PATCH] opensm/ucast_cache: rename struct _osm_ucast_cache to osm_ucast_cache In-Reply-To: <20081009183508.GI4912@sashak.voltaire.com> References: <48E969A6.1000607@dev.mellanox.co.il> 
<20081009183508.GI4912@sashak.voltaire.com> Message-ID: <48EEAB21.2030408@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > To be consistent with other names rename struct _osm_ucast_cache to > osm_ucast_cache. Sure, looks fine. Thanks. -- Yevgeny > Signed-off-by: Sasha Khapyorsky > --- > opensm/include/opensm/osm_ucast_cache.h | 2 +- > opensm/include/opensm/osm_ucast_mgr.h | 4 ++-- > 2 files changed, 3 insertions(+), 3 deletions(-) > > diff --git a/opensm/include/opensm/osm_ucast_cache.h b/opensm/include/opensm/osm_ucast_cache.h > index 7f01876..caffc83 100644 > --- a/opensm/include/opensm/osm_ucast_cache.h > +++ b/opensm/include/opensm/osm_ucast_cache.h > @@ -90,7 +90,7 @@ struct osm_ucast_mgr; > * > * SYNOPSIS > */ > -typedef struct _osm_ucast_cache { > +typedef struct osm_ucast_cache { > cl_qmap_t sw_tbl; > boolean_t valid; > struct osm_ucast_mgr * p_ucast_mgr; > diff --git a/opensm/include/opensm/osm_ucast_mgr.h b/opensm/include/opensm/osm_ucast_mgr.h > index e4006bb..5575d20 100644 > --- a/opensm/include/opensm/osm_ucast_mgr.h > +++ b/opensm/include/opensm/osm_ucast_mgr.h > @@ -78,7 +78,7 @@ BEGIN_C_DECLS > * > *********/ > struct osm_sm; > -struct _osm_ucast_cache; > +struct osm_ucast_cache; > /****s* OpenSM: Unicast Manager/osm_ucast_mgr_t > * NAME > * osm_ucast_mgr_t > @@ -99,7 +99,7 @@ typedef struct osm_ucast_mgr { > cl_qlist_t port_order_list; > boolean_t is_dor; > boolean_t some_hop_count_set; > - struct _osm_ucast_cache *p_cache; > + struct osm_ucast_cache *p_cache; > } osm_ucast_mgr_t; > /* > * FIELDS From kliteyn at dev.mellanox.co.il Thu Oct 9 18:10:11 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Fri, 10 Oct 2008 03:10:11 +0200 Subject: [ofa-general] Re: [PATCH] opensm/ucast_cache: loop over switches table to scan switches In-Reply-To: <20081009183635.GK4912@sashak.voltaire.com> References: <48E969A6.1000607@dev.mellanox.co.il> <20081009183635.GK4912@sashak.voltaire.com> Message-ID: <48EEAB73.4010209@dev.mellanox.co.il> 
Hi Sasha, Sasha Khapyorsky wrote: > Cache validator scans all switches by looping over all fabric nodes > table. This patch improves this slightly by using a fabric switches > table instead. Right, it does improve the loop slightly. Thanks. Patch looks fine. -- Yevgeny > Signed-off-by: Sasha Khapyorsky > --- > opensm/opensm/osm_ucast_cache.c | 16 ++++++---------- > 1 files changed, 6 insertions(+), 10 deletions(-) > > diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c > index e5264cd..df8b7d2 100644 > --- a/opensm/opensm/osm_ucast_cache.c > +++ b/opensm/opensm/osm_ucast_cache.c > @@ -525,7 +525,7 @@ static void ucast_cache_validate(osm_ucast_cache_t * p_cache) > osm_physp_t * p_physp; > osm_physp_t * p_remote_physp; > osm_port_t * p_remote_port; > - cl_qmap_t * p_node_guid_tbl; > + cl_qmap_t * p_sw_tbl; > > OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); > if (!p_cache->valid) > @@ -537,18 +537,14 @@ static void ucast_cache_validate(osm_ucast_cache_t * p_cache) > * it's just some node/port reset or a cached topology > * change. Otherwise the cache is invalid. 
> */ > - p_node_guid_tbl = &p_cache->p_ucast_mgr->p_subn->node_guid_tbl; > - for (p_node = (osm_node_t *) cl_qmap_head(p_node_guid_tbl); > - p_node != (osm_node_t *) cl_qmap_end(p_node_guid_tbl); > - p_node = (osm_node_t *) cl_qmap_next(&p_node->map_item)) { > - > - if (osm_node_get_type(p_node) != IB_NODE_TYPE_SWITCH) > - continue; > + p_sw_tbl = &p_cache->p_ucast_mgr->p_subn->sw_guid_tbl; > + for (p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl); > + p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl); > + p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item)) { > > + p_node = p_sw->p_node; > lid_ho = cl_ntoh16(osm_node_get_base_lid(p_node,0)); > p_cache_sw = __cache_get_sw(p_cache, lid_ho); > - > - p_sw = p_node->sw; > max_ports = osm_node_get_num_physp(p_node); > > /* skip port 0 */ From kliteyn at dev.mellanox.co.il Thu Oct 9 18:19:21 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Fri, 10 Oct 2008 03:19:21 +0200 Subject: [ofa-general] Re: [PATCH 3/6] opensm/Unicast Routing Cache: add osm_ucast_cache.{c, h} files In-Reply-To: <20081009183547.GJ4912@sashak.voltaire.com> References: <48E969A6.1000607@dev.mellanox.co.il> <20081009183547.GJ4912@sashak.voltaire.com> Message-ID: <48EEAD99.8050701@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: >>From 8f361ce2159c8a37a68a472abd495208b02a7279 Mon Sep 17 00:00:00 2001 > From: Sasha Khapyorsky > Date: Wed, 8 Oct 2008 15:42:40 +0200 > Subject: [PATCH] opensm/ucast_cache: simplify osm_ucast_cache_add_link() prototype > > Simplify osm_ucast_cache_add_link() prototype - instead of pair of node, > port number parameter list get just two pointers to physical ports. Good idea. Patch is fine, thanks. 
-- Yevgeny > Signed-off-by: Sasha Khapyorsky > --- > opensm/include/opensm/osm_ucast_cache.h | 19 +++--------- > opensm/opensm/osm_drop_mgr.c | 5 +-- > opensm/opensm/osm_port_info_rcv.c | 5 +-- > opensm/opensm/osm_ucast_cache.c | 49 +++++++++++++----------------- > 4 files changed, 29 insertions(+), 49 deletions(-) > > diff --git a/opensm/include/opensm/osm_ucast_cache.h b/opensm/include/opensm/osm_ucast_cache.h > index caffc83..3ecd2d8 100644 > --- a/opensm/include/opensm/osm_ucast_cache.h > +++ b/opensm/include/opensm/osm_ucast_cache.h > @@ -282,26 +282,17 @@ osm_ucast_cache_check_new_link(osm_ucast_cache_t * p_cache, > */ > void > osm_ucast_cache_add_link(osm_ucast_cache_t * p_cache, > - osm_node_t * p_node_1, > - uint8_t port_num_1, > - osm_node_t * p_node_2, > - uint8_t port_num_2); > + osm_physp_t *physp1, osm_physp_t *physp2); > /* > * PARAMETERS > * p_cache > * [in] Pointer to the cache object. > * > -* p_node_1 > -* [in] Pointer to the first node of the link. > -* > -* port_num_1 > -* [in] Port number on the first node of the link. > +* physp1 > +* [in] Pointer to the first physical port of the link. > * > -* p_node_2 > -* [in] Pointer to the second node of the link. > -* > -* port_num_2 > -* [in] Port number on the second node of the link. > +* physp2 > +* [in] Pointer to the second physical port of the link. > * > * RETURN VALUE > * This function does not return any value. 
> diff --git a/opensm/opensm/osm_drop_mgr.c b/opensm/opensm/osm_drop_mgr.c > index 5fc46a9..eceb9a6 100644 > --- a/opensm/opensm/osm_drop_mgr.c > +++ b/opensm/opensm/osm_drop_mgr.c > @@ -137,10 +137,7 @@ static void drop_mgr_clean_physp(osm_sm_t * sm, IN osm_physp_t * p_physp) > > if (sm->p_subn->opt.use_ucast_cache) > osm_ucast_cache_add_link(sm->ucast_mgr.p_cache, > - p_physp->p_node, > - p_physp->port_num, > - p_remote_physp->p_node, > - p_remote_physp->port_num); > + p_physp, p_remote_physp); > > osm_physp_unlink(p_physp, p_remote_physp); > > diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c > index d8d2021..8004fa8 100644 > --- a/opensm/opensm/osm_port_info_rcv.c > +++ b/opensm/opensm/osm_port_info_rcv.c > @@ -247,9 +247,8 @@ __osm_pi_rcv_process_switch_port(IN osm_sm_t * sm, > > if (sm->p_subn->opt.use_ucast_cache) > osm_ucast_cache_add_link(sm->ucast_mgr.p_cache, > - p_node, port_num, > - p_remote_node, > - remote_port_num); > + p_physp, > + p_remote_physp); > > osm_node_unlink(p_node, (uint8_t) port_num, > p_remote_node, > diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c > index 57dc0a0..e5264cd 100644 > --- a/opensm/opensm/osm_ucast_cache.c > +++ b/opensm/opensm/osm_ucast_cache.c > @@ -929,15 +929,11 @@ Exit: > /********************************************************************** > **********************************************************************/ > > -void > -osm_ucast_cache_add_link(osm_ucast_cache_t * p_cache, > - osm_node_t * p_node_1, > - uint8_t port_num_1, > - osm_node_t * p_node_2, > - uint8_t port_num_2) > +void osm_ucast_cache_add_link(osm_ucast_cache_t * p_cache, > + osm_physp_t *physp1, osm_physp_t *physp2) > { > - uint16_t lid_ho_1; > - uint16_t lid_ho_2; > + osm_node_t *p_node_1 = physp1->p_node, *p_node_2 = physp2->p_node; > + uint16_t lid_ho_1, lid_ho_2; > > OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); > > @@ -961,7 +957,7 @@ osm_ucast_cache_add_link(osm_ucast_cache_t * 
p_cache, > OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE, > "Port %u <-> port %u: port0 on one of the nodes" > "has already been dropped and cached\n", > - port_num_1, port_num_2); > + physp1->port_num, physp2->port_num); > goto Exit; > } > > @@ -969,12 +965,11 @@ osm_ucast_cache_add_link(osm_ucast_cache_t * p_cache, > simplicity, make sure that it's the first node. */ > > if (osm_node_get_type(p_node_1) != IB_NODE_TYPE_SWITCH) { > - osm_node_t * tmp_node = p_node_1; > - uint8_t tmp_port_num = port_num_1; > - p_node_1 = p_node_2; > - port_num_1 = port_num_2; > - p_node_2 = tmp_node; > - port_num_2 = tmp_port_num; > + osm_physp_t *tmp = physp1; > + physp1 = physp2; > + physp2 = tmp; > + p_node_1 = physp1->p_node; > + p_node_2 = physp2->p_node; > } > > if (!p_node_1->sw) { > @@ -983,7 +978,7 @@ osm_ucast_cache_add_link(osm_ucast_cache_t * p_cache, > goto Exit; > } > > - lid_ho_1 = cl_ntoh16(osm_node_get_base_lid(p_node_1,0)); > + lid_ho_1 = cl_ntoh16(osm_node_get_base_lid(p_node_1, 0)); > > if (osm_node_get_type(p_node_2) == IB_NODE_TYPE_SWITCH) { > > @@ -993,20 +988,19 @@ osm_ucast_cache_add_link(osm_ucast_cache_t * p_cache, > goto Exit; > } > > - lid_ho_2 = cl_ntoh16(osm_node_get_base_lid(p_node_2,0)); > + lid_ho_2 = cl_ntoh16(osm_node_get_base_lid(p_node_2, 0)); > > /* lost switch-2-switch link - cache both sides */ > - __cache_add_port(p_cache, lid_ho_1, port_num_1, > + __cache_add_port(p_cache, lid_ho_1, physp1->port_num, > lid_ho_2, FALSE); > - __cache_add_port(p_cache, lid_ho_2, port_num_2, > + __cache_add_port(p_cache, lid_ho_2, physp2->port_num, > lid_ho_1, FALSE); > } > else { > - lid_ho_2 = cl_ntoh16( > - osm_node_get_base_lid(p_node_2, port_num_2)); > + lid_ho_2 = cl_ntoh16(osm_physp_get_base_lid(physp2)); > > /* lost link to CA/RTR - cache only switch side */ > - __cache_add_port(p_cache, lid_ho_1, port_num_1, > + __cache_add_port(p_cache, lid_ho_1, physp1->port_num, > lid_ho_2, TRUE); > } > > @@ -1059,9 +1053,8 @@ 
osm_ucast_cache_add_node(osm_ucast_cache_t * p_cache, > !p_physp->p_remote_physp->p_node) > continue; > > - osm_ucast_cache_add_link(p_cache, p_node, port_num, > - p_physp->p_remote_physp->p_node, > - p_physp->p_remote_physp->port_num); > + osm_ucast_cache_add_link(p_cache, p_physp, > + p_physp->p_remote_physp); > } > > /* > @@ -1128,9 +1121,9 @@ osm_ucast_cache_add_node(osm_ucast_cache_t * p_cache, > goto Exit; > } > > - osm_ucast_cache_add_link(p_cache, p_remote_node, > - p_physp->p_remote_physp->port_num, > - p_node, port_num); > + osm_ucast_cache_add_link(p_cache, > + p_physp->p_remote_physp, > + p_physp); > } > } > Exit: From hal.rosenstock at gmail.com Thu Oct 9 18:28:40 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 9 Oct 2008 21:28:40 -0400 Subject: [ofa-general] Is there a tool that will display an IB switch name given it's GUID ? In-Reply-To: <48EE9AE9.9010609@oracle.com> References: <48EE9AE9.9010609@oracle.com> Message-ID: On Thu, Oct 9, 2008 at 7:59 PM, Richard Frank wrote: > Is there a way to procure a human-understandable identifier for the IB > switches - if a name cannot be procured. Are you asking about the case where the switches NodeDesc is useful or not useful ? 
-- Hal > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From kliteyn at dev.mellanox.co.il Thu Oct 9 18:29:50 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Fri, 10 Oct 2008 03:29:50 +0200 Subject: [ofa-general] Re: [PATCH 0/6] opensm: Unicast Routing Cache In-Reply-To: <20081009171103.GF4912@sashak.voltaire.com> References: <48E96928.8030200@dev.mellanox.co.il> <20081009171103.GF4912@sashak.voltaire.com> Message-ID: <48EEB00E.7000209@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > Hi Evgeny, > > On 03:26 Mon 06 Oct , Yevgeny Kliteynik wrote: >> The patches are: >> - patch 1/6: move lft_buf from ucast_mgr to osm_switch >> - patch 2/6: Add "-A" or "--ucast_cache" option to opensm >> - patch 3/6: adding osm_ucast_cache.{c,h} files (this is >> the cache implementation itself) >> - patch 4/6: adding new cache files to makefile >> - patch 5/6: integrating unicast cache into the discovery >> and ucast manager >> - patch 6/6: man entry for cached routing > > Pathes 1,2,4,6 look fine for me. Will comment on others. Thanks for the review and the patches. Didn't manage to address all your comments yet - will do it tomorrow. One question though: how to deal with the incremental patches that you sent me? Should I apply them to my branch and then issue one V2 patch instead of the old one, or will you apply the original patch, followed by all the incremental (yours and mine)? -- Yevgeny > Sasha > From richard.frank at oracle.com Thu Oct 9 18:52:45 2008 From: richard.frank at oracle.com (Richard Frank) Date: Thu, 09 Oct 2008 21:52:45 -0400 Subject: [ofa-general] Is there a tool that will display an IB switch name given it's GUID ? 
In-Reply-To: <48EE9AE9.9010609@oracle.com> References: <48EE9AE9.9010609@oracle.com> Message-ID: <48EEB56D.3040100@oracle.com> I'm sorry I meant to say NetApp - I was thinking about Sun and IBOE... sorry.. Richard Frank wrote: > Is there a way to procure a human-understandable identifier for the IB > switches - if a name cannot be procured. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From richard.frank at oracle.com Thu Oct 9 18:55:09 2008 From: richard.frank at oracle.com (Richard Frank) Date: Thu, 09 Oct 2008 21:55:09 -0400 Subject: [ofa-general] Is there a tool that will display an IB switch name given it's GUID ? In-Reply-To: <48EEB56D.3040100@oracle.com> References: <48EE9AE9.9010609@oracle.com> <48EEB56D.3040100@oracle.com> Message-ID: <48EEB5FD.9080203@oracle.com> Woops - ignore that last comment... Richard Frank wrote: > I'm sorry I meant to say NetApp - I was thinking about Sun and IBOE... > > sorry.. > > > Richard Frank wrote: >> Is there a way to procure a human-understandable identifier for the >> IB switches - if a name cannot be procured. 
>> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Fri Oct 10 01:10:21 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 10 Oct 2008 10:10:21 +0200 Subject: [ofa-general] Re: openSM for supporting IPv6 SNM MGIDs consolidation In-Reply-To: <1223592528.4355.10.camel@IBM-29AB850785D.beaverton.ibm.com> References: <1223007639.8387.22.camel@IBM-29AB850785D.beaverton.ibm.com> <20081003153900.GC6566@sashak.voltaire.com> <1223049716.8387.24.camel@IBM-29AB850785D.beaverton.ibm.com> <20081003160448.GF6566@sashak.voltaire.com> <1223051358.8387.26.camel@IBM-29AB850785D.beaverton.ibm.com> <1223433892.24201.16.camel@IBM-29AB850785D.beaverton.ibm.com> <20081008032215.GR7563@sashak.voltaire.com> <1223592528.4355.10.camel@IBM-29AB850785D.beaverton.ibm.com> Message-ID: <20081010081021.GV4912@sashak.voltaire.com> Hi Shirley, On 15:48 Thu 09 Oct , Shirley Ma wrote: > > They also tried OFED-1.4-RC2, the consolidation doesn't work, it still > has all IPv6 MGLID groups, and they didn't see this MGC with > ff10601b...01ff000000. Is there any known bug in the recent patch or any > other parameter needs special configured? Now it is known :) . Thanks for reporting. 
And there is a fix: diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c index aef6a3d..fe0e320 100644 --- a/opensm/opensm/osm_sa_mcmember_record.c +++ b/opensm/opensm/osm_sa_mcmember_record.c @@ -992,13 +992,10 @@ static unsigned match_and_update_ipv6_snm_mgid(ib_gid_t *mgid) osm_mgrp_t *osm_get_mgrp_by_mgid(IN osm_sa_t *sa, IN ib_gid_t *p_mgid) { - ib_gid_t mgid; int i; - memcpy(&mgid, p_mgid, sizeof(mgid)); - if (sa->p_subn->opt.consolidate_ipv6_snm_req && - match_and_update_ipv6_snm_mgid(&mgid)) { + match_and_update_ipv6_snm_mgid(p_mgid)) { char gid_str[INET6_ADDRSTRLEN]; OSM_LOG(sa->p_log, OSM_LOG_DEBUG, "Special Case Solicited Node Mcast Join for MGID %s\n", @@ -1009,7 +1006,7 @@ osm_mgrp_t *osm_get_mgrp_by_mgid(IN osm_sa_t *sa, IN ib_gid_t *p_mgid) for (i = 0; i <= sa->p_subn->max_mcast_lid_ho - IB_LID_MCAST_START_HO; i++) if (sa->p_subn->mgroups[i] && - match_mgrp_by_mgid(sa->p_subn->mgroups[i], &mgid)) + match_mgrp_by_mgid(sa->p_subn->mgroups[i], p_mgid)) return sa->p_subn->mgroups[i]; return NULL; The behavior is slightly changed - now IPv6 SNM common MC group will have MGID ff10:601b::1:ff00:0 (three zeroed bytes at end). > If this is a bug, do we need > to open a bug for the fix to be in next RC? It is up to you. I will commit this soon anyway and send the patch to the list. Sasha From sashak at voltaire.com Fri Oct 10 01:20:55 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 10 Oct 2008 10:20:55 +0200 Subject: [ofa-general] [PATCH] opensm: fix broken IPv6 SNM consolidation code Message-ID: <20081010082055.GW4912@sashak.voltaire.com> I broke the original IPv6 SNM consolidation in one of the previous commits. There is a fix. The behavior is slightly changed now (due to performance reason). The common IPv6 SNM MC group will have ff10:601b::1:ff00:0 MGID - lower 24 bits are zeroed. 
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_sa_mcmember_record.c | 7 ++----- 1 files changed, 2 insertions(+), 5 deletions(-) diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c index aef6a3d..fe0e320 100644 --- a/opensm/opensm/osm_sa_mcmember_record.c +++ b/opensm/opensm/osm_sa_mcmember_record.c @@ -992,13 +992,10 @@ static unsigned match_and_update_ipv6_snm_mgid(ib_gid_t *mgid) osm_mgrp_t *osm_get_mgrp_by_mgid(IN osm_sa_t *sa, IN ib_gid_t *p_mgid) { - ib_gid_t mgid; int i; - memcpy(&mgid, p_mgid, sizeof(mgid)); - if (sa->p_subn->opt.consolidate_ipv6_snm_req && - match_and_update_ipv6_snm_mgid(&mgid)) { + match_and_update_ipv6_snm_mgid(p_mgid)) { char gid_str[INET6_ADDRSTRLEN]; OSM_LOG(sa->p_log, OSM_LOG_DEBUG, "Special Case Solicited Node Mcast Join for MGID %s\n", @@ -1009,7 +1006,7 @@ osm_mgrp_t *osm_get_mgrp_by_mgid(IN osm_sa_t *sa, IN ib_gid_t *p_mgid) for (i = 0; i <= sa->p_subn->max_mcast_lid_ho - IB_LID_MCAST_START_HO; i++) if (sa->p_subn->mgroups[i] && - match_mgrp_by_mgid(sa->p_subn->mgroups[i], &mgid)) + match_mgrp_by_mgid(sa->p_subn->mgroups[i], p_mgid)) return sa->p_subn->mgroups[i]; return NULL; -- 1.6.0.1.196.g01914 From sashak at voltaire.com Fri Oct 10 01:28:27 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 10 Oct 2008 10:28:27 +0200 Subject: [ofa-general] Re: [PATCH 0/6] opensm: Unicast Routing Cache In-Reply-To: <48EEB00E.7000209@dev.mellanox.co.il> References: <48E96928.8030200@dev.mellanox.co.il> <20081009171103.GF4912@sashak.voltaire.com> <48EEB00E.7000209@dev.mellanox.co.il> Message-ID: <20081010082827.GX4912@sashak.voltaire.com> Hi Yevgeny, On 03:29 Fri 10 Oct , Yevgeny Kliteynik wrote: > > Thanks for the review and the patches. Didn't manage to address > all your comments yet - will do it tomorrow. > One question though: how to deal with the incremental patches that > you sent me? 
Should I apply them to my branch and then issue one > V2 patch instead of the old one, or will you apply the original > patch, followed by all the incremental (yours and mine)? It is up to you. You can merge all in single V2 (guess it is simpler) or leave it unchanged and I will apply later. Except integration patch others are not critical IMHO. Sasha From sashak at voltaire.com Fri Oct 10 01:38:01 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 10 Oct 2008 10:38:01 +0200 Subject: [ofa-general] Re: [PATCH] Add osm_config.h file In-Reply-To: <20081009145944.53e78835.weiny2@llnl.gov> References: <20081009145944.53e78835.weiny2@llnl.gov> Message-ID: <20081010083801.GY4912@sashak.voltaire.com> On 14:59 Thu 09 Oct , Ira Weiny wrote: > From 0b4e9b0b21a039051fc729568104e7e82a249f53 Mon Sep 17 00:00:00 2001 > From: Ira Weiny > Date: Thu, 9 Oct 2008 14:44:13 -0700 > Subject: [PATCH] Add osm_config.h file > > The defines in this file are required for plugin and third party tool > compatibility. > > Signed-off-by: Ira Weiny Applied. Thanks. 
Sasha From vlad at lists.openfabrics.org Fri Oct 10 03:20:19 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 10 Oct 2008 03:20:19 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20081010-0200 daily build status Message-ID: <20081010102019.3192EE609D0@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with 
linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: Build failed on ppc64 with linux-2.6.24 Log: /home/vlad/tmp/ofa_1_4_kernel-20081010-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c: In function 'ehca_poll_eqs': /home/vlad/tmp/ofa_1_4_kernel-20081010-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:942: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type /home/vlad/tmp/ofa_1_4_kernel-20081010-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:946: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081010-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081010-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081010-0200_linux-2.6.24_ppc64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081010-0200_linux-2.6.24_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.24' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- From sashak at voltaire.com Fri Oct 10 07:33:08 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 10 Oct 2008 16:33:08 +0200 Subject: [ofa-general] [PATCH] ibutils/ibis: prevent buffer overflows Message-ID: <20081010143308.GA6947@sashak.voltaire.com> There are a couple of one-byte buffer overflows in ibis*_wrap.c* files. Guess those files were generated originally, but I didn't find from where stuff like obj->log_file[1024] = '\0' is coming. So fixing in place.
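The off-by-one is easy to see in isolation. A minimal sketch, using a hypothetical struct and setter name (only the 1024-byte log_file size is taken from the patch): for an array of N bytes the last valid index is N - 1, so that is where the terminator belongs after a strncpy() of at most N - 1 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOG_FILE_SIZE 1024	/* same size as char log_file[1024] in the patch */

/* Hypothetical stand-in for the option struct touched by the patch. */
struct demo_opt {
	char log_file[LOG_FILE_SIZE];
};

/* Hypothetical setter showing the corrected termination. */
void demo_set_log_file(struct demo_opt *obj, const char *val)
{
	assert(obj != NULL && val != NULL);

	/* Copy at most LOG_FILE_SIZE - 1 bytes. strncpy() does not
	 * terminate the destination when the source is that long or longer. */
	strncpy(obj->log_file, val, LOG_FILE_SIZE - 1);

	/* The last valid index is LOG_FILE_SIZE - 1. Writing to
	 * log_file[LOG_FILE_SIZE] would land one byte past the array. */
	obj->log_file[LOG_FILE_SIZE - 1] = '\0';
}
```

With an over-long input the stored string is truncated to 1023 characters plus the terminator, and nothing outside the array is written.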
Signed-off-by: Sasha Khapyorsky --- ibis/src/ibis_wrap.c | 4 ++-- ibis/src/ibissh_wrap.cpp | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/ibis/src/ibis_wrap.c b/ibis/src/ibis_wrap.c index 70bc3b2..85e72d5 100644 --- a/ibis/src/ibis_wrap.c +++ b/ibis/src/ibis_wrap.c @@ -44884,7 +44884,7 @@ static int TclsmVlArbTableCmd(ClientData clientData, Tcl_Interp *interp, int obj static ibsm_node_desc_str_t * _ibsm_node_desc_description_set(smNodeDesc *obj, ibsm_node_desc_str_t val[IB_NODE_DESCRIPTION_SIZE]) { { strncpy((char *)obj->description,(char *)val,IB_NODE_DESCRIPTION_SIZE - 1); - obj->description[IB_NODE_DESCRIPTION_SIZE] = '\0'; + obj->description[IB_NODE_DESCRIPTION_SIZE - 1] = '\0'; } return (ibsm_node_desc_str_t *) val; } @@ -72782,7 +72782,7 @@ static int _wrap_ibis_opt_t_log_flags_get(ClientData clientData, Tcl_Interp *int static char * _ibis_opt_log_file_set(ibis_opt_t *obj, char val[1024]) { { strncpy(obj->log_file,val,1024 - 1); - obj->log_file[1024] = '\0'; + obj->log_file[1023] = '\0'; } return (char *) val; } diff --git a/ibis/src/ibissh_wrap.cpp b/ibis/src/ibissh_wrap.cpp index a794cc4..ece7c9c 100644 --- a/ibis/src/ibissh_wrap.cpp +++ b/ibis/src/ibissh_wrap.cpp @@ -44946,7 +44946,7 @@ static int TclsmVlArbTableCmd(ClientData clientData, Tcl_Interp *interp, int obj static ibsm_node_desc_str_t * _ibsm_node_desc_description_set(smNodeDesc *obj, ibsm_node_desc_str_t val[IB_NODE_DESCRIPTION_SIZE]) { { strncpy((char *)obj->description,(char *)val,IB_NODE_DESCRIPTION_SIZE - 1); - obj->description[IB_NODE_DESCRIPTION_SIZE] = '\0'; + obj->description[IB_NODE_DESCRIPTION_SIZE - 1] = '\0'; } return (ibsm_node_desc_str_t *) val; } @@ -72844,7 +72844,7 @@ static int _wrap_ibis_opt_t_log_flags_get(ClientData clientData, Tcl_Interp *int static char * _ibis_opt_log_file_set(ibis_opt_t *obj, char val[1024]) { { strncpy(obj->log_file,val,1024 - 1); - obj->log_file[1024] = '\0'; + obj->log_file[1023] = '\0'; } return (char *) val; } -- 
1.6.0.1.196.g01914 From sashak at voltaire.com Fri Oct 10 07:34:32 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 10 Oct 2008 16:34:32 +0200 Subject: [ofa-general] [PATCH] ibutils/ibis: link ibis dynamically Message-ID: <20081010143432.GB6947@sashak.voltaire.com> Otherwise when running against ibsim with libumad2sim.so preloaded it has two instance (static and dynamic) of libibumad with different internal initializations, etc. Signed-off-by: Sasha Khapyorsky --- ibis/src/Makefile.am | 5 ++--- 1 files changed, 2 insertions(+), 3 deletions(-) diff --git a/ibis/src/Makefile.am b/ibis/src/Makefile.am index e0b512f..4c6621c 100644 --- a/ibis/src/Makefile.am +++ b/ibis/src/Makefile.am @@ -74,8 +74,6 @@ LDADD = $(OSM_LDFLAGS) ibis_SOURCES = ibissh_wrap.cpp -ibis_LDFLAGS = -static -# note the order of the libraries does matter as we static link ibis_LDADD = -libiscom $(OSM_LDFLAGS) $(TCL_LIBS) @@ -153,7 +151,8 @@ EXTRA_DIST = swig_extended_obj.c fixSwigWrapper pkgIndex.tcl \ git_version.h # we do not want the temporary and the archive libs installed: -install-libLTLIBRARIES: +#install-libLTLIBRARIES: +# then objects should be linked into program and no library is created at all # this actually over write the lib install install-exec-am: install-binPROGRAMS -- 1.6.0.1.196.g01914 From sashak at voltaire.com Fri Oct 10 08:07:06 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 10 Oct 2008 17:07:06 +0200 Subject: [ofa-general] [PATCH] ibutils: add missing header file with std::sort() prototype Message-ID: <20081010150706.GC6947@sashak.voltaire.com> Function prototypes are mandatory with C++. The compilation fails with gcc-4.3.1. This patch adds header file (algorithm) inclusion needed for std::sort() function. 
Signed-off-by: Sasha Khapyorsky --- ibdm/datamodel/LinkCover.cpp | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/ibdm/datamodel/LinkCover.cpp b/ibdm/datamodel/LinkCover.cpp index 8f10273..0abeac9 100644 --- a/ibdm/datamodel/LinkCover.cpp +++ b/ibdm/datamodel/LinkCover.cpp @@ -39,6 +39,7 @@ #include #include #include +#include <algorithm> using namespace std; /* -- 1.6.0.1.196.g01914 From mashirle at us.ibm.com Fri Oct 10 08:09:47 2008 From: mashirle at us.ibm.com (Shirley Ma) Date: Fri, 10 Oct 2008 08:09:47 -0700 Subject: [ofa-general] Re: openSM for supporting IPv6 SNM MGIDs consolidation In-Reply-To: <20081010081021.GV4912@sashak.voltaire.com> References: <1223007639.8387.22.camel@IBM-29AB850785D.beaverton.ibm.com> <20081003153900.GC6566@sashak.voltaire.com> <1223049716.8387.24.camel@IBM-29AB850785D.beaverton.ibm.com> <20081003160448.GF6566@sashak.voltaire.com> <1223051358.8387.26.camel@IBM-29AB850785D.beaverton.ibm.com> <1223433892.24201.16.camel@IBM-29AB850785D.beaverton.ibm.com> <20081008032215.GR7563@sashak.voltaire.com> <1223592528.4355.10.camel@IBM-29AB850785D.beaverton.ibm.com> <20081010081021.GV4912@sashak.voltaire.com> Message-ID: <1223651387.4355.13.camel@IBM-29AB850785D.beaverton.ibm.com> Hello Sasha, On Fri, 2008-10-10 at 10:10 +0200, Sasha Khapyorsky wrote: > It is up to you. I will commit this soon anyway and send the patch to > the list. Thanks to fix the typo. I have opened a bug for tracking purpose to make sure OFED-1.4 RC3 to pick up this patch. Bug number is 1277. thanks Shirley From sashak at voltaire.com Fri Oct 10 10:03:03 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 10 Oct 2008 19:03:03 +0200 Subject: [ofa-general] [PATCH] ibutils/ibmgtsim: add missing header file for std::find() In-Reply-To: <20081010150706.GC6947@sashak.voltaire.com> References: <20081010150706.GC6947@sashak.voltaire.com> Message-ID: <20081010170303.GD6947@sashak.voltaire.com> Function prototypes are mandatory with C++.
The compilation fails with gcc-4.3.1. This adds needed header inclusion (algorithm) for std::find() function. Signed-off-by: Sasha Khapyorsky --- Same story as with previous patch, but now in ibmgtsim (not included in default ibutils build). ibmgtsim/src/node.cpp | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/ibmgtsim/src/node.cpp b/ibmgtsim/src/node.cpp index 04f087c..8bc35ae 100644 --- a/ibmgtsim/src/node.cpp +++ b/ibmgtsim/src/node.cpp @@ -50,6 +50,7 @@ #include "node.h" #include "sim.h" #include "randmgr.h" +#include <algorithm> ////////////////////////////////////////////////////////////// // -- 1.6.0.1.196.g01914 From sashak at voltaire.com Fri Oct 10 10:10:25 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 10 Oct 2008 19:10:25 +0200 Subject: [ofa-general] Re: openSM for supporting IPv6 SNM MGIDs consolidation In-Reply-To: <1223651387.4355.13.camel@IBM-29AB850785D.beaverton.ibm.com> References: <1223007639.8387.22.camel@IBM-29AB850785D.beaverton.ibm.com> <20081003153900.GC6566@sashak.voltaire.com> <1223049716.8387.24.camel@IBM-29AB850785D.beaverton.ibm.com> <20081003160448.GF6566@sashak.voltaire.com> <1223051358.8387.26.camel@IBM-29AB850785D.beaverton.ibm.com> <1223433892.24201.16.camel@IBM-29AB850785D.beaverton.ibm.com> <20081008032215.GR7563@sashak.voltaire.com> <1223592528.4355.10.camel@IBM-29AB850785D.beaverton.ibm.com> <20081010081021.GV4912@sashak.voltaire.com> <1223651387.4355.13.camel@IBM-29AB850785D.beaverton.ibm.com> Message-ID: <20081010171025.GE6947@sashak.voltaire.com> On 08:09 Fri 10 Oct , Shirley Ma wrote: > > Thanks to fix the typo. I have opened a bug for tracking purpose to make > sure OFED-1.4 RC3 to pick up this patch. Bug number is 1277. Thanks. I committed the fix to main stream already. You can use OpenSM from master branch. Daily build should be ready tomorrow.
Sasha From rdreier at cisco.com Fri Oct 10 11:58:52 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 10 Oct 2008 11:58:52 -0700 Subject: [ofa-general] Re: [PATCH v2] ipoib: fix hang while bringing down uninitialized interface In-Reply-To: <48CEA6DC.9000904@gmail.com> (Yossi Etigin's message of "Mon, 15 Sep 2008 21:18:04 +0300") References: <48CEA6DC.9000904@gmail.com> Message-ID: > Fix bug #1172: If a pkey for an interface is not found during > initialization, then poll_timer is left uninitialized. When the > device is brought down, ipoib tries to del_timer_sync() it. This > call hangs in an infinite loop in lock_timer_base(), because > timer_base is NULL. We should check whether the timer was really > initialized. Sorry for being so slow to get to this. But does it work to just make sure the timer is always initialized? Seems cleaner that way, and it makes the code an insignificant bit smaller as a bonus. ie does the patch below fix things too? drivers/infiniband/ulp/ipoib/ipoib_ib.c | 7 +++---- 1 files changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 0e748ae..28eb6f0 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -685,10 +685,6 @@ int ipoib_ib_dev_open(struct net_device *dev) queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, round_jiffies_relative(HZ)); - init_timer(&priv->poll_timer); - priv->poll_timer.function = ipoib_ib_tx_timer_func; - priv->poll_timer.data = (unsigned long)dev; - set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags); return 0; @@ -906,6 +902,9 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) return -ENODEV; } + setup_timer(&priv->poll_timer, ipoib_ib_tx_timer_func, + (unsigned long) dev); + if (dev->flags & IFF_UP) { if (ipoib_ib_dev_open(dev)) { ipoib_transport_dev_cleanup(dev); From swise at opengridcomputing.com Fri Oct 10 12:21:28 2008 From: swise at 
opengridcomputing.com (Steve Wise) Date: Fri, 10 Oct 2008 14:21:28 -0500 Subject: [ofa-general] [PATCH] RDMA/cxgb3: Remove cmid reference on tid allocation failures. Message-ID: <20081010192128.17278.8317.stgit@dell3.ogc.int> From: Steve Wise The error path in iwch_connect() can fail to remove the cmid reference, which will cause the process to hang when destroying the cmid. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_cm.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index c325c44..44e936e 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -1942,6 +1942,7 @@ fail4: fail3: cxgb3_free_atid(ep->com.tdev, ep->atid); fail2: + cm_id->rem_ref(cm_id); put_ep(&ep->com); out: return err; From kliteyn at dev.mellanox.co.il Fri Oct 10 13:21:37 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Fri, 10 Oct 2008 22:21:37 +0200 Subject: [ofa-general] Re: [PATCH] ibutils/ibis: prevent buffer overflows In-Reply-To: <20081010143308.GA6947@sashak.voltaire.com> References: <20081010143308.GA6947@sashak.voltaire.com> Message-ID: <48EFB951.50004@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > There are couple of one byte buffer overflows in ibis*_wrap.c* files. > Guess those files where generated originally, but I didn't find from > where stuff like obj->log_file[1024] = '\0' is coming. So fising in > place. Yeah, it wasn't so simple to find where do they come from. description[IB_NODE_DESCRIPTION_SIZE] was relatively easy, but the other one was tricky... I'll send a v2 of your patch with the files that have the origin of these bugs. 
-- Yevgeny > Signed-off-by: Sasha Khapyorsky > --- > ibis/src/ibis_wrap.c | 4 ++-- > ibis/src/ibissh_wrap.cpp | 4 ++-- > 2 files changed, 4 insertions(+), 4 deletions(-) > > diff --git a/ibis/src/ibis_wrap.c b/ibis/src/ibis_wrap.c > index 70bc3b2..85e72d5 100644 > --- a/ibis/src/ibis_wrap.c > +++ b/ibis/src/ibis_wrap.c > @@ -44884,7 +44884,7 @@ static int TclsmVlArbTableCmd(ClientData clientData, Tcl_Interp *interp, int obj > static ibsm_node_desc_str_t * _ibsm_node_desc_description_set(smNodeDesc *obj, ibsm_node_desc_str_t val[IB_NODE_DESCRIPTION_SIZE]) { > { > strncpy((char *)obj->description,(char *)val,IB_NODE_DESCRIPTION_SIZE - 1); > - obj->description[IB_NODE_DESCRIPTION_SIZE] = '\0'; > + obj->description[IB_NODE_DESCRIPTION_SIZE - 1] = '\0'; > } > return (ibsm_node_desc_str_t *) val; > } > @@ -72782,7 +72782,7 @@ static int _wrap_ibis_opt_t_log_flags_get(ClientData clientData, Tcl_Interp *int > static char * _ibis_opt_log_file_set(ibis_opt_t *obj, char val[1024]) { > { > strncpy(obj->log_file,val,1024 - 1); > - obj->log_file[1024] = '\0'; > + obj->log_file[1023] = '\0'; > } > return (char *) val; > } > diff --git a/ibis/src/ibissh_wrap.cpp b/ibis/src/ibissh_wrap.cpp > index a794cc4..ece7c9c 100644 > --- a/ibis/src/ibissh_wrap.cpp > +++ b/ibis/src/ibissh_wrap.cpp > @@ -44946,7 +44946,7 @@ static int TclsmVlArbTableCmd(ClientData clientData, Tcl_Interp *interp, int obj > static ibsm_node_desc_str_t * _ibsm_node_desc_description_set(smNodeDesc *obj, ibsm_node_desc_str_t val[IB_NODE_DESCRIPTION_SIZE]) { > { > strncpy((char *)obj->description,(char *)val,IB_NODE_DESCRIPTION_SIZE - 1); > - obj->description[IB_NODE_DESCRIPTION_SIZE] = '\0'; > + obj->description[IB_NODE_DESCRIPTION_SIZE - 1] = '\0'; > } > return (ibsm_node_desc_str_t *) val; > } > @@ -72844,7 +72844,7 @@ static int _wrap_ibis_opt_t_log_flags_get(ClientData clientData, Tcl_Interp *int > static char * _ibis_opt_log_file_set(ibis_opt_t *obj, char val[1024]) { > { > strncpy(obj->log_file,val,1024 - 1); 
> - obj->log_file[1024] = '\0'; > + obj->log_file[1023] = '\0'; > } > return (char *) val; > } From kliteyn at dev.mellanox.co.il Fri Oct 10 13:22:11 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Fri, 10 Oct 2008 22:22:11 +0200 Subject: [ofa-general] [PATCH v2] ibutils/ibis: prevent buffer overflows Message-ID: <48EFB973.4010105@dev.mellanox.co.il> Oren, As discovered by Sasha: > There are couple of one byte buffer overflows in ibis*_wrap.c* files. > Guess those files where generated originally, but I didn't find from > where stuff like obj->log_file[1024] = '\0' is coming. So fising in > place. > > Signed-off-by: Sasha Khapyorsky Fixing buffer overflows in the .i files. Note that one of them is in typemap of char array, which makes me wonder how this thing even worked... Please regenerate wrappers after this patch. Signed-off-by: Yevgeny Kliteynik --- ibis/src/ibis_typemaps.i | 2 +- ibis/src/ibsm.i | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/ibis/src/ibis_typemaps.i b/ibis/src/ibis_typemaps.i index b9449d2..4855e85 100644 --- a/ibis/src/ibis_typemaps.i +++ b/ibis/src/ibis_typemaps.i @@ -157,7 +157,7 @@ /* handle char arrays as members of a struct */ %typemap (tcl8, memberin) char [ANY] { strncpy($target,$source,$dim0 - 1); - $target[$dim0] = '\0'; + $target[$dim0 - 1] = '\0'; } %typemap(tcl8,out) ib_gid_t* { diff --git a/ibis/src/ibsm.i b/ibis/src/ibsm.i index 5979547..0e3d69b 100644 --- a/ibis/src/ibsm.i +++ b/ibis/src/ibsm.i @@ -642,7 +642,7 @@ typedef struct _ibsm_vl_arb_table } %typemap(tcl8,memberin) ibsm_node_desc_str_t[IB_NODE_DESCRIPTION_SIZE] { strncpy((char *)$target,(char *)$source,IB_NODE_DESCRIPTION_SIZE - 1); - $target[IB_NODE_DESCRIPTION_SIZE] = '\0'; + $target[IB_NODE_DESCRIPTION_SIZE - 1] = '\0'; } %typemap(tcl8,out) ibsm_node_desc_str_t[ANY] { -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Fri Oct 10 14:10:13 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Fri, 10 Oct 2008 23:10:13 
+0200 Subject: [ofa-general] [PATCH v3] ibutils/ibis: prevent buffer overflows Message-ID: <48EFC4B5.2040907@dev.mellanox.co.il> Oren, [v3 of the patch - fixing the signed-off line] As discovered by Sasha, fixing two buffer overflows in the .i files. Note that one of them is in typemap of char array, which makes me wonder how this thing even worked... Please regenerate wrappers after this patch. Signed-off-by: Sasha Khapyorsky Signed-off-by: Yevgeny Kliteynik --- ibis/src/ibis_typemaps.i | 2 +- ibis/src/ibsm.i | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/ibis/src/ibis_typemaps.i b/ibis/src/ibis_typemaps.i index b9449d2..4855e85 100644 --- a/ibis/src/ibis_typemaps.i +++ b/ibis/src/ibis_typemaps.i @@ -157,7 +157,7 @@ /* handle char arrays as members of a struct */ %typemap (tcl8, memberin) char [ANY] { strncpy($target,$source,$dim0 - 1); - $target[$dim0] = '\0'; + $target[$dim0 - 1] = '\0'; } %typemap(tcl8,out) ib_gid_t* { diff --git a/ibis/src/ibsm.i b/ibis/src/ibsm.i index 5979547..0e3d69b 100644 --- a/ibis/src/ibsm.i +++ b/ibis/src/ibsm.i @@ -642,7 +642,7 @@ typedef struct _ibsm_vl_arb_table } %typemap(tcl8,memberin) ibsm_node_desc_str_t[IB_NODE_DESCRIPTION_SIZE] { strncpy((char *)$target,(char *)$source,IB_NODE_DESCRIPTION_SIZE - 1); - $target[IB_NODE_DESCRIPTION_SIZE] = '\0'; + $target[IB_NODE_DESCRIPTION_SIZE - 1] = '\0'; } %typemap(tcl8,out) ibsm_node_desc_str_t[ANY] { -- 1.5.1.4 From jon at opengridcomputing.com Fri Oct 10 14:17:54 2008 From: jon at opengridcomputing.com (Jon Mason) Date: Fri, 10 Oct 2008 16:17:54 -0500 Subject: [ofa-general] [RFC] rds: iWARP RDMA enablement Message-ID: <20081010211753.GA20735@opengridcomputing.com> Hey Andy, This patch contains all of the changes needed to get rds-rdma working on iWARP (with one FIXME left). This patch will apply to a stock OFED-1.4 kernel, and includes the patch I sent out previously with changes to enable RDMA READs. 
While not complete, I wanted to sendout code for review to help me diagnose any coding, design, or style errors. The remaining FIXME in the code is allowing for multiple rds connections (i.e., QPs) from the same host. This patch contains a stress bug that I have yet to determine root cause. Running rds-stress, large RDMA payloads may cause memory corruption. This is most likely caused by running over the bounds of one of the RDS rings under stress, as my rds-simple tests run without problems for long runs. Let me know what you think. Thanks, Jon Signed-Off-By: Jon Mason diff --git a/net/rds/ib.c b/net/rds/ib.c index 926de1e..437ef2a 100644 --- a/net/rds/ib.c +++ b/net/rds/ib.c @@ -43,11 +43,17 @@ unsigned int fmr_pool_size = RDS_FMR_POOL_SIZE; unsigned int fmr_message_size = RDS_FMR_SIZE + 1; /* +1 allows for unaligned MRs */ +unsigned int fastreg_pool_size = RDS_FASTREG_POOL_SIZE; +unsigned int fastreg_message_size = RDS_FASTREG_SIZE + 1; /* +1 allows for unaligned MRs */ module_param(fmr_pool_size, int, 0444); MODULE_PARM_DESC(fmr_pool_size, " Max number of fmr per HCA"); module_param(fmr_message_size, int, 0444); MODULE_PARM_DESC(fmr_message_size, " Max size of a RDMA transfer"); +module_param(fastreg_pool_size, int, 0444); +MODULE_PARM_DESC(fastreg_pool_size, " Max number of fastreg MRs per device"); +module_param(fastreg_message_size, int, 0444); +MODULE_PARM_DESC(fastreg_message_size, " Max size of a RDMA transfer (fastreg MRs)"); struct list_head rds_ib_devices; @@ -113,13 +119,17 @@ void rds_ib_add_one(struct ib_device *device) } else rds_ibdev->mr = NULL; + /* Tell the RDMA code to use the fastreg API */ + if (dev_attr->device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS) + rds_ibdev->use_fastreg = 1; + rds_ibdev->mr_pool = rds_ib_create_mr_pool(rds_ibdev); if (IS_ERR(rds_ibdev->mr_pool)) { rds_ibdev->mr_pool = NULL; goto err_mr; } - INIT_LIST_HEAD(&rds_ibdev->ipaddr_list); + INIT_LIST_HEAD(&rds_ibdev->cm_id_list); INIT_LIST_HEAD(&rds_ibdev->conn_list); 
list_add_tail(&rds_ibdev->list, &rds_ib_devices); @@ -128,7 +138,7 @@ void rds_ib_add_one(struct ib_device *device) goto free_attr; err_mr: - if (!rds_ibdev->dma_local_lkey) + if (rds_ibdev->mr) ib_dereg_mr(rds_ibdev->mr); err_pd: ib_dealloc_pd(rds_ibdev->pd); @@ -141,15 +151,15 @@ free_attr: void rds_ib_remove_one(struct ib_device *device) { struct rds_ib_device *rds_ibdev; - struct rds_ib_ipaddr *i_ipaddr, *i_next; + struct rds_ib_cm_id *i_cm_id, *next; rds_ibdev = ib_get_client_data(device, &rds_ib_client); if (!rds_ibdev) return; - list_for_each_entry_safe(i_ipaddr, i_next, &rds_ibdev->ipaddr_list, list) { - list_del(&i_ipaddr->list); - kfree(i_ipaddr); + list_for_each_entry_safe(i_cm_id, next, &rds_ibdev->cm_id_list, list) { + list_del(&i_cm_id->list); + kfree(i_cm_id); } rds_ib_remove_conns(rds_ibdev); @@ -157,7 +167,7 @@ void rds_ib_remove_one(struct ib_device *device) if (rds_ibdev->mr_pool) rds_ib_destroy_mr_pool(rds_ibdev->mr_pool); - if (!rds_ibdev->dma_local_lkey) + if (rds_ibdev->mr) ib_dereg_mr(rds_ibdev->mr); while (ib_dealloc_pd(rds_ibdev->pd)) { diff --git a/net/rds/ib.h b/net/rds/ib.h index 382c396..efba6fa 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h @@ -9,6 +9,8 @@ #define RDS_FMR_SIZE 256 #define RDS_FMR_POOL_SIZE 4096 +#define RDS_FASTREG_SIZE 20 +#define RDS_FASTREG_POOL_SIZE 2048 #define RDS_IB_MAX_SGE 8 #define RDS_IB_RECV_SGE 2 @@ -49,9 +51,32 @@ struct rds_ib_connect_private { __be32 dp_credit; /* non-zero enables flow ctl */ }; +struct rds_ib_scatterlist { + struct scatterlist *list; + unsigned int len; + int dma_len; + unsigned int dma_npages; + unsigned int bytes; +}; + +struct rds_ib_mapping { + spinlock_t m_lock; + struct list_head m_list; + struct rds_ib_mr *m_mr; + uint32_t m_rkey; + struct rds_ib_scatterlist m_sg; +}; + struct rds_ib_send_work { struct rds_message *s_rm; + + /* We should really put these into a union: */ struct rds_rdma_op *s_op; + struct rds_ib_mapping *s_mapping; + struct ib_mr *s_mr; + struct 
ib_fast_reg_page_list *s_page_list; + unsigned char s_remap_count; + struct ib_send_wr s_wr; struct ib_sge s_sge[RDS_IB_MAX_SGE]; unsigned long s_queued; @@ -126,8 +151,8 @@ struct rds_ib_connection { unsigned int i_flowctl : 1, /* enable/disable flow ctl */ i_iwarp : 1, /* this is actually iWARP not IB */ i_fastreg : 1, /* device supports fastreg */ - i_dma_local_lkey : 1; - + i_dma_local_lkey : 1, + i_fastreg_posted : 1; /* fastreg posted on this connection */ /* Batched completions */ unsigned int i_unsignaled_wrs; long i_unsignaled_bytes; @@ -139,9 +164,9 @@ struct rds_ib_connection { #define IB_SET_SEND_CREDITS(v) ((v) & 0xffff) #define IB_SET_POST_CREDITS(v) ((v) << 16) -struct rds_ib_ipaddr { +struct rds_ib_cm_id { struct list_head list; - __be32 ipaddr; + struct rdma_cm_id *cm_id; }; struct rds_ib_devconn { @@ -151,7 +176,7 @@ struct rds_ib_devconn { struct rds_ib_device { struct list_head list; - struct list_head ipaddr_list; + struct list_head cm_id_list; struct list_head conn_list; struct ib_device *dev; struct ib_pd *pd; @@ -253,6 +278,8 @@ extern struct ib_client rds_ib_client; extern unsigned int fmr_pool_size; extern unsigned int fmr_message_size; +extern unsigned int fastreg_pool_size; +extern unsigned int fastreg_message_size; /* ib_cm.c */ int rds_ib_conn_alloc(struct rds_connection *conn, gfp_t gfp); @@ -268,14 +295,14 @@ void __rds_ib_conn_error(struct rds_connection *conn, const char *, ...); __rds_ib_conn_error(conn, KERN_WARNING "RDS/IB: " fmt ) /* ib_rdma.c */ -int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr); +int rds_ib_update_cm_id(struct rds_ib_device *rds_ibdev, struct rdma_cm_id *cm_id); int rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn); void rds_ib_remove_conns(struct rds_ib_device *rds_ibdev); struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *); void rds_ib_get_mr_info(struct rds_ib_device *rds_ibdev, struct rds_info_ib_connection *iinfo); void 
rds_ib_destroy_mr_pool(struct rds_ib_mr_pool *); void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents, - __be32 ip_addr, u32 *key_ret); + struct rds_sock *rs, u32 *key_ret); void rds_ib_sync_mr(void *trans_private, int dir); void rds_ib_free_mr(void *trans_private, int invalidate); void rds_ib_flush_mrs(void); diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index 5b47d72..ffa9f39 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -32,7 +32,6 @@ */ #include #include -#include #include "rds.h" #include "ib.h" @@ -140,7 +139,7 @@ static void rds_ib_connect_complete(struct rds_connection *conn, struct rdma_cm_ /* update ib_device with this local ipaddr & conn */ rds_ibdev = ib_get_client_data(ic->i_cm_id->device, &rds_ib_client); - err = rds_ib_update_ipaddr(rds_ibdev, conn->c_laddr); + err = rds_ib_update_cm_id(rds_ibdev, ic->i_cm_id); if (err) printk(KERN_ERR "rds_ib_update_ipaddr failed (%d)\n", err); err = rds_ib_add_conn(rds_ibdev, conn); @@ -210,8 +209,12 @@ static void rds_ib_qp_event_handler(struct ib_event *event, void *data) case IB_EVENT_COMM_EST: rdma_notify(ic->i_cm_id, IB_EVENT_COMM_EST); break; + case IB_EVENT_QP_REQ_ERR: + printk("Hit IB_EVENT_QP_REQ_ERR\n"); + ic->i_cm_id->qp = NULL; + break; default: - printk(KERN_WARNING "RDS/ib: unhandled QP event %u " + printk(KERN_WARNING "RDS/IB: unhandled QP event %u " "on connection to %u.%u.%u.%u\n", event->event, NIPQUAD(conn->c_faddr)); break; @@ -219,6 +222,79 @@ static void rds_ib_qp_event_handler(struct ib_event *event, void *data) } /* + * Create a QP + */ +static int rds_ib_init_qp_attrs(struct ib_qp_init_attr *attr, + struct rds_ib_device *rds_ibdev, + struct rds_ib_work_ring *send_ring, + void (*send_cq_handler)(struct ib_cq *, void *), + struct rds_ib_work_ring *recv_ring, + void (*recv_cq_handler)(struct ib_cq *, void *), + void *context) +{ + struct ib_device *dev = rds_ibdev->dev; + unsigned int send_size, recv_size; + int ret; + + /* The offset of 1 is to accomodate the additional 
ACK WR. */ + send_size = min_t(unsigned int, rds_ibdev->max_wrs, rds_ib_sysctl_max_send_wr + 1); + recv_size = min_t(unsigned int, rds_ibdev->max_wrs, rds_ib_sysctl_max_recv_wr + 1); + rds_ib_ring_resize(send_ring, send_size - 1); + rds_ib_ring_resize(recv_ring, recv_size - 1); + + memset(attr, 0, sizeof(*attr)); + attr->event_handler = rds_ib_qp_event_handler; + attr->qp_context = context; + attr->cap.max_send_wr = send_size; + attr->cap.max_recv_wr = recv_size; + attr->cap.max_send_sge = rds_ibdev->max_sge; + attr->cap.max_recv_sge = RDS_IB_RECV_SGE; + attr->sq_sig_type = IB_SIGNAL_REQ_WR; + attr->qp_type = IB_QPT_RC; + + attr->send_cq = ib_create_cq(dev, send_cq_handler, + rds_ib_cq_event_handler, + context, send_size, 0); + if (IS_ERR(attr->send_cq)) { + ret = PTR_ERR(attr->send_cq); + attr->send_cq = NULL; + rdsdebug("ib_create_cq send failed: %d\n", ret); + goto out; + } + + attr->recv_cq = ib_create_cq(dev, recv_cq_handler, + rds_ib_cq_event_handler, + context, recv_size, 0); + if (IS_ERR(attr->recv_cq)) { + ret = PTR_ERR(attr->recv_cq); + attr->recv_cq = NULL; + rdsdebug("ib_create_cq send failed: %d\n", ret); + goto out; + } + + ret = ib_req_notify_cq(attr->send_cq, IB_CQ_NEXT_COMP); + if (ret) { + rdsdebug("ib_req_notify_cq send failed: %d\n", ret); + goto out; + } + + ret = ib_req_notify_cq(attr->recv_cq, IB_CQ_SOLICITED); + if (ret) { + rdsdebug("ib_req_notify_cq recv failed: %d\n", ret); + goto out; + } + +out: + if (ret) { + if (attr->send_cq) + ib_destroy_cq(attr->send_cq); + if (attr->recv_cq) + ib_destroy_cq(attr->recv_cq); + } + return ret; +} + +/* * This needs to be very careful to not leave IS_ERR pointers around for * cleanup to trip over. 
*/ @@ -243,60 +319,19 @@ static int rds_ib_setup_qp(struct rds_connection *conn) return -EOPNOTSUPP; } - if (rds_ibdev->max_wrs < ic->i_send_ring.w_nr + 1) - rds_ib_ring_resize(&ic->i_send_ring, rds_ibdev->max_wrs - 1); - if (rds_ibdev->max_wrs < ic->i_recv_ring.w_nr + 1) - rds_ib_ring_resize(&ic->i_recv_ring, rds_ibdev->max_wrs - 1); - /* Protection domain and memory range */ ic->i_pd = rds_ibdev->pd; ic->i_mr = rds_ibdev->mr; - ic->i_send_cq = ib_create_cq(dev, rds_ib_send_cq_comp_handler, - rds_ib_cq_event_handler, conn, - ic->i_send_ring.w_nr + 1, 0); - if (IS_ERR(ic->i_send_cq)) { - ret = PTR_ERR(ic->i_send_cq); - ic->i_send_cq = NULL; - rdsdebug("ib_create_cq send failed: %d\n", ret); - goto out; - } - - ic->i_recv_cq = ib_create_cq(dev, rds_ib_recv_cq_comp_handler, - rds_ib_cq_event_handler, conn, - ic->i_recv_ring.w_nr, 0); - if (IS_ERR(ic->i_recv_cq)) { - ret = PTR_ERR(ic->i_recv_cq); - ic->i_recv_cq = NULL; - rdsdebug("ib_create_cq recv failed: %d\n", ret); - goto out; - } - - ret = ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP); - if (ret) { - rdsdebug("ib_req_notify_cq send failed: %d\n", ret); + ret = rds_ib_init_qp_attrs(&attr, rds_ibdev, + &ic->i_send_ring, rds_ib_send_cq_comp_handler, + &ic->i_recv_ring, rds_ib_recv_cq_comp_handler, + conn); + if (ret < 0) goto out; - } - - ret = ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED); - if (ret) { - rdsdebug("ib_req_notify_cq recv failed: %d\n", ret); - goto out; - } - /* XXX negotiate max send/recv with remote? 
*/ - memset(&attr, 0, sizeof(attr)); - attr.event_handler = rds_ib_qp_event_handler; - attr.qp_context = conn; - /* + 1 to allow for the single ack message */ - attr.cap.max_send_wr = ic->i_send_ring.w_nr + 1; - attr.cap.max_recv_wr = ic->i_recv_ring.w_nr + 1; - attr.cap.max_send_sge = rds_ibdev->max_sge; - attr.cap.max_recv_sge = RDS_IB_RECV_SGE; - attr.sq_sig_type = IB_SIGNAL_REQ_WR; - attr.qp_type = IB_QPT_RC; - attr.send_cq = ic->i_send_cq; - attr.recv_cq = ic->i_recv_cq; + ic->i_send_cq = attr.send_cq; + ic->i_recv_cq = attr.recv_cq; /* * XXX this can fail if max_*_wr is too large? Are we supposed @@ -487,7 +522,7 @@ static int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id, /* update ib_device with this local ipaddr & conn */ rds_ibdev = ib_get_client_data(ic->i_cm_id->device, &rds_ib_client); - err = rds_ib_update_ipaddr(rds_ibdev, dp->dp_saddr); + err = rds_ib_update_cm_id(rds_ibdev, cm_id); if (err) { rds_ib_conn_error(conn, "rds_ib_update_ipaddr failed (%d)\n", err); goto out; @@ -853,7 +888,7 @@ int __init rds_ib_listen_init(void) cm_id = rdma_create_id(rds_ib_cm_event_handler, NULL, RDMA_PS_TCP); if (IS_ERR(cm_id)) { ret = PTR_ERR(cm_id); - printk(KERN_ERR "RDS/ib: failed to setup listener, " + printk(KERN_ERR "RDS/IB: failed to setup listener, " "rdma_create_id() returned %d\n", ret); goto out; } @@ -868,14 +903,14 @@ int __init rds_ib_listen_init(void) */ ret = rdma_bind_addr(cm_id, (struct sockaddr *)&sin); if (ret) { - printk(KERN_ERR "RDS/ib: failed to setup listener, " + printk(KERN_ERR "RDS/IB: failed to setup listener, " "rdma_bind_addr() returned %d\n", ret); goto out; } ret = rdma_listen(cm_id, 128); if (ret) { - printk(KERN_ERR "RDS/ib: failed to setup listener, " + printk(KERN_ERR "RDS/IB: failed to setup listener, " "rdma_listen() returned %d\n", ret); goto out; } diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c index 89e293a..89d1b24 100644 --- a/net/rds/ib_rdma.c +++ b/net/rds/ib_rdma.c @@ -45,100 +45,203 @@ extern struct list_head 
rds_ib_devices; struct rds_ib_mr { struct rds_ib_device *device; struct rds_ib_mr_pool *pool; - struct ib_fmr *fmr; - struct list_head list; - unsigned int remap_count; - - struct scatterlist * sg; - unsigned int sg_len; - u64 * dma; - int sg_dma_len; + + struct ib_qp *qp; + + union { + struct ib_fmr *fmr; + /* fastreg stuff and maybe others go here */ + struct { + struct ib_mr *mr; + struct ib_fast_reg_page_list *page_list; + } fastreg; + } u; + struct rds_ib_mapping mapping; + unsigned char remap_count; }; +#define fr_mr u.fastreg.mr +#define fr_page_list u.fastreg.page_list + /* * Our own little FMR pool */ struct rds_ib_mr_pool { + struct rds_ib_device *device; /* back ptr to the device that owns us */ + struct mutex flush_lock; /* serialize fmr invalidate */ struct work_struct flush_worker; /* flush worker */ spinlock_t list_lock; /* protect variables below */ atomic_t item_count; /* total # of MRs */ atomic_t dirty_count; /* # dirty of MRs */ - struct list_head drop_list; /* MRs that have reached their max_maps limit */ - struct list_head free_list; /* unused MRs */ + struct list_head dirty_list; /* dirty mappings */ struct list_head clean_list; /* unused & unamapped MRs */ atomic_t free_pinned; /* memory pinned by free MRs */ + unsigned long max_message_size; /* in pages */ unsigned long max_items; unsigned long max_items_soft; unsigned long max_free_pinned; struct ib_fmr_attr fmr_attr; + + struct rds_ib_mr_pool_ops *op; +}; + +struct rds_ib_mr_pool_ops { + int (*init)(struct rds_ib_mr_pool *, struct rds_ib_mr *); + int (*map)(struct rds_ib_mr_pool *pool, struct rds_ib_mr *ibmr, + struct scatterlist *sg, unsigned int sg_len); + void (*free)(struct rds_ib_mr_pool *pool, struct rds_ib_mr *ibmr); + unsigned int (*unmap)(struct rds_ib_mr_pool *, struct list_head *, + struct list_head *); + void (*destroy)(struct rds_ib_mr_pool *, struct rds_ib_mr *); }; + static int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, int free_all); static void 
rds_ib_teardown_mr(struct rds_ib_mr *ibmr); static void rds_ib_mr_pool_flush_worker(struct work_struct *work); +static int rds_ib_init_fmr(struct rds_ib_mr_pool *pool, struct rds_ib_mr *ibmr); +static int rds_ib_map_fmr(struct rds_ib_mr_pool *pool, + struct rds_ib_mr *ibmr, + struct scatterlist *sg, unsigned int nents); +static void rds_ib_free_fmr(struct rds_ib_mr_pool *pool, struct rds_ib_mr *ibmr); +static unsigned int rds_ib_unmap_fmr_list(struct rds_ib_mr_pool *pool, + struct list_head *unmap_list, + struct list_head *kill_list); +static void rds_ib_destroy_fmr(struct rds_ib_mr_pool *pool, struct rds_ib_mr *ibmr); +static int rds_ib_init_fastreg(struct rds_ib_mr_pool *pool, struct rds_ib_mr *ibmr); +static int rds_ib_map_fastreg(struct rds_ib_mr_pool *pool, + struct rds_ib_mr *ibmr, + struct scatterlist *sg, unsigned int nents); +static void rds_ib_free_fastreg(struct rds_ib_mr_pool *pool, struct rds_ib_mr *ibmr); +static unsigned int rds_ib_unmap_fastreg_list(struct rds_ib_mr_pool *pool, + struct list_head *unmap_list, + struct list_head *kill_list); +static void rds_ib_destroy_fastreg(struct rds_ib_mr_pool *pool, struct rds_ib_mr *ibmr); + +static struct rds_ib_mr_pool_ops rds_ib_fmr_pool_ops = { + .init = rds_ib_init_fmr, + .map = rds_ib_map_fmr, + .free = rds_ib_free_fmr, + .unmap = rds_ib_unmap_fmr_list, + .destroy = rds_ib_destroy_fmr, +}; -static struct rds_ib_device* rds_ib_get_device(__be32 ipaddr) -{ - struct rds_ib_device *rds_ibdev; - struct rds_ib_ipaddr *i_ipaddr; +static struct rds_ib_mr_pool_ops rds_ib_fastreg_pool_ops = { + .init = rds_ib_init_fastreg, + .map = rds_ib_map_fastreg, + .free = rds_ib_free_fastreg, + .unmap = rds_ib_unmap_fastreg_list, + .destroy = rds_ib_destroy_fastreg, +}; - list_for_each_entry(rds_ibdev, &rds_ib_devices, list) { - spin_lock_irq(&rds_ibdev->spinlock); - list_for_each_entry(i_ipaddr, &rds_ibdev->ipaddr_list, list) { - if (i_ipaddr->ipaddr == ipaddr) { - spin_unlock_irq(&rds_ibdev->spinlock); - return rds_ibdev; 
+static int rds_ib_get_device(struct rds_sock *rs, struct rds_ib_device **rds_ibdev, struct ib_qp **qp) +{ + struct rds_ib_device *ibdev; + struct rds_ib_cm_id *i_cm_id; + + *rds_ibdev = NULL; + *qp = NULL; + + list_for_each_entry(ibdev, &rds_ib_devices, list) { + spin_lock_irq(&ibdev->spinlock); + list_for_each_entry(i_cm_id, &ibdev->cm_id_list, list) { + struct sockaddr_in *src_addr, *dst_addr; + + src_addr = (struct sockaddr_in *)&i_cm_id->cm_id->route.addr.src_addr; + dst_addr = (struct sockaddr_in *)&i_cm_id->cm_id->route.addr.dst_addr; + + rdsdebug("%s: local ipaddr = %x port %d, remote ipaddr = %x port %d" + "....looking for %x port %d, remote ipaddr = %x port %d\n", + __func__, + src_addr->sin_addr.s_addr, + src_addr->sin_port, + dst_addr->sin_addr.s_addr, + dst_addr->sin_port, + rs->rs_bound_addr, + rs->rs_bound_port, + rs->rs_conn_addr, + rs->rs_conn_port); +#if WORKING_TUPLE_DETECTION + if (src_addr->sin_addr.s_addr == rs->rs_bound_addr && + src_addr->sin_port == rs->rs_bound_port && + dst_addr->sin_addr.s_addr == rs->rs_conn_addr && + dst_addr->sin_port == rs->rs_conn_port) { +#else + /* FIXME - needs to compare the local and remote ipaddr/port tuple, but the + * ipaddr is the only available information in the rds_sock (as the rest are + * zeroed). It doesn't appear to be properly populated during connection + * setup...
+ */ + if (src_addr->sin_addr.s_addr == rs->rs_bound_addr) { +#endif + spin_unlock_irq(&ibdev->spinlock); + *rds_ibdev = ibdev; + *qp = i_cm_id->cm_id->qp; + return 0; } } - spin_unlock_irq(&rds_ibdev->spinlock); + spin_unlock_irq(&ibdev->spinlock); } - return NULL; + return 1; } -static int rds_ib_add_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr) +static int rds_ib_add_cm_id(struct rds_ib_device *rds_ibdev, struct rdma_cm_id *cm_id) { - struct rds_ib_ipaddr *i_ipaddr; + struct rds_ib_cm_id *i_cm_id; - i_ipaddr = kmalloc(sizeof *i_ipaddr, GFP_KERNEL); - if (!i_ipaddr) + i_cm_id = kmalloc(sizeof *i_cm_id, GFP_KERNEL); + if (!i_cm_id) return -ENOMEM; - i_ipaddr->ipaddr = ipaddr; + i_cm_id->cm_id = cm_id; spin_lock_irq(&rds_ibdev->spinlock); - list_add_tail(&i_ipaddr->list, &rds_ibdev->ipaddr_list); + list_add_tail(&i_cm_id->list, &rds_ibdev->cm_id_list); spin_unlock_irq(&rds_ibdev->spinlock); return 0; } -static void rds_ib_remove_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr) +static void rds_ib_remove_cm_id(struct rds_ib_device *rds_ibdev, struct rdma_cm_id *cm_id) { - struct rds_ib_ipaddr *i_ipaddr, *next; + struct rds_ib_cm_id *i_cm_id; spin_lock_irq(&rds_ibdev->spinlock); - list_for_each_entry_safe(i_ipaddr, next, &rds_ibdev->ipaddr_list, list) { - if (i_ipaddr->ipaddr == ipaddr) { - list_del(&i_ipaddr->list); - kfree(i_ipaddr); + list_for_each_entry(i_cm_id, &rds_ibdev->cm_id_list, list) { + if (i_cm_id->cm_id == cm_id) { + list_del(&i_cm_id->list); + kfree(i_cm_id); break; } } spin_unlock_irq(&rds_ibdev->spinlock); } -int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr) + +int rds_ib_update_cm_id(struct rds_ib_device *rds_ibdev, struct rdma_cm_id *cm_id) { - struct rds_ib_device *rds_ibdev_old; + struct sockaddr_in *src_addr, *dst_addr; + struct rds_ib_device *rds_ibdev_old; + struct rds_sock rs; + struct ib_qp *qp; + int rc; + + src_addr = (struct sockaddr_in *)&cm_id->route.addr.src_addr; + dst_addr = (struct sockaddr_in 
*)&cm_id->route.addr.dst_addr; + + rs.rs_bound_addr = src_addr->sin_addr.s_addr; + rs.rs_bound_port = src_addr->sin_port; + rs.rs_conn_addr = dst_addr->sin_addr.s_addr; + rs.rs_conn_port = dst_addr->sin_port; - rds_ibdev_old = rds_ib_get_device(ipaddr); - if (rds_ibdev_old) - rds_ib_remove_ipaddr(rds_ibdev_old, ipaddr); + rc = rds_ib_get_device(&rs, &rds_ibdev_old, &qp); + if (rc) + rds_ib_remove_cm_id(rds_ibdev, cm_id); - return rds_ib_add_ipaddr(rds_ibdev, ipaddr); + return rds_ib_add_cm_id(rds_ibdev, cm_id); } int rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn) @@ -172,26 +275,152 @@ void rds_ib_remove_conns(struct rds_ib_device *rds_ibdev) spin_unlock_irq(&rds_ibdev->spinlock); } -struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_ibdev) +static void rds_ib_set_scatterlist(struct rds_ib_scatterlist *sg, + struct scatterlist *list, unsigned int sg_len) { - struct rds_ib_mr_pool *pool; + sg->list = list; + sg->len = sg_len; + sg->dma_len = 0; + sg->dma_npages = 0; + sg->bytes = 0; +} + +static int rds_ib_drop_scatterlist(struct rds_ib_device *rds_ibdev, + struct rds_ib_scatterlist *sg) +{ + int unpinned = 0; + + if (sg->dma_len) { + ib_dma_unmap_sg(rds_ibdev->dev, + sg->list, sg->len, + DMA_BIDIRECTIONAL); + sg->dma_len = 0; + } + + /* Release the s/g list */ + if (sg->len) { + unsigned int i; + + for (i = 0; i < sg->len; ++i) { + struct page *page = sg_page(&sg->list[i]); + + /* FIXME we need a way to tell a r/w MR + * from a r/o MR */ + set_page_dirty(page); + put_page(page); + } + + unpinned = sg->len; + sg->len = 0; + + kfree(sg->list); + sg->list = NULL; + } + + return unpinned; +} + +static u64 *rds_ib_map_scatterlist(struct rds_ib_device *rds_ibdev, + struct rds_ib_scatterlist *sg, + unsigned int dma_page_shift) +{ + struct ib_device *dev = rds_ibdev->dev; + u64 *dma_pages = NULL; + u64 dma_mask; + unsigned int dma_page_size; + int i, j, ret; - /* For now, disable all RDMA service on iWARP. 
This check will - * go away when we have a working patch. */ - if (rds_ibdev->dev->node_type == RDMA_NODE_RNIC) - return NULL; + dma_page_size = 1 << dma_page_shift; + dma_mask = dma_page_size - 1; + + WARN_ON(sg->dma_len); + + sg->dma_len = ib_dma_map_sg(dev, sg->list, sg->len, DMA_BIDIRECTIONAL); + if (unlikely(!sg->dma_len)) { + printk(KERN_WARNING "RDS/IB: dma_map_sg failed!\n"); + return ERR_PTR(-EBUSY); + } + + sg->bytes = 0; + sg->dma_npages = 0; + + ret = -EINVAL; + for (i = 0; i < sg->dma_len; ++i) { + unsigned int dma_len = ib_sg_dma_len(dev, &sg->list[i]); + u64 dma_addr = ib_sg_dma_address(dev, &sg->list[i]); + u64 end_addr; + + sg->bytes += dma_len; + + end_addr = dma_addr + dma_len; + if (dma_addr & dma_mask) { + if (i > 0) + goto out_unmap; + dma_addr &= ~dma_mask; + } + if (end_addr & dma_mask) { + if (i < sg->dma_len - 1) + goto out_unmap; + end_addr = (end_addr + dma_mask) & ~dma_mask; + } + + sg->dma_npages += (end_addr - dma_addr) >> dma_page_shift; + } + + /* Now gather the dma addrs into one list */ + if (sg->dma_npages > fmr_message_size) + goto out_unmap; + + dma_pages = kmalloc(sizeof(u64) * sg->dma_npages, GFP_ATOMIC); + if (!dma_pages) { + ret = -ENOMEM; + goto out_unmap; + } + + for (i = j = 0; i < sg->dma_len; ++i) { + unsigned int dma_len = ib_sg_dma_len(dev, &sg->list[i]); + u64 dma_addr = ib_sg_dma_address(dev, &sg->list[i]); + u64 end_addr; + + end_addr = dma_addr + dma_len; + dma_addr &= ~dma_mask; + for (; dma_addr < end_addr; dma_addr += dma_page_size) + dma_pages[j++] = dma_addr; + } + + return dma_pages; + +out_unmap: + ib_dma_unmap_sg(rds_ibdev->dev, sg->list, sg->len, DMA_BIDIRECTIONAL); + sg->dma_len = 0; + if (dma_pages) + kfree(dma_pages); + return ERR_PTR(ret); +} + + +static struct rds_ib_mr_pool *__rds_ib_create_mr_pool(struct rds_ib_device *rds_ibdev, + unsigned int message_size, unsigned int pool_size, + struct rds_ib_mr_pool_ops *ops) +{ + struct rds_ib_mr_pool *pool; pool = kzalloc(sizeof(*pool), GFP_KERNEL); if 
(!pool) return ERR_PTR(-ENOMEM); - INIT_LIST_HEAD(&pool->free_list); - INIT_LIST_HEAD(&pool->drop_list); + pool->op = ops; + pool->device = rds_ibdev; + INIT_LIST_HEAD(&pool->dirty_list); INIT_LIST_HEAD(&pool->clean_list); mutex_init(&pool->flush_lock); spin_lock_init(&pool->list_lock); INIT_WORK(&pool->flush_worker, rds_ib_mr_pool_flush_worker); + pool->max_message_size = message_size; + pool->max_items = pool_size; + pool->max_free_pinned = pool->max_items * pool->max_message_size / 4; + pool->fmr_attr.max_pages = fmr_message_size; pool->fmr_attr.max_maps = rds_ibdev->fmr_max_remaps; pool->fmr_attr.page_shift = rds_ibdev->fmr_page_shift; @@ -202,8 +431,44 @@ struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_ibdev) * items more aggressively. * Make sure that max_items > max_items_soft > max_items / 2 */ - pool->max_items_soft = rds_ibdev->max_fmrs * 3 / 4; - pool->max_items = rds_ibdev->max_fmrs; + pool->max_items_soft = pool->max_items * 3 / 4; + + return pool; +} + +struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_ibdev) +{ + struct rds_ib_mr_pool *pool; + unsigned int pool_size; + + if (!rds_ibdev->use_fastreg) { + /* Use FMRs to implement memory registrations */ + pool_size = fmr_pool_size; + + if (rds_ibdev->max_fmrs && rds_ibdev->max_fmrs < pool_size) + pool_size = rds_ibdev->max_fmrs; + + pool = __rds_ib_create_mr_pool(rds_ibdev, fmr_message_size, pool_size, + &rds_ib_fmr_pool_ops); + + if (!IS_ERR(pool)) { + pool->fmr_attr.max_pages = pool->max_message_size; + pool->fmr_attr.max_maps = rds_ibdev->fmr_max_remaps; + pool->fmr_attr.page_shift = rds_ibdev->fmr_page_shift; + } + } else { + /* Use fastregs to implement memory registrations */ + pool_size = fastreg_pool_size; + + pool = __rds_ib_create_mr_pool(rds_ibdev, + fastreg_message_size, + pool_size, + &rds_ib_fastreg_pool_ops); + + if (IS_ERR(pool)) { + printk("__rds_ib_create_mr_pool error\n"); + } + } return pool; } @@ -232,15 +497,15 @@ static inline struct 
rds_ib_mr *rds_ib_reuse_fmr(struct rds_ib_mr_pool *pool) spin_lock_irqsave(&pool->list_lock, flags); if (!list_empty(&pool->clean_list)) { - ibmr = list_entry(pool->clean_list.next, struct rds_ib_mr, list); - list_del_init(&ibmr->list); + ibmr = list_entry(pool->clean_list.next, struct rds_ib_mr, mapping.m_list); + list_del_init(&ibmr->mapping.m_list); } spin_unlock_irqrestore(&pool->list_lock, flags); return ibmr; } -static struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev) +static struct rds_ib_mr *rds_ib_alloc_mr(struct rds_ib_device *rds_ibdev) { struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool; struct rds_ib_mr *ibmr = NULL; @@ -280,114 +545,26 @@ static struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev) goto out_no_cigar; } - ibmr->fmr = ib_alloc_fmr(rds_ibdev->pd, - (IB_ACCESS_LOCAL_WRITE | - IB_ACCESS_REMOTE_READ | - IB_ACCESS_REMOTE_WRITE), - &pool->fmr_attr); - if (IS_ERR(ibmr->fmr)) { - err = PTR_ERR(ibmr->fmr); - ibmr->fmr = NULL; - printk(KERN_WARNING "RDS/IB: ib_alloc_fmr failed (err=%d)\n", err); + spin_lock_init(&ibmr->mapping.m_lock); + INIT_LIST_HEAD(&ibmr->mapping.m_list); + ibmr->mapping.m_mr = ibmr; + + err = pool->op->init(pool, ibmr); + if (err) goto out_no_cigar; - } rds_ib_stats_inc(s_ib_rdma_mr_alloc); return ibmr; out_no_cigar: if (ibmr) { - if (ibmr->fmr) - ib_dealloc_fmr(ibmr->fmr); + pool->op->destroy(pool, ibmr); kfree(ibmr); } atomic_dec(&pool->item_count); return ERR_PTR(err); } -static int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr, - struct scatterlist *sg, unsigned int nents) -{ - struct ib_device *dev = rds_ibdev->dev; - struct scatterlist *scat = sg; - u64 io_addr = 0; - u64 *dma_pages; - u32 len; - int page_cnt, sg_dma_len; - int i, j; - int ret; - - sg_dma_len = ib_dma_map_sg(dev, sg, nents, - DMA_BIDIRECTIONAL); - if (unlikely(!sg_dma_len)) { - printk(KERN_WARNING "RDS/IB: dma_map_sg failed!\n"); - return -EBUSY; - } - - len = 0; - page_cnt = 0; - - for (i = 0; i 
< sg_dma_len; ++i) { - unsigned int dma_len = ib_sg_dma_len(dev, &scat[i]); - u64 dma_addr = ib_sg_dma_address(dev, &scat[i]); - - if (dma_addr & ~rds_ibdev->fmr_page_mask) { - if (i > 0) - return -EINVAL; - else - ++page_cnt; - } - if ((dma_addr + dma_len) & ~rds_ibdev->fmr_page_mask) { - if (i < sg_dma_len - 1) - return -EINVAL; - else - ++page_cnt; - } - - len += dma_len; - } - - page_cnt += len >> rds_ibdev->fmr_page_shift; - if (page_cnt > fmr_message_size) - return -EINVAL; - - dma_pages = kmalloc(sizeof(u64) * page_cnt, GFP_ATOMIC); - if (!dma_pages) - return -ENOMEM; - - page_cnt = 0; - for (i = 0; i < sg_dma_len; ++i) { - unsigned int dma_len = ib_sg_dma_len(dev, &scat[i]); - u64 dma_addr = ib_sg_dma_address(dev, &scat[i]); - - for (j = 0; j < dma_len; j += rds_ibdev->fmr_page_size) - dma_pages[page_cnt++] = - (dma_addr & rds_ibdev->fmr_page_mask) + j; - } - - ret = ib_map_phys_fmr(ibmr->fmr, - dma_pages, page_cnt, io_addr); - if (ret) - goto out; - - /* Success - we successfully remapped the MR, so we can - * safely tear down the old mapping. 
*/ - rds_ib_teardown_mr(ibmr); - - ibmr->sg = scat; - ibmr->sg_len = nents; - ibmr->sg_dma_len = sg_dma_len; - ibmr->remap_count++; - - rds_ib_stats_inc(s_ib_rdma_mr_used); - ret = 0; - -out: - kfree(dma_pages); - - return ret; -} - void rds_ib_sync_mr(void *trans_private, int direction) { struct rds_ib_mr *ibmr = trans_private; @@ -395,51 +572,21 @@ void rds_ib_sync_mr(void *trans_private, int direction) switch (direction) { case DMA_FROM_DEVICE: - ib_dma_sync_sg_for_cpu(rds_ibdev->dev, ibmr->sg, - ibmr->sg_dma_len, DMA_BIDIRECTIONAL); + ib_dma_sync_sg_for_cpu(rds_ibdev->dev, ibmr->mapping.m_sg.list, + ibmr->mapping.m_sg.dma_len, DMA_BIDIRECTIONAL); break; case DMA_TO_DEVICE: - ib_dma_sync_sg_for_device(rds_ibdev->dev, ibmr->sg, - ibmr->sg_dma_len, DMA_BIDIRECTIONAL); + ib_dma_sync_sg_for_device(rds_ibdev->dev, ibmr->mapping.m_sg.list, + ibmr->mapping.m_sg.dma_len, DMA_BIDIRECTIONAL); break; } } -static void __rds_ib_teardown_mr(struct rds_ib_mr *ibmr) -{ - struct rds_ib_device *rds_ibdev = ibmr->device; - - if (ibmr->sg_dma_len) { - ib_dma_unmap_sg(rds_ibdev->dev, - ibmr->sg, ibmr->sg_len, - DMA_BIDIRECTIONAL); - ibmr->sg_dma_len = 0; - } - - /* Release the s/g list */ - if (ibmr->sg_len) { - unsigned int i; - - for (i = 0; i < ibmr->sg_len; ++i) { - struct page *page = sg_page(&ibmr->sg[i]); - - /* FIXME we need a way to tell a r/w MR - * from a r/o MR */ - set_page_dirty(page); - put_page(page); - } - kfree(ibmr->sg); - - ibmr->sg = NULL; - ibmr->sg_len = 0; - } -} - void rds_ib_teardown_mr(struct rds_ib_mr *ibmr) { - unsigned int pinned = ibmr->sg_len; + unsigned int pinned; - __rds_ib_teardown_mr(ibmr); + pinned = rds_ib_drop_scatterlist(ibmr->device, &ibmr->mapping.m_sg); if (pinned) { struct rds_ib_device *rds_ibdev = ibmr->device; struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool; @@ -472,8 +619,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, int free_all) { struct rds_ib_mr *ibmr, *next; LIST_HEAD(unmap_list); - LIST_HEAD(fmr_list); - unsigned 
long unpinned = 0; + LIST_HEAD(kill_list); unsigned long flags; unsigned int nfreed = 0, ncleaned = 0, free_goal; int ret = 0; @@ -483,49 +629,50 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, int free_all) mutex_lock(&pool->flush_lock); spin_lock_irqsave(&pool->list_lock, flags); - /* Get the list of all MRs to be dropped. Ordering matters - - * we want to put drop_list ahead of free_list. */ - list_splice_init(&pool->free_list, &unmap_list); - list_splice_init(&pool->drop_list, &unmap_list); + /* Get the list of all mappings to be destroyed */ + list_splice_init(&pool->dirty_list, &unmap_list); if (free_all) - list_splice_init(&pool->clean_list, &unmap_list); + list_splice_init(&pool->clean_list, &kill_list); spin_unlock_irqrestore(&pool->list_lock, flags); free_goal = rds_ib_flush_goal(pool, free_all); - if (list_empty(&unmap_list)) - goto out; - - /* String all ib_mr's onto one list and hand them to ib_unmap_fmr */ - list_for_each_entry(ibmr, &unmap_list, list) - list_add(&ibmr->fmr->list, &fmr_list); - ret = ib_unmap_fmr(&fmr_list); - if (ret) - printk(KERN_WARNING "RDS/IB: ib_unmap_fmr failed (err=%d)\n", ret); + /* Batched invalidate of dirty MRs. + * For FMR based MRs, the mappings on the unmap list are + * actually members of an ibmr (ibmr->mapping). They either + * migrate to the kill_list, or have been cleaned and should be + * moved to the clean_list. + * For fastregs, they will be dynamically allocated, and + * will be destroyed by the unmap function. 
+ */ + if (!list_empty(&unmap_list)) { + ncleaned = pool->op->unmap(pool, &unmap_list, &kill_list); + /* If we've been asked to destroy all MRs, move those + * that were simply cleaned to the kill list */ + if (free_all) + list_splice_init(&unmap_list, &kill_list); + } - /* Now we can destroy the DMA mapping and unpin any pages */ - list_for_each_entry_safe(ibmr, next, &unmap_list, list) { - unpinned += ibmr->sg_len; - __rds_ib_teardown_mr(ibmr); - if (nfreed < free_goal || ibmr->remap_count >= pool->fmr_attr.max_maps) { - rds_ib_stats_inc(s_ib_rdma_mr_free); - list_del(&ibmr->list); - ib_dealloc_fmr(ibmr->fmr); - kfree(ibmr); - nfreed++; - } - ncleaned++; + /* Destroy any MRs that are past their best before date */ + list_for_each_entry_safe(ibmr, next, &kill_list, mapping.m_list) { + rds_ib_stats_inc(s_ib_rdma_mr_free); + list_del(&ibmr->mapping.m_list); + pool->op->destroy(pool, ibmr); + kfree(ibmr); + nfreed++; } - spin_lock_irqsave(&pool->list_lock, flags); - list_splice(&unmap_list, &pool->clean_list); - spin_unlock_irqrestore(&pool->list_lock, flags); + /* Anything that remains are laundered ibmrs, which we can add + * back to the clean list. 
*/ + if (!list_empty(&unmap_list)) { + spin_lock_irqsave(&pool->list_lock, flags); + list_splice(&unmap_list, &pool->clean_list); + spin_unlock_irqrestore(&pool->list_lock, flags); + } - atomic_sub(unpinned, &pool->free_pinned); atomic_sub(ncleaned, &pool->dirty_count); atomic_sub(nfreed, &pool->item_count); -out: mutex_unlock(&pool->flush_lock); return ret; } @@ -540,24 +687,14 @@ void rds_ib_mr_pool_flush_worker(struct work_struct *work) void rds_ib_free_mr(void *trans_private, int invalidate) { struct rds_ib_mr *ibmr = trans_private; - struct rds_ib_device *rds_ibdev = ibmr->device; - struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool; - unsigned long flags; + struct rds_ib_mr_pool *pool = ibmr->device->mr_pool; - rdsdebug("RDS/IB: free_mr nents %u\n", ibmr->sg_len); + rdsdebug("RDS/IB: free_mr nents %u\n", ibmr->mapping.m_sg.len); if (!pool) return; /* Return it to the pool's free list */ - spin_lock_irqsave(&pool->list_lock, flags); - if (ibmr->remap_count >= pool->fmr_attr.max_maps) { - list_add(&ibmr->list, &pool->drop_list); - } else { - list_add(&ibmr->list, &pool->free_list); - } - atomic_add(ibmr->sg_len, &pool->free_pinned); - atomic_inc(&pool->dirty_count); - spin_unlock_irqrestore(&pool->list_lock, flags); + pool->op->free(pool, ibmr); /* If we've pinned too many pages, request a flush */ if (atomic_read(&pool->free_pinned) >= pool->max_free_pinned @@ -588,36 +725,39 @@ void rds_ib_flush_mrs(void) } void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents, - __be32 ip_addr, u32 *key_ret) + struct rds_sock *rs, u32 *key_ret) { struct rds_ib_device *rds_ibdev; + struct rds_ib_mr_pool *pool; struct rds_ib_mr *ibmr = NULL; + struct ib_qp *qp; int ret; - rds_ibdev = rds_ib_get_device(ip_addr); - if (!rds_ibdev) { + ret = rds_ib_get_device(rs, &rds_ibdev, &qp); + if (ret || !qp) { ret = -ENODEV; goto out; } - if (!rds_ibdev->mr_pool) { + if (!(pool = rds_ibdev->mr_pool)) { ret = -ENODEV; goto out; } - ibmr = rds_ib_alloc_fmr(rds_ibdev); + ibmr = 
rds_ib_alloc_mr(rds_ibdev); if (IS_ERR(ibmr)) return ibmr; - ret = rds_ib_map_fmr(rds_ibdev, ibmr, sg, nents); + ibmr->qp = qp; + ibmr->device = rds_ibdev; + + ret = pool->op->map(pool, ibmr, sg, nents); if (ret == 0) - *key_ret = ibmr->fmr->rkey; + *key_ret = rds_ibdev->dev->node_type == RDMA_NODE_RNIC ? ibmr->fr_mr->rkey : ibmr->u.fmr->rkey; else - printk(KERN_WARNING "RDS/IB: map_fmr failed (errno=%d)\n", ret); - - ibmr->device = rds_ibdev; + printk(KERN_WARNING "RDS/IB: failed to map mr (errno=%d)\n", ret); - out: +out: if (ret) { if (ibmr) rds_ib_free_mr(ibmr, 0); @@ -625,3 +765,359 @@ void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents, } return ibmr; } + +/* + * This is the code that implements RDS memory registrations + * through FMRs. + */ +static int rds_ib_init_fmr(struct rds_ib_mr_pool *pool, + struct rds_ib_mr *ibmr) +{ + struct rds_ib_device *rds_ibdev = pool->device; + struct ib_fmr *fmr; + + fmr = ib_alloc_fmr(rds_ibdev->pd, + (IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_READ | + IB_ACCESS_REMOTE_WRITE), + &pool->fmr_attr); + if (IS_ERR(fmr)) { + int err = PTR_ERR(fmr); + + printk(KERN_WARNING "RDS/IB: ib_alloc_fmr failed (err=%d)\n", err); + return err; + } + + ibmr->u.fmr = fmr; + return 0; +} + +static int rds_ib_map_fmr(struct rds_ib_mr_pool *pool, struct rds_ib_mr *ibmr, + struct scatterlist *sg, unsigned int nents) +{ + struct rds_ib_device *rds_ibdev = pool->device; + struct rds_ib_scatterlist ibsg; + u64 *dma_pages; + int ret; + + rds_ib_set_scatterlist(&ibsg, sg, nents); + + dma_pages = rds_ib_map_scatterlist(rds_ibdev, &ibsg, rds_ibdev->fmr_page_shift); + if (IS_ERR(dma_pages)) + return PTR_ERR(dma_pages); + + ret = ib_map_phys_fmr(ibmr->u.fmr, dma_pages, ibsg.dma_npages, 0); + if (ret) { + rds_ib_drop_scatterlist(rds_ibdev, &ibsg); + goto out; + } + + /* Success - we successfully remapped the MR, so we can + * safely tear down the old mapping. 
*/ + rds_ib_teardown_mr(ibmr); + + ibmr->mapping.m_sg = ibsg; + ibmr->remap_count++; + + rds_ib_stats_inc(s_ib_rdma_mr_used); + ret = 0; + +out: + kfree(dma_pages); + + return ret; +} + +static void rds_ib_free_fmr(struct rds_ib_mr_pool *pool, struct rds_ib_mr *ibmr) +{ + unsigned long flags; + + /* MRs that have reached their maximum remap count get queued + * to the head of the list. + */ + spin_lock_irqsave(&pool->list_lock, flags); + if (ibmr->remap_count >= pool->fmr_attr.max_maps) { + list_add(&ibmr->mapping.m_list, &pool->dirty_list); + } else { + list_add_tail(&ibmr->mapping.m_list, &pool->dirty_list); + } + atomic_add(ibmr->mapping.m_sg.len, &pool->free_pinned); + atomic_inc(&pool->dirty_count); + spin_unlock_irqrestore(&pool->list_lock, flags); +} + +static unsigned int rds_ib_unmap_fmr_list(struct rds_ib_mr_pool *pool, + struct list_head *unmap_list, + struct list_head *kill_list) +{ + struct rds_ib_mapping *mapping, *next; + struct rds_ib_mr *ibmr; + LIST_HEAD(fmr_list); + unsigned long unpinned = 0, ncleaned = 0; + int ret; + + /* String all ib_mr's onto one list and hand them to ib_unmap_fmr */ + list_for_each_entry_safe(mapping, next, unmap_list, m_list) { + ibmr = mapping->m_mr; + + list_add(&ibmr->u.fmr->list, &fmr_list); + } + ret = ib_unmap_fmr(&fmr_list); + if (ret) + printk(KERN_WARNING "RDS/IB: ib_unmap_fmr failed (err=%d)\n", ret); + + /* Now we can destroy the DMA mapping and unpin any pages */ + list_for_each_entry_safe(mapping, next, unmap_list, m_list) { + ibmr = mapping->m_mr; + + unpinned += rds_ib_drop_scatterlist(ibmr->device, &mapping->m_sg); + if (ibmr->remap_count >= pool->fmr_attr.max_maps) + list_move(&mapping->m_list, kill_list); + ncleaned++; + } + + atomic_sub(unpinned, &pool->free_pinned); + return ncleaned; +} + +static void rds_ib_destroy_fmr(struct rds_ib_mr_pool *pool, + struct rds_ib_mr *ibmr) +{ + if (ibmr->u.fmr) + ib_dealloc_fmr(ibmr->u.fmr); + ibmr->u.fmr = NULL; +} + +/* + * iWARP fastreg handling + * + * The life 
cycle of a fastreg registration is a bit different from + * FMRs. + * The idea behind fastreg is to have one MR, to which we bind different + * mappings over time. To avoid stalling on the expensive map and invalidate + * operations, these operations are pipelined on the same send queue on + * which we want to send the message containing the r_key. + * + * This creates a bit of a problem for us, as we do not have the destination + * IP in GET_MR, so the connection must be set up prior to the GET_MR call for + * RDMA to be correctly set up. If a fastreg request is present, rds_ib_xmit + * will try to queue a LOCAL_INV (if needed) and a FAST_REG_MR work request + * before queuing the SEND. When completions for these arrive, they are + * dispatched to the MR, which has a bit set to show that RDMA can be performed. + * + * There is another interesting aspect that's related to invalidation. + * The application can request that a mapping is invalidated in FREE_MR. + * The expectation there is that this invalidation step includes ALL + * PREVIOUSLY FREED MRs. 
+ */ +static int rds_ib_init_fastreg(struct rds_ib_mr_pool *pool, + struct rds_ib_mr *ibmr) +{ + struct rds_ib_device *rds_ibdev = pool->device; + struct rds_ib_mapping *mapping = &ibmr->mapping; + struct ib_fast_reg_page_list *page_list = NULL; + struct ib_mr *mr; + int err; + + mr = ib_alloc_fast_reg_mr(rds_ibdev->pd, pool->max_message_size); + if (IS_ERR(mr)) { + err = PTR_ERR(mr); + + printk(KERN_WARNING "RDS/IB: ib_alloc_fast_reg_mr failed (err=%d)\n", err); + return err; + } + + page_list = ib_alloc_fast_reg_page_list(rds_ibdev->dev, mapping->m_sg.dma_npages); + if (IS_ERR(page_list)) { + err = PTR_ERR(page_list); + + printk(KERN_WARNING "RDS/IB: ib_alloc_fast_reg_page_list failed (err=%d)\n", err); + ib_dereg_mr(mr); + return err; + } + + ibmr->fr_page_list = page_list; + ibmr->fr_mr = mr; + return 0; +} + +static int rds_ib_rdma_build_fastreg(struct ib_qp *qp, struct rds_ib_mapping *mapping) +{ + struct rds_ib_mr *ibmr = mapping->m_mr; + struct ib_send_wr f_wr, *failed_wr; + int ret; + + /* + * Perform a WR for the fast_reg_mr. Each individual page + * in the sg list is added to the fast reg page list and placed + * inside the fast_reg_mr WR. The key used is a rolling 8bit + * counter, which should guarantee uniqueness. 
+ */ + ib_update_fast_reg_key(ibmr->fr_mr, ibmr->remap_count++); + mapping->m_rkey = ibmr->fr_mr->rkey; + + memset(&f_wr, 0, sizeof(f_wr)); + f_wr.opcode = IB_WR_FAST_REG_MR; + f_wr.wr.fast_reg.length = mapping->m_sg.bytes; + f_wr.wr.fast_reg.rkey = mapping->m_rkey; + f_wr.wr.fast_reg.page_list = ibmr->fr_page_list; + f_wr.wr.fast_reg.page_list_len = mapping->m_sg.dma_len; + f_wr.wr.fast_reg.page_shift = ibmr->device->fmr_page_shift; + f_wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE | + IB_ACCESS_REMOTE_READ | + IB_ACCESS_REMOTE_WRITE; + f_wr.wr.fast_reg.iova_start = 0; + f_wr.send_flags = IB_SEND_SIGNALED; + + failed_wr = &f_wr; + ret = ib_post_send(qp, &f_wr, &failed_wr); + BUG_ON(failed_wr != &f_wr); + if (ret) { + printk(KERN_WARNING "RDS/IB: %s %d ib_post_send returned %d\n", + __func__, __LINE__, ret); + goto out; + } + +out: + return ret; +} + +int rds_ib_rdma_fastreg_inv(struct rds_ib_mr *ibmr) +{ + struct ib_send_wr s_wr, *failed_wr; + int ret; + + if (!ibmr->qp || !ibmr->fr_mr) + goto out; + + memset(&s_wr, 0, sizeof(s_wr)); + s_wr.opcode = IB_WR_LOCAL_INV; + s_wr.ex.invalidate_rkey = ibmr->fr_mr->rkey; + s_wr.send_flags = IB_SEND_SIGNALED; + + failed_wr = &s_wr; + ret = ib_post_send(ibmr->qp, &s_wr, &failed_wr); + if (ret) { + printk(KERN_WARNING "RDS/IB: %s %d ib_post_send returned %d\n", + __func__, __LINE__, ret); + goto out; + } +out: + return ret; +} + +static int rds_ib_map_fastreg(struct rds_ib_mr_pool *pool, + struct rds_ib_mr *ibmr, + struct scatterlist *sg, + unsigned int sg_len) +{ + struct rds_ib_device *rds_ibdev = pool->device; + struct rds_ib_mapping *mapping = &ibmr->mapping; + u64 *dma_pages; + int i, ret; + + rds_ib_set_scatterlist(&mapping->m_sg, sg, sg_len); + + dma_pages = rds_ib_map_scatterlist(rds_ibdev, + &mapping->m_sg, + rds_ibdev->fmr_page_shift); + if (IS_ERR(dma_pages)) { + ret = PTR_ERR(dma_pages); + dma_pages = NULL; + goto out; + } + + if (mapping->m_sg.dma_len > pool->max_message_size) { + 
printk("mapping->m_sg.dma_len > pool->max_message_size\n"); + ret = -EMSGSIZE; + goto out; + } + + for (i = 0; i < mapping->m_sg.dma_npages; ++i) + ibmr->fr_page_list->page_list[i] = dma_pages[i]; + + rds_ib_rdma_build_fastreg(ibmr->qp, mapping); + + rds_ib_stats_inc(s_ib_rdma_mr_used); + ret = 0; + +out: + kfree(dma_pages); + + return ret; +} + +/* + * "Free" a fastreg MR. + */ +static void rds_ib_free_fastreg(struct rds_ib_mr_pool *pool, + struct rds_ib_mr *ibmr) +{ + unsigned long flags; + + if (!ibmr->mapping.m_sg.dma_len) + return; + + rds_ib_rdma_fastreg_inv(ibmr); + + /* Try to post the LOCAL_INV WR to the queue. */ + spin_lock_irqsave(&pool->list_lock, flags); + + list_add_tail(&ibmr->mapping.m_list, &pool->dirty_list); + atomic_add(ibmr->mapping.m_sg.len, &pool->free_pinned); + atomic_inc(&pool->dirty_count); + + spin_unlock_irqrestore(&pool->list_lock, flags); +} + +static unsigned int rds_ib_unmap_fastreg_list(struct rds_ib_mr_pool *pool, + struct list_head *unmap_list, + struct list_head *kill_list) +{ + struct rds_ib_mapping *mapping, *next; + unsigned int ncleaned = 0; + LIST_HEAD(laundered); + + /* Batched invalidation of fastreg MRs. + * Why do we do it this way, even though we could pipeline unmap + * and remap? The reason is the application semantics - when the + * application requests an invalidation of MRs, it expects all + * previously released R_Keys to become invalid. + * + * If we implement MR reuse naively, we risk memory corruption + * (this has actually been observed). So the default behavior + * requires that a MR goes through an explicit unmap operation before + * we can reuse it again. + * + * We could probably improve on this a little, by allowing immediate + * reuse of a MR on the same socket (eg you could add a small + * cache of unused MRs to struct rds_socket - GET_MR could grab one + * of these without requiring an explicit invalidate). 
+ */ + while (!list_empty(unmap_list)) { + unsigned long flags; + + spin_lock_irqsave(&pool->list_lock, flags); + list_for_each_entry_safe(mapping, next, unmap_list, m_list) { + list_move(&mapping->m_list, &laundered); + ncleaned++; + } + spin_unlock_irqrestore(&pool->list_lock, flags); + } + + /* Move all laundered mappings back to the unmap list. + * We do not kill any WRs right now - it doesn't seem the + * fastreg API has a max_remap limit. */ + list_splice_init(&laundered, unmap_list); + + return ncleaned; +} + +static void rds_ib_destroy_fastreg(struct rds_ib_mr_pool *pool, + struct rds_ib_mr *ibmr) +{ + if (ibmr->u.fastreg.page_list) + ib_free_fast_reg_page_list(ibmr->u.fastreg.page_list); + if (ibmr->u.fastreg.mr) + ib_dereg_mr(ibmr->u.fastreg.mr); +} diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c index 1da664e..2ca48d5 100644 --- a/net/rds/ib_recv.c +++ b/net/rds/ib_recv.c @@ -97,12 +97,12 @@ void rds_ib_recv_init_ring(struct rds_ib_connection *ic) sge = rds_ib_data_sge(ic, recv->r_sge); sge->addr = 0; sge->length = RDS_FRAG_SIZE; - sge->lkey = rds_ib_local_dma_lkey(ic); + sge->lkey = 0; sge = rds_ib_header_sge(ic, recv->r_sge); sge->addr = ic->i_recv_hdrs_dma + (i * sizeof(struct rds_header)); sge->length = sizeof(struct rds_header); - sge->lkey = rds_ib_local_dma_lkey(ic); + sge->lkey = 0; } } diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c index 9058aaf..ecefc04 100644 --- a/net/rds/ib_send.c +++ b/net/rds/ib_send.c @@ -135,7 +135,9 @@ void rds_ib_send_init_ring(struct rds_ib_connection *ic) send->s_rm = NULL; send->s_op = NULL; + send->s_mapping = NULL; + send->s_wr.next = NULL; send->s_wr.wr_id = i; send->s_wr.sg_list = send->s_sge; send->s_wr.num_sge = 1; @@ -144,12 +146,29 @@ void rds_ib_send_init_ring(struct rds_ib_connection *ic) send->s_wr.ex.imm_data = 0; sge = rds_ib_data_sge(ic, send->s_sge); - sge->lkey = rds_ib_local_dma_lkey(ic); + sge->lkey = 0; sge = rds_ib_header_sge(ic, send->s_sge); sge->addr = ic->i_send_hdrs_dma + (i * 
sizeof(struct rds_header)); sge->length = sizeof(struct rds_header); - sge->lkey = rds_ib_local_dma_lkey(ic); + sge->lkey = 0; + + if (ic->i_iwarp) { + send->s_mr = ib_alloc_fast_reg_mr(ic->i_pd, fmr_message_size); + if (IS_ERR(send->s_mr)) { + printk(KERN_WARNING "RDS/IB: ib_alloc_fast_reg_mr failed\n"); + break; + } + + send->s_page_list = ib_alloc_fast_reg_page_list(ic->i_cm_id->device, RDS_IB_MAX_SGE); + if (IS_ERR(send->s_page_list)) { + printk(KERN_WARNING "RDS/IB: ib_alloc_fast_reg_page_list failed\n"); + break; + } + } else { + send->s_mr = NULL; + send->s_page_list = NULL; + } } } @@ -165,6 +184,11 @@ void rds_ib_send_clear_ring(struct rds_ib_connection *ic) rds_ib_send_unmap_rm(ic, send, IB_WC_WR_FLUSH_ERR); if (send->s_op) rds_ib_send_unmap_rdma(ic, send->s_op); + if (send->s_mr) + ib_dereg_mr(send->s_mr); + if (send->s_page_list) + ib_free_fast_reg_page_list(send->s_page_list); + } } @@ -192,12 +216,27 @@ void rds_ib_send_cq_comp_handler(struct ib_cq *cq, void *context) rdsdebug("ib_req_notify_cq send failed: %d\n", ret); } - while (ib_poll_cq(cq, 1, &wc) > 0 ) { + while (ib_poll_cq(cq, 1, &wc) > 0) { rdsdebug("wc wr_id 0x%llx status %u byte_len %u imm_data %u\n", (unsigned long long)wc.wr_id, wc.status, wc.byte_len, be32_to_cpu(wc.ex.imm_data)); rds_ib_stats_inc(s_ib_tx_cq_event); + if (wc.status != IB_WC_SUCCESS) { + printk("WC Error: status = %d opcode = %d\n", wc.status, wc.opcode); + break; + } + + if (wc.opcode == IB_WC_LOCAL_INV && wc.wr_id == 0) { + ic->i_fastreg_posted = 0; + continue; + } + + if (wc.opcode == IB_WC_FAST_REG_MR && wc.wr_id == 0) { + ic->i_fastreg_posted = 1; + continue; + } + if (wc.wr_id == RDS_IB_ACK_WR_ID) { if (ic->i_ack_queued + HZ/2 < jiffies) rds_ib_stats_inc(s_ib_tx_stalled); @@ -218,8 +257,10 @@ void rds_ib_send_cq_comp_handler(struct ib_cq *cq, void *context) if (send->s_rm) rds_ib_send_unmap_rm(ic, send, wc.status); break; + case IB_WR_FAST_REG_MR: case IB_WR_RDMA_WRITE: case IB_WR_RDMA_READ: + case 
IB_WR_RDMA_READ_WITH_INV: /* Nothing to be done - the SG list will be unmapped * when the SEND completes. */ break; @@ -475,6 +516,14 @@ int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm, BUG_ON(off % RDS_FRAG_SIZE); BUG_ON(hdr_off != 0 && hdr_off != sizeof(struct rds_header)); + /* Fastreg support */ + if (rds_rdma_cookie_key(rm->m_rdma_cookie) + && ic->i_fastreg + && !ic->i_fastreg_posted) { + ret = -EAGAIN; + goto out; + } + /* FIXME we may overallocate here */ if (be32_to_cpu(rm->m_inc.i_hdr.h_len) == 0) i = 1; @@ -483,6 +532,7 @@ int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm, work_alloc = rds_ib_ring_alloc(&ic->i_send_ring, i, &pos); if (work_alloc == 0) { + printk("%s line %d: ENOMEM\n", __func__, __LINE__); set_bit(RDS_LL_SEND_FULL, &conn->c_flags); rds_ib_stats_inc(s_ib_tx_ring_full); ret = -ENOMEM; @@ -702,6 +752,40 @@ out: return ret; } +static int rds_ib_build_send_fastreg(struct rds_ib_device *rds_ibdev, struct rds_ib_connection *ic, struct rds_ib_send_work *send, int nent, int len, u64 sg_addr) +{ + struct ib_send_wr *failed_wr; + int ret; + + /* + * Perform a WR for the fast_reg_mr. Each individual page + * in the sg list is added to the fast reg page list and placed + * inside the fast_reg_mr WR. 
+ */ + send->s_wr.opcode = IB_WR_FAST_REG_MR; + send->s_wr.wr.fast_reg.length = len; + send->s_wr.wr.fast_reg.rkey = send->s_mr->rkey; + send->s_wr.wr.fast_reg.page_list = send->s_page_list; + send->s_wr.wr.fast_reg.page_list_len = nent; + send->s_wr.wr.fast_reg.page_shift = rds_ibdev->fmr_page_shift; + send->s_wr.wr.fast_reg.access_flags = IB_ACCESS_REMOTE_WRITE; + send->s_wr.wr.fast_reg.iova_start = sg_addr; + + failed_wr = &send->s_wr; + ret = ib_post_send(ic->i_cm_id->qp, &send->s_wr, &failed_wr); + BUG_ON(failed_wr != &send->s_wr); + if (ret) { + printk(KERN_WARNING "RDS/IB: %s %d ib_post_send returned %d\n", + __func__, __LINE__, ret); + goto out; + } + + ib_update_fast_reg_key(send->s_mr, send->s_remap_count++); + +out: + return ret; +} + int rds_ib_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op) { struct rds_ib_connection *ic = conn->c_transport_data; @@ -713,7 +797,7 @@ int rds_ib_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op) struct scatterlist *scat; unsigned long len; u64 remote_addr = op->r_remote_addr; - u32 pos; + u32 pos, fr_pos; u32 work_alloc; u32 i; u32 j; @@ -738,6 +822,18 @@ int rds_ib_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op) op->r_mapped = 1; } + if (!op->r_write && ic->i_iwarp) { + /* Alloc space on the send queue for the fastreg */ + work_alloc = rds_ib_ring_alloc(&ic->i_send_ring, 1, &fr_pos); + if (work_alloc != 1) { + printk("%s line %d: ENOMEM\n", __func__, __LINE__); + rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc); + rds_ib_stats_inc(s_ib_tx_ring_full); + ret = -ENOMEM; + goto out; + } + } + /* * Instead of knowing how to return a partial rdma read/write we insist that there * be enough work requests to send the entire message. 
@@ -746,6 +842,7 @@ int rds_ib_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op) work_alloc = rds_ib_ring_alloc(&ic->i_send_ring, i, &pos); if (work_alloc != i) { + printk("%s line %d: ENOMEM\n", __func__, __LINE__); rds_ib_ring_unalloc(&ic->i_send_ring, work_alloc); rds_ib_stats_inc(s_ib_tx_ring_full); ret = -ENOMEM; @@ -759,9 +856,10 @@ int rds_ib_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op) sent = 0; num_sge = op->r_count; - for ( i = 0; i < work_alloc && scat != &op->r_sg[op->r_count]; i++ ) { + for (i = 0; i < work_alloc && scat != &op->r_sg[op->r_count]; i++) { send->s_wr.send_flags = 0; send->s_queued = jiffies; + /* * We want to delay signaling completions just enough to get * the batching benefits but not so much that we create dead time on the wire. @@ -771,7 +869,17 @@ int rds_ib_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op) send->s_wr.send_flags = IB_SEND_SIGNALED; } - send->s_wr.opcode = op->r_write ? IB_WR_RDMA_WRITE : IB_WR_RDMA_READ; + /* To avoid the need to have the plumbing to invalidate the fastreg_mr used + * for local access after RDS is finished with it, using + * IB_WR_RDMA_READ_WITH_INV will invalidate it after the read has completed. 
+ */ + if (op->r_write) + send->s_wr.opcode = IB_WR_RDMA_WRITE; + else if (ic->i_iwarp) + send->s_wr.opcode = IB_WR_RDMA_READ_WITH_INV; + else + send->s_wr.opcode = IB_WR_RDMA_READ; + send->s_wr.wr.rdma.remote_addr = remote_addr; send->s_wr.wr.rdma.rkey = op->r_key; send->s_op = op; @@ -779,8 +887,7 @@ int rds_ib_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op) if (num_sge > rds_ibdev->max_sge) { send->s_wr.num_sge = rds_ibdev->max_sge; num_sge -= rds_ibdev->max_sge; - } - else + } else send->s_wr.num_sge = num_sge; send->s_wr.next = NULL; @@ -792,15 +899,25 @@ int rds_ib_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op) len = sg_dma_len(scat); send->s_sge[j].addr = sg_dma_address(scat); send->s_sge[j].length = len; - send->s_sge[j].lkey = rds_ib_local_dma_lkey(ic); + + if (send->s_wr.opcode == IB_WR_RDMA_READ_WITH_INV) + send->s_page_list->page_list[j] = sg_dma_address(scat); + else + send->s_sge[j].lkey = rds_ib_local_dma_lkey(ic); sent += len; rdsdebug("ic %p sent %d remote_addr %llu\n", ic, sent, remote_addr); - remote_addr += sg_dma_len(scat); + scat++; } + if (send->s_wr.opcode == IB_WR_RDMA_READ_WITH_INV) { + send->s_wr.num_sge = 1; + send->s_sge[0].addr = conn->c_xmit_rm->m_rs->rs_user_addr; + send->s_sge[0].lkey = ((struct rds_ib_send_work)ic->i_sends[fr_pos]).s_mr->lkey; + } + rdsdebug("send %p wr %p num_sge %u next %p\n", send, &send->s_wr, send->s_wr.num_sge, send->s_wr.next); @@ -809,6 +926,15 @@ int rds_ib_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op) send = ic->i_sends; } + /* On iWARP, local memory access by a remote system (ie, RDMA Read) is not + * recommended. Putting the lkey on the wire is a security hole, as it can + * allow for memory access to all of memory on the remote system. Some + * adapters do not allow using the lkey for this at all. 
To bypass this use a + * fastreg_mr (or possibly a dma_mr) + */ + if (!op->r_write && ic->i_iwarp) + rds_ib_build_send_fastreg(rds_ibdev, ic, &ic->i_sends[fr_pos], op->r_count, sent, conn->c_xmit_rm->m_rs->rs_user_addr); + /* if we finished the message then send completion owns it */ if (scat == &op->r_sg[op->r_count]) { prev->s_wr.send_flags = IB_SEND_SIGNALED; @@ -831,12 +957,6 @@ int rds_ib_xmit_rdma(struct rds_connection *conn, struct rds_rdma_op *op) goto out; } - if (unlikely(failed_wr != &first->s_wr)) { - printk(KERN_WARNING "RDS/IB: ib_post_send() rc=%d, but failed_wqe updated!\n", ret); - BUG_ON(failed_wr != &first->s_wr); - } - - out: return ret; } diff --git a/net/rds/message.c b/net/rds/message.c index 9269b9a..ddeb95b 100644 --- a/net/rds/message.c +++ b/net/rds/message.c @@ -71,6 +71,8 @@ static void rds_message_purge(struct rds_message *rm) if (rm->m_rdma_op) rds_rdma_free_op(rm->m_rdma_op); + if (rm->m_rdma_mr) + rds_mr_put(rm->m_rdma_mr); } void rds_message_inc_purge(struct rds_incoming *inc) diff --git a/net/rds/rdma.c b/net/rds/rdma.c index 1f1039e..4d26246 100644 --- a/net/rds/rdma.c +++ b/net/rds/rdma.c @@ -116,11 +116,8 @@ static void rds_destroy_mr(struct rds_mr *mr) mr->r_trans->free_mr(trans_private, mr->r_invalidate); } -static void rds_mr_put(struct rds_mr *mr) +void __rds_put_mr_final(struct rds_mr *mr) { - if (!atomic_dec_and_test(&mr->r_refcount)) - return; - rds_destroy_mr(mr); kfree(mr); } @@ -169,7 +166,7 @@ static int rds_pin_pages(unsigned long user_addr, unsigned int nr_pages, } static int __rds_rdma_map(struct rds_sock *rs, struct rds_get_mr_args *args, - u64 *cookie_ret) + u64 *cookie_ret, struct rds_mr **mr_ret) { struct rds_mr *mr = NULL, *found; unsigned int nr_pages; @@ -257,8 +254,7 @@ static int __rds_rdma_map(struct rds_sock *rs, struct rds_get_mr_args *args, * s/g list is now owned by the MR. * Note that dma_map() implies that pending writes are * flushed to RAM, so no dma_sync is needed here. 
*/ - trans_private = rs->rs_transport->get_mr(sg, nents, - rs->rs_bound_addr, + trans_private = rs->rs_transport->get_mr(sg, nents, rs, &mr->r_key); if (IS_ERR(trans_private)) { @@ -296,6 +292,10 @@ static int __rds_rdma_map(struct rds_sock *rs, struct rds_get_mr_args *args, BUG_ON(found && found != mr); rdsdebug("RDS: get_mr key is %x\n", mr->r_key); + if (mr_ret) { + atomic_inc(&mr->r_refcount); + *mr_ret = mr; + } ret = 0; out: @@ -317,7 +317,7 @@ int rds_get_mr(struct rds_sock *rs, char __user *optval, int optlen) sizeof(struct rds_get_mr_args))) return -EFAULT; - return __rds_rdma_map(rs, &args, NULL); + return __rds_rdma_map(rs, &args, NULL, NULL); } /* @@ -542,6 +542,8 @@ static struct rds_rdma_op *rds_rdma_prepare(struct rds_sock *rs, goto out; } + rs->rs_user_addr = vec.addr; + /* did the user change the vec under us? */ if (nr > max_pages || op->r_nents + nr > nr_pages) { ret = -EINVAL; @@ -655,7 +657,7 @@ int rds_cmsg_rdma_dest(struct rds_sock *rs, struct rds_message *rm, if (mr) { mr->r_trans->sync_mr(mr->r_trans_private, DMA_TO_DEVICE); - rds_mr_put(mr); + rm->m_rdma_mr = mr; } return err; } @@ -673,5 +675,5 @@ int rds_cmsg_rdma_map(struct rds_sock *rs, struct rds_message *rm, || rm->m_rdma_cookie != 0) return -EINVAL; - return __rds_rdma_map(rs, CMSG_DATA(cmsg), &rm->m_rdma_cookie); + return __rds_rdma_map(rs, CMSG_DATA(cmsg), &rm->m_rdma_cookie, &rm->m_rdma_mr); } diff --git a/net/rds/rdma.h b/net/rds/rdma.h index b1734a0..4878db6 100644 --- a/net/rds/rdma.h +++ b/net/rds/rdma.h @@ -22,7 +22,7 @@ struct rds_mr { * bit field here, but we need to use test_and_set_bit. 
*/ unsigned long r_state; - struct rds_sock * r_sock; /* back pointer to the socket that owns us */ + struct rds_sock *r_sock; /* back pointer to the socket that owns us */ struct rds_transport *r_trans; void *r_trans_private; }; @@ -74,4 +74,11 @@ int rds_cmsg_rdma_map(struct rds_sock *rs, struct rds_message *rm, void rds_rdma_free_op(struct rds_rdma_op *ro); void rds_rdma_send_complete(struct rds_message *rm, int); +extern void __rds_put_mr_final(struct rds_mr *mr); +static inline void rds_mr_put(struct rds_mr *mr) +{ + if (atomic_dec_and_test(&mr->r_refcount)) + __rds_put_mr_final(mr); +} + #endif diff --git a/net/rds/rds.h b/net/rds/rds.h index 235c951..68726ee 100644 --- a/net/rds/rds.h +++ b/net/rds/rds.h @@ -122,7 +122,7 @@ struct rds_connection { __be32 c_laddr; __be32 c_faddr; unsigned int c_loopback : 1; - struct rds_connection * c_passive; + struct rds_connection *c_passive; struct rds_cong_map *c_lcong; struct rds_cong_map *c_fcong; @@ -297,6 +297,7 @@ struct rds_message { struct rds_sock *m_rs; struct rds_rdma_op *m_rdma_op; rds_rdma_cookie_t m_rdma_cookie; + struct rds_mr *m_rdma_mr; unsigned int m_nents; unsigned int m_count; struct scatterlist m_sg[0]; @@ -373,7 +374,7 @@ struct rds_transport { unsigned int avail); void (*exit)(void); void *(*get_mr)(struct scatterlist *sg, unsigned long nr_sg, - __be32 ip_addr, u32 *key_ret); + struct rds_sock *rs, u32 *key_ret); void (*sync_mr)(void *trans_private, int direction); void (*free_mr)(void *trans_private, int invalidate); void (*flush_mrs)(void); @@ -387,6 +388,7 @@ struct rds_sock { struct sock *rs_sk; #endif + u64 rs_user_addr; /* * bound_addr used for both incoming and outgoing, no INADDR_ANY * support. 
diff --git a/net/rds/send.c b/net/rds/send.c index 20d3e52..406ff64 100644 --- a/net/rds/send.c +++ b/net/rds/send.c @@ -772,6 +772,9 @@ static int rds_cmsg_send(struct rds_sock *rs, struct rds_message *rm, if (cmsg->cmsg_level != SOL_RDS) continue; + /* As a side effect, RDMA_DEST and RDMA_MAP will set + * rm->m_rdma_cookie and rm->m_rdma_mr. + */ switch (cmsg->cmsg_type) { case RDS_CMSG_RDMA_ARGS: ret = rds_cmsg_rdma_args(rs, rm, cmsg); From rdreier at cisco.com Fri Oct 10 14:37:53 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 10 Oct 2008 14:37:53 -0700 Subject: [ofa-general] Re: [PATCH] RDMA/cxgb3: Remove cmid reference on tid allocation failures. In-Reply-To: <20081010192128.17278.8317.stgit@dell3.ogc.int> (Steve Wise's message of "Fri, 10 Oct 2008 14:21:28 -0500") References: <20081010192128.17278.8317.stgit@dell3.ogc.int> Message-ID: thanks, applied From rdreier at cisco.com Fri Oct 10 14:41:39 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 10 Oct 2008 14:41:39 -0700 Subject: [ofa-general] Re: [PATCH 1/1] IB/ehca: Disallow creating UC QP with SRQ In-Reply-To: <200810011306.31544.hnguyen@linux.vnet.ibm.com> (Hoang-Nam Nguyen's message of "Wed, 1 Oct 2008 13:06:31 +0200") References: <200810011306.31544.hnguyen@linux.vnet.ibm.com> Message-ID: thanks, applied -- it didn't apply to the latest tree, because of the flush CQE changes, so I merged it manually as below -- let me know if this is wrong: commit 0540bbbe455e123a1692d26205ad1a29983883b0 Author: Hoang-Nam Nguyen Date: Fri Oct 10 14:40:39 2008 -0700 IB/ehca: Don't allow creating UC QP with SRQ This patch prevents a UC QP to be created attached to an SRQ, since current firmware does not support this feature. 
Signed-off-by: Michael Faath Signed-off-by: Roland Dreier diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c index 4dbe287..40b578d 100644 --- a/drivers/infiniband/hw/ehca/ehca_qp.c +++ b/drivers/infiniband/hw/ehca/ehca_qp.c @@ -502,6 +502,12 @@ static struct ehca_qp *internal_create_qp( if (init_attr->srq) { my_srq = container_of(init_attr->srq, struct ehca_qp, ib_srq); + if (qp_type == IB_QPT_UC) { + ehca_err(pd->device, "UC with SRQ not supported"); + atomic_dec(&shca->num_qps); + return ERR_PTR(-EINVAL); + } + has_srq = 1; parms.ext_type = EQPT_SRQBASE; parms.srq_qpn = my_srq->real_qp_num; From andy.grover at oracle.com Fri Oct 10 14:46:37 2008 From: andy.grover at oracle.com (Andy Grover) Date: Fri, 10 Oct 2008 14:46:37 -0700 Subject: [ofa-general] Re: [RFC] rds: iWARP RDMA enablement In-Reply-To: <20081010211753.GA20735@opengridcomputing.com> References: <20081010211753.GA20735@opengridcomputing.com> Message-ID: <48EFCD3D.3040902@oracle.com> Jon Mason wrote: > Hey Andy, > This patch contains all of the changes needed to get rds-rdma working on > iWARP (with one FIXME left). This patch will apply to a stock OFED-1.4 > kernel, and includes the patch I sent out previously with changes to > enable RDMA READs. While not complete, I wanted to sendout code for > review to help me diagnose any coding, design, or style errors. > > The remaining FIXME in the code is allowing for multiple rds connections > (i.e., QPs) from the same host. > > This patch contains a stress bug that I have yet to determine root > cause. Running rds-stress, large RDMA payloads may cause memory > corruption. This is most likely caused by running over the bounds of > one of the RDS rings under stress, as my rds-simple tests run without > problems for long runs. > > Let me know what you think. Thanks, applied. 
-- Andy From kliteyn at dev.mellanox.co.il Fri Oct 10 15:31:36 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sat, 11 Oct 2008 00:31:36 +0200 Subject: [ofa-general] [PATCH v4] ibutils/ibis: prevent buffer overflows Message-ID: <48EFD7C8.5070700@dev.mellanox.co.il> Oren, [v4 of the patch - fixing the signed-off] As discovered by Sasha Khapyorsky , fixing two buffer overflows in the .i files. Note that one of them is in typemap of char array, which makes me wonder how this thing even worked... Please regenerate wrappers after this patch. Signed-off-by: Yevgeny Kliteynik --- ibis/src/ibis_typemaps.i | 2 +- ibis/src/ibsm.i | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/ibis/src/ibis_typemaps.i b/ibis/src/ibis_typemaps.i index b9449d2..4855e85 100644 --- a/ibis/src/ibis_typemaps.i +++ b/ibis/src/ibis_typemaps.i @@ -157,7 +157,7 @@ /* handle char arrays as members of a struct */ %typemap (tcl8, memberin) char [ANY] { strncpy($target,$source,$dim0 - 1); - $target[$dim0] = '\0'; + $target[$dim0 - 1] = '\0'; } %typemap(tcl8,out) ib_gid_t* { diff --git a/ibis/src/ibsm.i b/ibis/src/ibsm.i index 5979547..0e3d69b 100644 --- a/ibis/src/ibsm.i +++ b/ibis/src/ibsm.i @@ -642,7 +642,7 @@ typedef struct _ibsm_vl_arb_table } %typemap(tcl8,memberin) ibsm_node_desc_str_t[IB_NODE_DESCRIPTION_SIZE] { strncpy((char *)$target,(char *)$source,IB_NODE_DESCRIPTION_SIZE - 1); - $target[IB_NODE_DESCRIPTION_SIZE] = '\0'; + $target[IB_NODE_DESCRIPTION_SIZE - 1] = '\0'; } %typemap(tcl8,out) ibsm_node_desc_str_t[ANY] { -- 1.5.1.4 From chu11 at llnl.gov Fri Oct 10 15:56:46 2008 From: chu11 at llnl.gov (Al Chu) Date: Fri, 10 Oct 2008 15:56:46 -0700 Subject: [ofa-general] [infiniband-diags] specify -a in call to perfquery in ibclearerrors Message-ID: <1223679406.1197.236.camel@cardanus.llnl.gov> Hey Sasha, As discussed in the other thread, ibclearerrors should now specify -a instead of depend on passing port 255 to the tool. 
Al -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-specify-a-in-perfquery-call-to-clear-all-errors.patch Type: text/x-patch Size: 1176 bytes Desc: not available URL: From chu11 at llnl.gov Fri Oct 10 15:56:54 2008 From: chu11 at llnl.gov (Al Chu) Date: Fri, 10 Oct 2008 15:56:54 -0700 Subject: [ofa-general] [infiniband-diags] [trivial] tweak notes about port 255 in perfquery manpage Message-ID: <1223679414.1197.237.camel@cardanus.llnl.gov> Hey Sasha, As discussed in the other threads, using -a should be different than specifying port 255. Adjust the manpage appropriately. Al -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: 0002-tweak-perfquery-note-about-port-255.patch Type: text/x-patch Size: 906 bytes Desc: not available URL: From chu11 at llnl.gov Fri Oct 10 15:57:02 2008 From: chu11 at llnl.gov (Al Chu) Date: Fri, 10 Oct 2008 15:57:02 -0700 Subject: [ofa-general] [infiniband-diags] error out if AllPortSelect is not supported Message-ID: <1223679422.1197.238.camel@cardanus.llnl.gov> Hey Sasha, As discussed in the other thread, this patch makes perfquery error out if the user requested port 255 and AllPortSelect is not supported. Al -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 0003-if-user-inputs-port-255-error-if-AllPortSelect-flag.patch Type: text/x-patch Size: 1116 bytes Desc: not available URL: From chu11 at llnl.gov Fri Oct 10 15:57:45 2008 From: chu11 at llnl.gov (Al Chu) Date: Fri, 10 Oct 2008 15:57:45 -0700 Subject: [ofa-general] [infiniband-diags] remove single port CA AllPortSelect workaround in perfquery Message-ID: <1223679465.1197.240.camel@cardanus.llnl.gov> Hey Sasha, This removes the workaround in perfquery that gets around lack of AllPortSelect support on a CA w/ a single port. Al -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: 0004-remove-single-port-CA-AllPortSelect-workaround.patch Type: text/x-patch Size: 3132 bytes Desc: not available URL: From chu11 at llnl.gov Fri Oct 10 15:58:07 2008 From: chu11 at llnl.gov (Al Chu) Date: Fri, 10 Oct 2008 15:58:07 -0700 Subject: [ofa-general] [infiniband-diags] in perfquery if --loop_ports is specified always loop through all ports if desired Message-ID: <1223679487.1197.242.camel@cardanus.llnl.gov> Hey Sasha, As discussed in the other thread, --loop_ports will now iterate through all ports no matter what with this patch. Al -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: 0005-when-loop_ports-is-specified-always-iterate-when.patch Type: text/x-patch Size: 4449 bytes Desc: not available URL: From chu11 at llnl.gov Fri Oct 10 15:58:20 2008 From: chu11 at llnl.gov (Al Chu) Date: Fri, 10 Oct 2008 15:58:20 -0700 Subject: [ofa-general] [infiniband-diags] [trivial] perfquery code cleanup Message-ID: <1223679500.1197.244.camel@cardanus.llnl.gov> Hey Sasha, This is a code cleanup patch renaming "all" to "all_ports" for consistency to other code in perfquery. 
Al -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: 0006-rename-all-to-all_ports-for-clarity-to-optionname.patch Type: text/x-patch Size: 2350 bytes Desc: not available URL: From chu11 at llnl.gov Fri Oct 10 16:02:00 2008 From: chu11 at llnl.gov (Al Chu) Date: Fri, 10 Oct 2008 16:02:00 -0700 Subject: [ofa-general] [infiniband-diags] if perfquery -a is specified loop through ports if required and aggregate output Message-ID: <1223679720.1197.250.camel@cardanus.llnl.gov> Hey Sasha, And finally the big patch. If -a is specified and AllPortSelect is not supported (and -l isn't specified) loop through all ports and aggregate them into one output. So the patch is a tad lengthy given the manual packet parsing/counting that had to be done. I'm not aware of any libs/helper funcs in OFED that could have made this code shorter. Please let me know if there are some obvious funcs that could make this better. Also, my understanding is that in IB, counters don't wrap around. When they get to the max they stay at the max. So wrap-around checks for all of the counters is there as I added things up. Couldn't find some wrapper funcs for this in OFED already. So hopefully I'm not repeating code. Al -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From chu11 at llnl.gov Fri Oct 10 16:03:05 2008 From: chu11 at llnl.gov (Al Chu) Date: Fri, 10 Oct 2008 16:03:05 -0700 Subject: [ofa-general] [infiniband-diags] if perfquery -a is specified loop through ports if required and aggregate output In-Reply-To: <1223679720.1197.250.camel@cardanus.llnl.gov> References: <1223679720.1197.250.camel@cardanus.llnl.gov> Message-ID: <1223679785.1197.252.camel@cardanus.llnl.gov> Oops, forgot to attach the patch. Here it is. 
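A minimal sketch of the saturating sum described in the message above, where a counter that reaches its maximum stays there instead of wrapping. This is not the code from the attached patch (the attachment was scrubbed from the archive). The name sat_add and its bits parameter are hypothetical, and the width is passed in on the assumption that the counters being aggregated are not all the same size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Add two samples of a counter that is `bits` wide (1..32 assumed),
 * pinning the result at the field's maximum instead of wrapping.
 * Hypothetical helper for illustration; not taken from perfquery.
 */
static uint32_t sat_add(uint32_t a, uint32_t b, unsigned int bits)
{
	uint32_t max;
	uint64_t sum;

	assert(bits >= 1 && bits <= 32);

	/* Largest value the field can hold; avoid shifting by 32. */
	max = (bits == 32) ? UINT32_MAX : ((UINT32_C(1) << bits) - 1);

	/* Do the addition in 64 bits so the sum itself cannot overflow. */
	sum = (uint64_t)a + b;

	return sum > max ? max : (uint32_t)sum;
}
```

For an 8-bit counter, sat_add(200, 100, 8) gives 255 and sat_add(3, 4, 8) gives 7. Doing the addition in a wider type keeps the overflow check to a single comparison per field.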
Al On Fri, 2008-10-10 at 16:02 -0700, Al Chu wrote: > Hey Sasha, > > And finally the big patch. If -a is specified and AllPortSelect is not > supported (and -l isn't specified) loop through all ports and aggregate > them into one output. > > So the patch is a tad lengthy given the manual packet parsing/counting > that had to be done. I'm not aware of any libs/helper funcs in OFED > that could have made this code shorter. Please let me know if there are > some obvious funcs that could make this better. > > Also, my understanding is that in IB, counters don't wrap around. When > they get to the max they stay at the max. So wrap-around checks for all > of the counters is there as I added things up. Couldn't find some > wrapper funcs for this in OFED already. So hopefully I'm not repeating > code. > > Al > -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: 0007-aggregate-port-counters-for-single-output-under-al.patch Type: text/x-patch Size: 10748 bytes Desc: not available URL: From chu11 at llnl.gov Fri Oct 10 16:35:43 2008 From: chu11 at llnl.gov (Al Chu) Date: Fri, 10 Oct 2008 16:35:43 -0700 Subject: [ofa-general] [infiniband-diags] in perfquery if --loop_ports is specified always loop through all ports if desired In-Reply-To: <1223679487.1197.242.camel@cardanus.llnl.gov> References: <1223679487.1197.242.camel@cardanus.llnl.gov> Message-ID: <1223681743.1197.254.camel@cardanus.llnl.gov> Oops, noticed I typed something in the manpage. New patch. Al On Fri, 2008-10-10 at 15:58 -0700, Al Chu wrote: > Hey Sasha, > > As discussed in the other thread, --loop_ports will now iterate through > all ports no matter what with this patch. 
> > Al > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: 0005-when-loop_ports-is-specified-always-iterate-when.patch Type: text/x-patch Size: 4450 bytes Desc: not available URL: From vlad at lists.openfabrics.org Sat Oct 11 03:20:05 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 11 Oct 2008 03:20:05 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20081011-0200 daily build status Message-ID: <20081011102006.2E342E60FEC@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with
linux-2.6.25 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: Build failed on ppc64 with linux-2.6.24 Log: /home/vlad/tmp/ofa_1_4_kernel-20081011-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c: In function 'ehca_poll_eqs': /home/vlad/tmp/ofa_1_4_kernel-20081011-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:942: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type /home/vlad/tmp/ofa_1_4_kernel-20081011-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.c:946: warning: passing argument 1 of 'local_irq_save_ptr' from incompatible pointer type make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081011-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca/ehca_main.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081011-0200_linux-2.6.24_ppc64_check/drivers/infiniband/hw/ehca] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081011-0200_linux-2.6.24_ppc64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081011-0200_linux-2.6.24_ppc64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/ppc64/linux-2.6.24' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- 
From sashak at voltaire.com Sat Oct 11 04:23:19 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 11 Oct 2008 13:23:19 +0200 Subject: [ofa-general] ***SPAM*** [PATCH] opensm/osm_config.h: generated OSM_DEBUG macro Message-ID: <20081011112319.GH6947@sashak.voltaire.com> In accordance with requested mode ./configure will generate OSM_DEBUG macro in files config.h and osm_config.h. This will be defined as '1' when debug mode was enabled and undefined otherwise. It can be used by third parties (such as ibutils and plugins) to know about OpenSM build mode. Signed-off-by: Sasha Khapyorsky --- opensm/configure.in | 3 +++ opensm/include/opensm/osm_config.h.in | 3 +++ 2 files changed, 6 insertions(+), 0 deletions(-) diff --git a/opensm/configure.in b/opensm/configure.in index 680e6a0..bf24fcd 100644 --- a/opensm/configure.in +++ b/opensm/configure.in @@ -71,6 +71,9 @@ AC_ARG_ENABLE(debug, [ --enable-debug Turn on debugging], no) debug=false ;; *) AC_MSG_ERROR(bad value ${enableval} for --enable-debug) ;; esac],debug=false) +if test x$debug = xtrue ; then + AC_DEFINE(OSM_DEBUG, 1, [ define 1 if OpenSM build is in a debug mode ]) +fi AM_CONDITIONAL(DEBUG, test x$debug = xtrue) AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presence of ib libraries], diff --git a/opensm/include/opensm/osm_config.h.in b/opensm/include/opensm/osm_config.h.in index 6781af7..b12006f 100644 --- a/opensm/include/opensm/osm_config.h.in +++ b/opensm/include/opensm/osm_config.h.in @@ -10,6 +10,9 @@ #ifndef _OSM_CONFIG_H_ #define _OSM_CONFIG_H_ +/* define 1 if OpenSM build is in a debug mode */ +#undef OSM_DEBUG + /* Define as 1 if you want Dual Sided RMPP Support */ #undef DUAL_SIDED_RMPP -- 1.6.0.1.196.g01914 From sashak at voltaire.com Sat Oct 11 11:43:14 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 11 Oct 2008 20:43:14 +0200 Subject: [ofa-general] [PATCH] opensm/vendor: replace osm_vendor_select.h by osm_config.h Message-ID: 
<20081011184314.GI6947@sashak.voltaire.com> osm_config.h and config.h have mad interface already defined. No need to redefine in osm_vendor_select.h. Signed-off-by: Sasha Khapyorsky --- opensm/include/Makefile.am | 1 - opensm/include/vendor/osm_vendor.h | 4 +- opensm/include/vendor/osm_vendor_select.h | 70 ----------------------------- opensm/libvendor/Makefile.am | 1 - opensm/libvendor/osm_vendor_al.c | 2 - opensm/libvendor/osm_vendor_ibumad.c | 2 - opensm/libvendor/osm_vendor_mtl.c | 2 - opensm/libvendor/osm_vendor_test.c | 2 - opensm/libvendor/osm_vendor_umadt.c | 2 - 9 files changed, 2 insertions(+), 84 deletions(-) delete mode 100644 opensm/include/vendor/osm_vendor_select.h diff --git a/opensm/include/Makefile.am b/opensm/include/Makefile.am index f1b4504..1df1abc 100644 --- a/opensm/include/Makefile.am +++ b/opensm/include/Makefile.am @@ -21,7 +21,6 @@ EXTRA_DIST = \ $(srcdir)/vendor/osm_vendor_mlx_transport.h \ $(srcdir)/vendor/osm_vendor_mlx_inout.h \ $(srcdir)/vendor/osm_vendor_mtl_hca_guid.h \ - $(srcdir)/vendor/osm_vendor_select.h \ $(srcdir)/vendor/osm_vendor_test.h \ $(srcdir)/vendor/osm_vendor_ts.h \ $(srcdir)/vendor/osm_vendor_mlx_txn.h \ diff --git a/opensm/include/vendor/osm_vendor.h b/opensm/include/vendor/osm_vendor.h index 747c090..4d0ae4c 100644 --- a/opensm/include/vendor/osm_vendor.h +++ b/opensm/include/vendor/osm_vendor.h @@ -42,7 +42,7 @@ this is the generic include file which includes the proper vendor specific file */ -#include +#include #if defined( OSM_VENDOR_INTF_TEST ) #include @@ -67,5 +67,5 @@ #include #elif #error No MAD Interface selected! -#error Choose an interface in osm_vendor_select.h +#error Choose an interface in osm_config.h #endif diff --git a/opensm/include/vendor/osm_vendor_select.h b/opensm/include/vendor/osm_vendor_select.h deleted file mode 100644 index a7d12dc..0000000 --- a/opensm/include/vendor/osm_vendor_select.h +++ /dev/null @@ -1,70 +0,0 @@ -/* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. 
- * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. - * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - */ - -/* - * Abstract: - * Include file that defines which vendor files to compile. - */ - -#ifndef _OSM_VENDOR_SELECT_H_ -#define _OSM_VENDOR_SELECT_H_ - -///////////////////////////////////////////////////// -// -// MAD INTERFACE SELECTION -// -///////////////////////////////////////////////////// - -/* - TEST and UMADT must be specified in the 'make' command line, - with VENDOR=test or VENDOR=umadt. 
-*/ -#ifndef OSM_VENDOR_INTF_OPENIB -#ifndef OSM_VENDOR_INTF_TEST -#ifndef OSM_VENDOR_INTF_UMADT -#ifndef OSM_VENDOR_INTF_MTL -#ifndef OSM_VENDOR_INTF_TS -#ifndef OSM_VENDOR_INTF_SIM -#ifndef OSM_VENDOR_INTF_AL -#define OSM_VENDOR_INTF_OPENIB -#endif /* AL */ -#endif /* TS */ -#endif /* SIM */ -#endif /* MTL */ -#endif /* UMADT */ -#endif /* TEST */ -#endif /* OPENIB */ - -#endif /* _OSM_VENDOR_SELECT_H_ */ diff --git a/opensm/libvendor/Makefile.am b/opensm/libvendor/Makefile.am index 63282f4..22f7a08 100644 --- a/opensm/libvendor/Makefile.am +++ b/opensm/libvendor/Makefile.am @@ -23,7 +23,6 @@ osmvendor_api_version=$(shell grep LIBVERSION= $(srcdir)/libosmvendor.ver | sed COMM_HDRS= $(srcdir)/../include/vendor/osm_vendor_api.h \ $(srcdir)/../include/vendor/osm_vendor.h \ - $(srcdir)/../include/vendor/osm_vendor_select.h \ $(srcdir)/../include/vendor/osm_vendor_sa_api.h if OSMV_OPENIB diff --git a/opensm/libvendor/osm_vendor_al.c b/opensm/libvendor/osm_vendor_al.c index 6f8b690..0db6880 100644 --- a/opensm/libvendor/osm_vendor_al.c +++ b/opensm/libvendor/osm_vendor_al.c @@ -49,8 +49,6 @@ # include #endif /* HAVE_CONFIG_H */ -#include - #ifdef OSM_VENDOR_INTF_AL #include diff --git a/opensm/libvendor/osm_vendor_ibumad.c b/opensm/libvendor/osm_vendor_ibumad.c index 96fd01b..ab18623 100644 --- a/opensm/libvendor/osm_vendor_ibumad.c +++ b/opensm/libvendor/osm_vendor_ibumad.c @@ -48,8 +48,6 @@ # include #endif /* HAVE_CONFIG_H */ -#include - #ifdef OSM_VENDOR_INTF_OPENIB #include diff --git a/opensm/libvendor/osm_vendor_mtl.c b/opensm/libvendor/osm_vendor_mtl.c index 7f6c0cd..e81bb8e 100644 --- a/opensm/libvendor/osm_vendor_mtl.c +++ b/opensm/libvendor/osm_vendor_mtl.c @@ -37,8 +37,6 @@ # include #endif /* HAVE_CONFIG_H */ -#include - #ifdef OSM_VENDOR_INTF_MTL #include diff --git a/opensm/libvendor/osm_vendor_test.c b/opensm/libvendor/osm_vendor_test.c index c3b3e3d..67fc0e2 100644 --- a/opensm/libvendor/osm_vendor_test.c +++ b/opensm/libvendor/osm_vendor_test.c @@ 
-46,8 +46,6 @@ # include #endif /* HAVE_CONFIG_H */ -#include - #ifdef OSM_VENDOR_INTF_TEST #include diff --git a/opensm/libvendor/osm_vendor_umadt.c b/opensm/libvendor/osm_vendor_umadt.c index 0a8aea0..8237f3d 100644 --- a/opensm/libvendor/osm_vendor_umadt.c +++ b/opensm/libvendor/osm_vendor_umadt.c @@ -49,8 +49,6 @@ # include #endif /* HAVE_CONFIG_H */ -#include - #ifdef OSM_VENDOR_INTF_UMADT #include -- 1.6.0.1.196.g01914 From sashak at voltaire.com Sat Oct 11 11:55:06 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 11 Oct 2008 20:55:06 +0200 Subject: [ofa-general] Re: [infiniband-diags] specify -a in call to perfquery in ibclearerrors In-Reply-To: <1223679406.1197.236.camel@cardanus.llnl.gov> References: <1223679406.1197.236.camel@cardanus.llnl.gov> Message-ID: <20081011185506.GJ6947@sashak.voltaire.com> On 15:56 Fri 10 Oct , Al Chu wrote: > Hey Sasha, > > As discussed in the other thread, ibclearerrors should now specify -a > instead of depend on passing port 255 to the tool. > > Al > > -- > Albert Chu > chu11 at llnl.gov > Computer Scientist > High Performance Systems Division > Lawrence Livermore National Laboratory > From 2466cb327ff8abc319a203d72f9a8fb5c9cabbaa Mon Sep 17 00:00:00 2001 > From: Albert Chu > Date: Thu, 9 Oct 2008 16:38:22 -0700 > Subject: [PATCH] specify -a in perfquery call to clear all errors > > > Signed-off-by: Albert Chu Applied. Thanks. Sasha From sashak at voltaire.com Sat Oct 11 11:55:24 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 11 Oct 2008 20:55:24 +0200 Subject: [ofa-general] Re: [infiniband-diags] [trivial] tweak notes about port 255 in perfquery manpage In-Reply-To: <1223679414.1197.237.camel@cardanus.llnl.gov> References: <1223679414.1197.237.camel@cardanus.llnl.gov> Message-ID: <20081011185524.GK6947@sashak.voltaire.com> On 15:56 Fri 10 Oct , Al Chu wrote: > Hey Sasha, > > As discussed in the other threads, using -a should be different than > specifying port 255. 
Adjust the manpage appropriately. > > Al > > -- > Albert Chu > chu11 at llnl.gov > Computer Scientist > High Performance Systems Division > Lawrence Livermore National Laboratory > From 539c1e79883ab7df609aa8982a33582025de3fb3 Mon Sep 17 00:00:00 2001 > From: Albert Chu > Date: Thu, 9 Oct 2008 16:38:25 -0700 > Subject: [PATCH] tweak perfquery note about port 255 > > > Signed-off-by: Albert Chu Applied. Thanks. Sasha From sashak at voltaire.com Sat Oct 11 12:00:04 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 11 Oct 2008 21:00:04 +0200 Subject: [ofa-general] Re: [infiniband-diags] error out if AllPortSelect is not supported In-Reply-To: <1223679422.1197.238.camel@cardanus.llnl.gov> References: <1223679422.1197.238.camel@cardanus.llnl.gov> Message-ID: <20081011190004.GL6947@sashak.voltaire.com> On 15:57 Fri 10 Oct , Al Chu wrote: > Hey Sasha, > > As discussed in the other thread, this patch makes perfquery error out > if the user requested port 255 and AllPortSelect is not supported. > > Al > > -- > Albert Chu > chu11 at llnl.gov > Computer Scientist > High Performance Systems Division > Lawrence Livermore National Laboratory > From 7802b9ade2270e5c6f0b7606f9243bbe2f551c29 Mon Sep 17 00:00:00 2001 > From: Albert Chu > Date: Thu, 9 Oct 2008 16:38:30 -0700 > Subject: [PATCH] if user inputs port 255 error if AllPortSelect flag not supported > > > Signed-off-by: Albert Chu Applied. Thanks. Sasha From sashak at voltaire.com Sat Oct 11 13:03:22 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 11 Oct 2008 22:03:22 +0200 Subject: [ofa-general] [infiniband-diags] if perfquery -a is specified loop through ports if required and aggregate output In-Reply-To: <1223679785.1197.252.camel@cardanus.llnl.gov> References: <1223679720.1197.250.camel@cardanus.llnl.gov> <1223679785.1197.252.camel@cardanus.llnl.gov> Message-ID: <20081011200322.GM6947@sashak.voltaire.com> On 16:03 Fri 10 Oct , Al Chu wrote: > Oops, forgot to attach the patch. Here it is.
> > Al > > On Fri, 2008-10-10 at 16:02 -0700, Al Chu wrote: > > Hey Sasha, > > > > And finally the big patch. If -a is specified and AllPortSelect is not > > supported (and -l isn't specified) loop through all ports and aggregate > > them into one output. > > > > So the patch is a tad lengthy given the manual packet parsing/counting > > that had to be done. I'm not aware of any libs/helper funcs in OFED > > that could have made this code shorter. Please let me know if there are > > some obvious funcs that could make this better. > > > > Also, my understanding is that in IB, counters don't wrap around. When > > they get to the max they stay at the max. So wrap-around checks for all > > of the counters is there as I added things up. Couldn't find some > > wrapper funcs for this in OFED already. So hopefully I'm not repeating > > code. > > > > Al > > > -- > Albert Chu > chu11 at llnl.gov > Computer Scientist > High Performance Systems Division > Lawrence Livermore National Laboratory > From 5864bf543181dda556e41bfb6b7c2780faeb3035 Mon Sep 17 00:00:00 2001 > From: Albert Chu > Date: Thu, 9 Oct 2008 16:39:19 -0700 > Subject: [PATCH] aggregate port counters for single output under --all_ports > > > Signed-off-by: Albert Chu All applied. Thanks.
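For readers following the thread, the aggregation described in the quoted text (sum per-port counters, but hold the total at the field's maximum instead of wrapping) might look roughly like the sketch below. It is an illustration only, not the code from the applied patch: the names saturating_add and aggregate_ports and the caller-supplied max parameter are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the perfquery patch: add one
 * counter value to a running total and clamp the result at `max`
 * (the largest value the counter field can hold) instead of wrapping.
 */
static uint64_t saturating_add(uint64_t total, uint64_t value, uint64_t max)
{
	/* Already saturated, or adding `value` would reach or pass `max`. */
	if (total >= max || value >= max - total)
		return max;
	return total + value;
}

/*
 * Hypothetical helper: aggregate the same counter read from `n` ports
 * into one value, saturating at `max`.
 */
static uint64_t aggregate_ports(const uint64_t *per_port, size_t n,
				uint64_t max)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total = saturating_add(total, per_port[i], max);
	return total;
}
```

With a 16-bit field (max 0xFFFF), summing 40000, 20000 and 10000 across three ports yields 0xFFFF rather than a wrapped value.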
Sasha From sashak at voltaire.com Sat Oct 11 14:51:13 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 11 Oct 2008 23:51:13 +0200 Subject: [ofa-general] ***SPAM*** [PATCH] opensm: update Release Notes for OpenSM version 3.2 Message-ID: <20081011215113.GN6947@sashak.voltaire.com> Update Release Notes for OpenSM version 3.2 Signed-off-by: Sasha Khapyorsky --- opensm/doc/opensm_release_notes-3.2.txt | 294 ++++++++++++++++++------------- 1 files changed, 175 insertions(+), 119 deletions(-) diff --git a/opensm/doc/opensm_release_notes-3.2.txt b/opensm/doc/opensm_release_notes-3.2.txt index 7728849..68178c4 100644 --- a/opensm/doc/opensm_release_notes-3.2.txt +++ b/opensm/doc/opensm_release_notes-3.2.txt @@ -17,104 +17,161 @@ This document includes the following sections: dependencies) 2 Known Issues And Limitations 3 Unsupported IB compliance statements -4 Major Bug Fixes +4 Bug Fixes 5 Main Verification Flows 6 Qualified software stacks and devices 1.1 Major New Features -* QoS manager (experimental) - This QoS manager implementation is in accordance with IBA QoS Annex. - Highly configurable QoS Policy is parsed from OpenSM QoS policy file. - Valid QoS parameters will be reported in SA PathRecord and - MultiPathRecord. In addition simple QoS levels per ULPs configuration - is supported too. - -* Performance Manager - When enabled it collects a fabric port counters and able to log it or - to pass to external program via event plugin interface. It handles - counters overflow, supports LID/QP redirection and is able to work - when OpenSM is in master, standby, and inactive states. - -* Dimension Order routing (DOR) algorithm - DOR Unicast routing algorithm - based on the Min Hop algorithm, but - avoids port equalization except for redundant links between the - same two switches. This provides deadlock free routes for hypercubes - when the fabric is cabled as a hypercube and for meshes when cabled - as a mesh (see details in OpenSM man page). 
- -* Routing improvements - Speedup the current routing algorithms default MinHops, Up/Down and - LASH and lid matrix generation. Fat Tree routing engine is able to work - with not pure fat free topology. - -* Multiple IB routers support - OpenSM now able to keep configurable subnet prefix to router table. - SA will report path to this routers when SA PathRecord was issued with - non-local DGID. - -* Node map - This is possible to name nodes in this config file. Those names will be - used for logging and by QoS configuration. - -* PKey index support - Proper support for PKey index in GSI queries. - -* Incremental LFTs, PKey, SL2VL, and VLarbitration table updates - OpenSM will only fetch those tables in first heavy sweep and then - will maintain this internally. - -* Fast port and switch detector - When port and/or switch was externally reset and it was fast so sweep - doesn't find this device as disconnected OpenSM will detect this by - changed port states and handle accordingly. - -* Duplicated GUIDs/port moving detector - OpenSM will be able to detect port moving during a fabric discovery - and will not report duplicated GUIDs in this case. - -* Multicast rerouting speedup - Now OpenSM will calculate and setup multicast forwarding tables for - all altered multicast groups and not for each one. - -* Event plugin API - OpenSM allows to load dynamically various plugin modules. - -* Many generic improvements +* Routing Chaining + Routing chaining is the ability to configure the order in which routing + algorithms are applied in opensm, i.e. '-R ftree,updn,minhop' - try + using ftree routing. If ftree fails, try updn. If updn fails, try + minhop. + +* Solicited Node Multicast addresses consolidation + When this mode is used (enabled with --consolidate_ipv6_snm_req option) + OpenSM will map all IPv6 Solicited Node Multicast address join requests + into a single Multicast group with address ff10:601b::1:ff00:0. In this + way limited MLID space is saved. 
The feature is very useful with large + (~> 1024 nodes) clusters. + +* OpenSM sweep state machine rework + Huge and buggy OpenSM sweep state machine was fully rewritten in safer + and more effective synchronous manner. + +* Multi lid routing balancing for updn/minhop routing algorithms + When LMC > 0 is used OpenSM will ensure to generate routing paths via + different switches and when possible chassis. + +* Preserve base lid routes when LMC > 0 + When LMC > 0 is used OpenSM will preserve routing paths for base lids + as it would be with LMC = 0. In this way traffic on each LID level is + not affected by LMC changes. + +* Ordered routing paths balancing + This adds ability to predefine the port order in which routing paths + balancing is performed by OpenSM. Helps to improve performance + dramatically (40-50%) for applications with known communication + pattern. Activated with --guid_routing_order_file command line option. + +* Unified OpenSM configuration + Now there is "conventional" config file instead of hidden option cache + file (opensm.opts). OpenSM will find this in a default place (consult + man page for exact value) or the file name can be specified with '-F' + command line option. Also there is an option ('-c') to generate config + file template. + +* Query remote SMs during light sweep + Master OpenSM will query remote standby SMs periodically to catch its + possible state changes and react accordingly (as required by IBA spec). + +* Predefined port ids for Up/Down algorithm + This is useful as Up/Down fine tuning tool - the algorithm will use + predefined port IDs instead of GUIDs for its decision about direction. + Activated with --ids_guid_file command line option. + +* Improved plugin API version 2. + Now OpenSM will provide to plugins the access to all data structures. + This make it possible to implement powerful multi purpose plugins. 
All + OpenSM header files are installed now and specific configuration/build + options are exported via generated osm_config.h header file. + +* Many code improvements, optimizations and cleanups + +* Automatic daily snapshots generation. + This is is not a "feature", but simplifies the access to recent OpenSM + bits. 1.2 Minor New Features: -* Daemon mode can be activated with -B option. +* Cleanup cl_qlock_pool memory allocator - speedup memory allocations -* Support multiple scopes for IPoIB multicast groups in partition config. +* Support for configurable (via OSM_UMAD_MAX_PENDING environment variable) + size of pending MADs pool. -* Loopback connection handling - Loopback connection is not interpreted as duplicated GUID anymore. +* Set packet life time to subnet timeout option rather than default -* Connect root nodes option for Up/Down routing engine. - When this option is specified Up/Down will create routing paths between - its root nodes. +* Enforce routing paths rebalancing on switch reconnection -* Dump and log filenames changed from osm* to opensm*. +* In Up/Down routing algorithm compare GUID values in host byte order -* Support loopback console - Socket console with only local access. +* Add 'switchbalance' and 'lidbalance' commands for OpenSM console -* Configurable config directory (the default value is /etc/opensm) and - configurable default values of OpenSM config filenames. +* Respond to new trap 144 node description update flag -* Add option for force SDR link speed - Add option to opensm.opts to force link speed. Currently, only forcing - to SDR link speed is supported. This option is not supported as a - command line option. +* Add '--connect_roots' command line options. This preserves connectivity + between root nodes in Up/Down routing algorithm -* Better packaging - Building and RPM packaging were improved and simplified. 
+* Setting SL in the IPoIB MCast groups in accordance with QoS policy -* Handle "babbling" ports - When a babbling port (port which causes a frequent trap generation) is - detected, OpenSM will disable the port which should terminate the trap - storm. +* Dump auto detected root node guids in Up/Down routing algorithm + +* Unify OpenSM dumpers code + +* Unify various guid files parsers - add generic nodenamemap style parser + +* When root node guids were provided in file update the list on each + Up/Down run + +* During ./configure show values of configuration dirs and files + +* Make prefix routes config file name configurable + +* Add a Performance Manager HOWTO to the docs and the dist + +* Support separate SA and SM keys as clarified in IBA 1.2.1 + +* Remove AM_MAINTAINER_MODE in ./configure + +* Make vendor type OSM_VENDOR_INTF_OPENIB (libibumad) to be default + +* Build osm_perfmgr_db.* content only when PerfMgr is enabled. + +* Move PerfMgr event_db_dump_file to common OpenSM dump dir + +* Allow space separated strings as values in OpenSM config + +* Support for multiple event plugins + +* Add '--version' command line option + +* Add '--create-config ' command line option + +* Speedup and simplify logging code + +* Speedup multicast processing in SA DB + +* In log messages convert unicast LIDs from hex to decimal format and + GIDs from hex to IPv6 address format + +* Handle all possible ports in "ignore-guids" file + +* Add 'reroute' console command + +* Remove many install-exec-hook from Makefiles + +* Some cleanups in LASH routing algorithm code + +* In Makefiles remove -rpath and explicit -lpthread, -ldl from LDFLAGS + (move to configurator) + +* Install all OpenSM header files + +* Improve locking in SM Info receiver + +* Add new OSM_EVENT_ID_SUBNET_UP event for plugins + +* Redo lex and yacc files generation in conventional way + +* Add a missing Node Description check on light sweep. 
+ +* Move vendor specific compilation defines from command to generated + config.h file + +* Provide useful error message when log file opening fails + +* Add generated osm_config.h file with OpenSM specific defines 1.3 Library API Changes @@ -209,76 +266,75 @@ information regarding each compliance statement. * C15-0.1.14 (Services): Provide means to associate service name and ServiceKeys. -4 Major Bug Fixes +4 Bug Fixes ----------------- -The following is a list of bugs that were fixed. Note that other less critical -or visible bugs were also fixed. - -* osm_ucast_ftree.c: do load-leveling of non-CN routes +4.1 Major Bug Fixes -* osm_ucast_ftree.c: ignore port 0 and loopbacks on switches +* Set SA attribute offset to 0 when no records are returned -* lash: fix possible segfault in osm_get_lash_sl() +* Send trap 64 only after new ports are in ACTIVE state. -* osm_ucast_ftree.c: fixing coredump in fat-tree routing +* Fix in sending client reregistration bit -* osm_sa_slvl_record: fix overflow crash +* Fix default OpenSM SM Key byte order -* Break multicast rerouting requests processing when heavy sweep is - scheduled. 
+* Fix in sending Multicast groups creation/deletion notification (Traps + 66,67) -* updn: report fallback properly +* Don't startup automatically on SuSE based systems -* Fix incorrect identification of routing engine used +4.2 Other Bug Fixes -* Don't zero base LID when invalid value is received +* opensm/osm_console.c: fix seg fault when running "portstatus ca" in + the console -* lash: fix wrong allocation size +* opensm: fix potential core dumps where osm_node_get_physp_ptr can + return NULL -* Fixing broken logic in 'process world' part of LinkRecord processing +* opensm/osm_mcast_mgr: limit spanning tree creation recursion to value + of max hops (64) -* Fix lmc_mask bit order in osm_sa_link_record.c +* opensm: switch LFTs incremental update fix -* Adding missing comparison by to_lid/from_lid in LinkRecord processing +* opensm/osm_state_mgr.c: fix segmentation fault -* Broken logic when scanning subnet for PIR request +* opensm: eliminate some potential NULL pointer dereferences -* No interactive games in daemon mode +* opensm/osm_console.c: fix guid parsing -* Fixing memory leak in node description +* opensm: fix off by 1 issue with max_lid and max_multicat_lid_ho -* Fix PortInfo update issues for switch port 0 +* opensm: fix potentially wrong port_guid initialization -* Changed method_mask type in user_mad interface in accordance with - kernel ABI +* opensm/configure.in: fix wrong HAVE_DEFAULT_OPENSM_CONFIG_FILE define + generation -* Use umad_get_issm_path() in osm_vendor_set_sm() +* opensm: fix snprintf() usage -* Report message fix +* opensm/osm_sa_lft_record: validate LFT block number -* Uninitialized variables usage fix +* opensm/osm_sa_lft_record: pass block parameter in host byte order -* osm_ucast_ftree.c: Possible NULL ptr seg fault +* opensm/include/Makefile.am: don't duplicate header files in EXTRA_DIST -* osm_mcast_mgr.c: Possible NULL ptr seg fault +* opensm/osm_sa_class_port_info.c: fix over bound array access -* TrapRepress was failing for 
mkey != 0 +* osmtest/osmt_service.c: fix over bound array access -* IB_PR_COMPMASK was used in MPR +* osmtest: fix qpn encoding in osmtest_informinfo_request() -* Set hop limit when creating ipoib multicast groups +* opensm/osm_vendor_mlx_sa.c: handling attribute offset of 0 -* Fix outstanding mad counters tracking on the error paths. +* opensm: fix segfault corner case when osm_console_init fails -* Report new ports before handover mastership +* opensm/console: close console socket on cleanup path -* Fix opvls and neighbormtu when remote port invalid. +* opensm/osm_ucast_lash: fix buffer overflow -* Bug in coding trying to set vl_arb_high_limit when PortInfo.base_lid - was still zero. +* opensm: fix broken IPv6 SNM consolidation code -* Protect SMInfo response against port moving issue. +* Other less critical or visible bugs were also fixed. 5 Main Verification Flows ------------------------- -- 1.6.0.1.196.g01914 From sashak at voltaire.com Sat Oct 11 16:31:03 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 12 Oct 2008 01:31:03 +0200 Subject: [ofa-general] [PATCH] opensm: Unify options listing in usage() message Message-ID: <20081011233103.GO6947@sashak.voltaire.com> This unify OpenSM options listing in the usage message (shown on 'opensm --help'). Options listing format is: --long_option, -o Option description text. 
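As a rough illustration of the listing layout described above (long option first, then the short option, with the description indented on the following line), here is a minimal sketch. The helper format_option is hypothetical and is not part of opensm/opensm/main.c, where show_usage() uses plain printf() calls; the indentation width is also an assumption.

```c
#include <stdio.h>

/*
 * Hypothetical helper, not part of opensm/opensm/main.c: writes one usage
 * entry with the long option first, then the short option, followed by
 * the indented description and a blank line. Returns what snprintf()
 * returns.
 */
static int format_option(char *buf, size_t len, const char *long_opt,
			 char short_opt, const char *desc)
{
	return snprintf(buf, len, "--%s, -%c\n          %s\n\n",
			long_opt, short_opt, desc);
}
```

Keeping every entry in one layout means the help text can be scanned by long option name without having to check two separate lines per option.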
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/main.c | 126 +++++++++++++++++++------------------------------- 1 files changed, 48 insertions(+), 78 deletions(-) diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index 65adb9a..81b0a01 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -126,24 +126,21 @@ static void show_usage(void) printf("\n------- OpenSM - Usage and options ----------------------\n"); printf("Usage: opensm [options]\n"); printf("Options:\n"); - printf("--version\n" - " Prints OpenSM version and exits.\n\n"); - printf("-F , --config \n" + printf("--version\n Prints OpenSM version and exits.\n\n"); + printf("--config, -F \n" " The name of the OpenSM config file. When not specified\n" " " OSM_DEFAULT_CONFIG_FILE " will be used (if exists).\n\n"); - printf("-c , --create-config \n" + printf("--create-config, -c \n" " OpenSM will dump its configuration to the specified file and exit.\n" " This is a way to generate OpenSM configuration file template.\n\n"); - printf("-g \n" - "--guid \n" + printf("--guid, -g \n" " This option specifies the local port GUID value\n" " with which OpenSM should bind. OpenSM may be\n" " bound to 1 port at a time.\n" " If GUID given is 0, OpenSM displays a list\n" " of possible port GUIDs and waits for user input.\n" " Without -g, OpenSM tries to use the default port.\n\n"); - printf("-l \n" - "--lmc \n" + printf("--lmc, -l \n" " This option specifies the subnet's LMC value.\n" " The number of LIDs assigned to each port is 2^LMC.\n" " The LMC value must be in the range 0-7.\n" @@ -153,155 +150,131 @@ static void show_usage(void) " ports, i.e. multiple interconnects between switches.\n" " Without -l, OpenSM defaults to LMC = 0, which allows\n" " one path between any two ports.\n\n"); - printf("-p \n" - "--priority \n" + printf("--priority, -p \n" " This option specifies the SM's PRIORITY.\n" " This will effect the handover cases, where master\n" " is chosen by priority and GUID. 
Range goes\n" " from 0 (lowest priority) to 15 (highest).\n\n"); - printf("-smkey \n" + printf("--smkey, -k \n" " This option specifies the SM's SM_Key (64 bits).\n" " This will effect SM authentication.\n" " Note that OpenSM version 3.2.1 and below used the\n" " default value '1' in a host byte order, it is fixed\n" " now but you may need this option to interoperate\n" - " with old OpenSM running on a little endian machine.\n" - "\n"); - - printf("-r\n" - "--reassign_lids\n" + " with old OpenSM running on a little endian machine.\n\n"); + printf("--reassign_lids, -r\n" " This option causes OpenSM to reassign LIDs to all\n" " end nodes. Specifying -r on a running subnet\n" " may disrupt subnet traffic.\n" " Without -r, OpenSM attempts to preserve existing\n" " LID assignments resolving multiple use of same LID.\n\n"); - printf("-R\n" - "--routing_engine \n" + printf("--routing_engine, -R \n" " This option chooses routing engine(s) to use instead of default\n" " Min Hop algorithm. Multiple routing engines can be specified\n" " separated by commas so that specific ordering of routing\n" " algorithms will be tried if earlier routing engines fail.\n" " Supported engines: updn, file, ftree, lash, dor\n\n"); - printf("-z\n" - "--connect_roots\n" + printf("--connect_roots, -z\n" " This option enforces a routing engine (currently\n" " up/down only) to make connectivity between root switches\n" " and in this way be IBA compliant. 
In many cases,\n" " this can violate \"pure\" deadlock free algorithm, so\n" " use it carefully.\n\n"); - printf("-M\n" - "--lid_matrix_file \n" + printf("--lid_matrix_file, -M \n" " This option specifies the name of the lid matrix dump file\n" " from where switch lid matrices (min hops tables will be\n" " loaded.\n\n"); - printf("-U\n" - "--lfts_file \n" + printf("--lfts_file, -U \n" " This option specifies the name of the LFTs file\n" " from where switch forwarding tables will be loaded.\n\n"); - printf("-S\n" - "--sadb_file \n" + printf("--sadb_file, -S \n" " This option specifies the name of the SA DB dump file\n" " from where SA database will be loaded.\n\n"); - printf("-a\n" - "--root_guid_file \n" + printf("--root_guid_file, -a \n" " Set the root nodes for the Up/Down or Fat-Tree routing\n" " algorithm to the guids provided in the given file (one\n" " to a line)\n" "\n"); - printf("-u\n" - "--cn_guid_file \n" + printf("--cn_guid_file, -u \n" " Set the compute nodes for the Fat-Tree routing algorithm\n" - " to the guids provided in the given file (one to a line)\n" - "\n"); - printf("-m\n" - "--ids_guid_file \n" + " to the guids provided in the given file (one to a line)\n\n"); + printf("--ids_guid_file, -m \n" " Name of the map file with set of the IDs which will be used\n" " by Up/Down routing algorithm instead of node GUIDs\n" - " (format: per line)\n"); - printf("-X\n" - "--guid_routing_order_file \n" + " (format: per line)\n\n"); + printf("--guid_routing_order_file, -X \n" " Set the order port guids will be routed for the MinHop\n" " and Up/Down routing algorithms to the guids provided in the\n" " given file (one to a line)\n\n"); - printf("-o\n" - "--once\n" + printf("--once, -o\n" " This option causes OpenSM to configure the subnet\n" " once, then exit. Ports remain in the ACTIVE state.\n\n"); - printf("-s \n" - "--sweep \n" + printf("--sweep, -s \n" " This option specifies the number of seconds between\n" " subnet sweeps. 
Specifying -s 0 disables sweeping.\n" " Without -s, OpenSM defaults to a sweep interval of\n" " 10 seconds.\n\n"); - printf("-t \n" - "--timeout \n" + printf("--timeout, -t \n" " This option specifies the time in milliseconds\n" " used for transaction timeouts.\n" " Specifying -t 0 disables timeouts.\n" " Without -t, OpenSM defaults to a timeout value of\n" " 200 milliseconds.\n\n"); - printf("-maxsmps \n" + printf("--maxsmps, -n \n" " This option specifies the number of VL15 SMP MADs\n" " allowed on the wire at any one time.\n" - " Specifying -maxsmps 0 allows unlimited outstanding\n" + " Specifying --maxsmps 0 allows unlimited outstanding\n" " SMPs.\n" - " Without -maxsmps, OpenSM defaults to a maximum of\n" + " Without --maxsmps, OpenSM defaults to a maximum of\n" " 4 outstanding SMPs.\n\n"); - printf("-console [off|local" + printf("--console, -q [off|local" #ifdef ENABLE_OSM_CONSOLE_SOCKET "|socket|loopback" #endif "]\n This option activates the OpenSM console (default off).\n\n"); #ifdef ENABLE_OSM_CONSOLE_SOCKET - printf("-console-port \n" + printf("--console-port, -C \n" " Specify an alternate telnet port for the console (default %d).\n\n", OSM_DEFAULT_CONSOLE_PORT); #endif - printf("-i \n" - "-ignore-guids \n" + printf("--ignore-guids, -i \n" " This option provides the means to define a set of ports\n" " (by guid) that will be ignored by the link load\n" " equalization algorithm.\n\n"); - printf("-x\n" - "--honor_guid2lid\n" + printf("--honor_guid2lid, -x\n" " This option forces OpenSM to honor the guid2lid file,\n" " when it comes out of Standby state, if such file exists\n" " under OSM_CACHE_DIR, and is valid. 
By default, this is FALSE.\n\n"); - printf("-f\n" - "--log_file\n" + printf("--log_file, -f \n" " This option defines the log to be the given file.\n" " By default, the log goes to /var/log/opensm.log.\n" " For the log to go to standard output use -f stdout.\n\n"); - printf("-L \n" - "--log_limit \n" + printf("--log_limit, -L \n" " This option defines maximal log file size in MB. When\n" " specified the log file will be truncated upon reaching\n" " this limit.\n\n"); - printf("-e\n" - "--erase_log_file\n" + printf("--erase_log_file, -e\n" " This option will cause deletion of the log file\n" " (if it previously exists). By default, the log file\n" " is accumulative.\n\n"); - printf("-P\n" - "--Pconfig\n" + printf("--Pconfig, -P \n" " This option defines the optional partition configuration file.\n" " The default name is \'" OSM_DEFAULT_PARTITION_CONFIG_FILE "\'.\n\n"); - printf("-Q\n" "--qos\n" " This option enables QoS setup.\n\n"); - printf("-Y\n" - "--qos_policy_file\n" + printf("--no_part_enforce, -N\n" + " This option disables partition enforcement on switch external ports.\n\n"); + printf("--qos, -Q\n" " This option enables QoS setup.\n\n"); + printf("--qos_policy_file, -Y \n" " This option defines the optional QoS policy file.\n" " The default name is \'" OSM_DEFAULT_QOS_POLICY_FILE "\'.\n\n"); - printf("-N\n" "--no_part_enforce\n" - " This option disables partition enforcement on switch external ports.\n\n"); - printf("-y\n" "--stay_on_fatal\n" + printf("--stay_on_fatal, -y\n" " This option will cause SM not to exit on fatal initialization\n" " issues: if SM discovers duplicated guids or 12x link with\n" " lane reversal badly configured.\n" " By default, the SM will exit on these errors.\n\n"); - printf("-B\n" "--daemon\n" + printf("--daemon, -B\n" " Run in daemon mode - OpenSM will run in the background.\n\n"); - printf("-I\n" "--inactive\n" + printf("--inactive, -I\n" " Start SM in inactive rather than normal init SM state.\n\n"); #ifdef 
ENABLE_OSM_PERF_MGR printf("--perfmgr\n" " Start with PerfMgr enabled.\n\n"); @@ -316,20 +289,19 @@ static void show_usage(void) printf("--consolidate_ipv6_snm_req\n" " Consolidate IPv6 Solicited Node Multicast group joins\n" " into 1 IB multicast group.\n\n"); - printf("-v\n" - "--verbose\n" + printf("--verbose, -v\n" " This option increases the log verbosity level.\n" " The -v option may be specified multiple times\n" " to further increase the verbosity level.\n" " See the -D option for more information about\n" " log verbosity.\n\n"); - printf("-V\n" + printf("--V, -V\n" " This option sets the maximum verbosity level and\n" " forces log flushing.\n" " The -V is equivalent to '-D 0xFF -d 2'.\n" " See the -D option for more information about\n" " log verbosity.\n\n"); - printf("-D \n" + printf("--D, -D \n" " This option sets the log verbosity level.\n" " A flags field must follow the -D option.\n" " A bit set/clear in the flags enables/disables a\n" @@ -349,8 +321,7 @@ static void show_usage(void) " Specifying -D 0xFF enables all messages (see -V).\n" " High verbosity levels may require increasing\n" " the transaction timeout with the -t option.\n\n"); - printf("-d \n" - "--debug \n" + printf("--debug, -d \n" " This option specifies a debug option.\n" " These options are not normally needed.\n" " The number following -d selects the debug\n" @@ -363,9 +334,8 @@ static void show_usage(void) " -d3 - Disable multicast support\n" " -d10 - Put OpenSM in testability mode\n" " Without -d, no debug options are enabled\n\n"); - printf("-h\n" - "--help\n" " Display this usage info then exit.\n\n"); - printf("-?\n" " Display this usage info then exit.\n\n"); + printf("--help, -h, -?\n" + " Display this usage info then exit.\n\n"); fflush(stdout); exit(2); } @@ -546,7 +516,7 @@ int main(int argc, char *argv[]) uint32_t val; unsigned config_file_done = 0; const char *const short_option = - "F:c:i:f:ed:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:NBIQvVhoryxp:n:q:k:C:"; + 
"F:c:i:f:ed:D:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:NBIQvVhoryxp:n:q:k:C:"; /* In the array below, the 2nd parameter specifies the number -- 1.6.0.1.196.g01914 From kliteyn at dev.mellanox.co.il Sat Oct 11 16:54:37 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 12 Oct 2008 01:54:37 +0200 Subject: [ofa-general] Re: [PATCH 3/6] opensm/Unicast Routing Cache: add osm_ucast_cache.{c, h} files In-Reply-To: <20081009183223.GG4912@sashak.voltaire.com> References: <48E969A6.1000607@dev.mellanox.co.il> <20081009183223.GG4912@sashak.voltaire.com> Message-ID: <48F13CBD.30208@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > Hi Yevgeny, > > Comments are below. > > Also I made couple of incremental changes (which were obvious IMHO), > will post later... > > On 03:28 Mon 06 Oct , Yevgeny Kliteynik wrote: >> Implementation of the osm unicast routing cache. > > Would be nice to have more detailed comment here and also for the > "integration" patch. OK >> Signed-off-by: Yevgeny Kliteynik >> --- >> opensm/include/opensm/osm_ucast_cache.h | 439 ++++++++++++ >> opensm/opensm/osm_ucast_cache.c | 1176 +++++++++++++++++++++++++++++++ >> 2 files changed, 1615 insertions(+), 0 deletions(-) >> create mode 100644 opensm/include/opensm/osm_ucast_cache.h >> create mode 100644 opensm/opensm/osm_ucast_cache.c >> >> diff --git a/opensm/include/opensm/osm_ucast_cache.h b/opensm/include/opensm/osm_ucast_cache.h >> new file mode 100644 >> index 0000000..2dc1c4e >> --- /dev/null >> +++ b/opensm/include/opensm/osm_ucast_cache.h > > [snip...] > >> +/****s* OpenSM: Unicast Cache/osm_ucast_cache_t >> +* NAME >> +* osm_ucast_cache_t >> +* >> +* DESCRIPTION >> +* Unicast Cache structure. >> +* >> +* This object should be treated as opaque and should >> +* be manipulated only through the provided functions. >> +* >> +* SYNOPSIS >> +*/ >> +typedef struct _osm_ucast_cache { > > There are no _osm_* structure names in OpenSM. Please keep things > consistent. 
(Will post the patch) Thanks >> + cl_qmap_t sw_tbl; >> + boolean_t valid; >> + struct osm_ucast_mgr * p_ucast_mgr; >> +} osm_ucast_cache_t; > > The object itself is pretty small, actually there are only sw_tbl map > and valid flag. Why to not do it as part of struct osm_ucast_mgr and to > save a lot of code like p_cache->p_ucast_mgr->..., construct, etc...? Well, the osm_ucast_cache_t once had much more stuff, so it's a separate object for historical reasons. Interesting how feature that technically never existed already has "historical reasons"... :) Anyway, I agree with you - no reason to have it as separate object right now. I'll fix it in the next version of patches. > [snip...] > >> diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c >> new file mode 100644 >> index 0000000..2c2154a >> --- /dev/null >> +++ b/opensm/opensm/osm_ucast_cache.c > > [snip...] > >> +static cache_switch_t * >> +__cache_sw_new(uint16_t lid_ho) >> +{ >> + cache_switch_t * p_cache_sw = >> + (cache_switch_t *)malloc(sizeof(cache_switch_t)); >> + if (!p_cache_sw) >> + return NULL; >> + >> + memset(p_cache_sw, 0, sizeof(cache_switch_t)); >> + >> + p_cache_sw->ports = (cache_port_t *)malloc(sizeof(cache_port_t)); >> + if (!p_cache_sw->ports) { >> + free(p_cache_sw); >> + return NULL; >> + } > > Is it really helpful to alloc only one port at init time and realloc > later? Normally cache will not be huge, OTOH it saves some flow like > (port_num >= p_cache_sw->num_ports). Maybe I'm missing more complicated > cases? I can define some minimal size of the ports array (let's say 36 plus port0), and then I can increase it each time I need to cache some port higher than p_cache_sw->num_ports (no such switches now, but will be in the future). 
But frankly, it won't save much runtime, because every time before the cache validation cache should be cleaned up by removing all the switches that don't have any cached ports (all the links were restored during the last discovery), so all the port arrays will be scanned. Still, saving some reallocations would be a nice idea - I'll fix it. >> + >> + /* port[0] fields represent this switch details - lid and type */ >> + p_cache_sw->ports[0].remote_lid_ho = lid_ho; >> + p_cache_sw->ports[0].is_leaf = FALSE; >> + >> + return p_cache_sw; >> +} > > [snip...] > >> +/********************************************************************** >> + **********************************************************************/ >> + >> +static void >> +__cache_add_port(osm_ucast_cache_t * p_cache, >> + uint16_t lid_ho, >> + uint8_t port_num, >> + uint16_t remote_lid_ho, >> + boolean_t is_ca) >> +{ >> + cache_switch_t * p_cache_sw; >> + >> + OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); >> + >> + if (!lid_ho || !remote_lid_ho || !port_num) >> + goto Exit; >> + >> + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE, >> + "Caching switch port: lid %u [port %u] -> lid %u (%s)\n", >> + lid_ho, port_num, remote_lid_ho, >> + (is_ca)? "CA/RTR" : "SW"); >> + >> + p_cache_sw = __cache_get_or_add_sw(p_cache, lid_ho); >> + if (!p_cache_sw) { >> + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_ERROR, >> + "ERR AD01: Out of memory - cache is invalid\n"); >> + osm_ucast_cache_invalidate(p_cache); >> + goto Exit; >> + } >> + >> + if (port_num >= p_cache_sw->num_ports) { >> + cache_port_t * ports = (cache_port_t *) >> + malloc(sizeof(cache_port_t)*(port_num+1)); > > As Hal already noted, no malloc() result check. Fixed > Also here and in other places - malloc() return 'void *' so why casting? > IMO it is confused. Fixed >> + memset(ports, 0, sizeof(cache_port_t)*(port_num+1)); > > [snip...] 
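For reference, a minimal sketch of the port array growth discussed above, assuming a starting size of 37 entries (36 ports plus port 0, as suggested in the reply) and a checked, cast-free allocation. The types are stand-ins that model only the fields mentioned in this thread, and cache_sw_grow_ports() and CACHE_MIN_PORTS are hypothetical names, not part of the posted patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in types: only the fields discussed in this thread are modeled. */
typedef struct cache_port {
	uint16_t remote_lid_ho;
	int is_leaf;
} cache_port_t;

typedef struct cache_switch {
	cache_port_t *ports;
	unsigned num_ports;
} cache_switch_t;

/* Hypothetical minimum size: 36 external ports plus port 0. */
#define CACHE_MIN_PORTS 37

/*
 * Make sure ports[port_num] exists.
 * Returns 0 on success, -1 on allocation failure; on failure the
 * existing array is left untouched so the caller can invalidate the
 * cache and clean up normally.
 */
static int cache_sw_grow_ports(cache_switch_t *p_cache_sw, unsigned port_num)
{
	unsigned new_num;
	cache_port_t *ports;

	if (port_num < p_cache_sw->num_ports)
		return 0;

	new_num = port_num + 1;
	if (new_num < CACHE_MIN_PORTS)
		new_num = CACHE_MIN_PORTS;

	/* realloc(NULL, n) behaves like malloc(n); no cast is needed in C */
	ports = realloc(p_cache_sw->ports, new_num * sizeof(*ports));
	if (!ports)
		return -1;

	/* zero only the newly added tail; existing entries are preserved */
	memset(ports + p_cache_sw->num_ports, 0,
	       (new_num - p_cache_sw->num_ports) * sizeof(*ports));

	p_cache_sw->ports = ports;
	p_cache_sw->num_ports = new_num;
	return 0;
}
```

A caller in the style of __cache_add_port() would invoke this before touching ports[port_num] and invalidate the cache when it returns -1.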
> >> +static void >> +__cache_check_link_change(osm_ucast_cache_t * p_cache, >> + osm_physp_t * p_physp_1, >> + osm_physp_t * p_physp_2) >> +{ >> + OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); >> + CL_ASSERT(p_physp_1 && p_physp_2); >> + >> + if (!p_cache->valid) >> + goto Exit; >> + >> + if (!p_physp_1->p_remote_physp && !p_physp_2->p_remote_physp) >> + /* both ports were down - new link */ >> + goto Exit; >> + >> + /* unicast cache cannot tolerate any link location change */ >> + >> + if ((p_physp_1->p_remote_physp && >> + p_physp_1->p_remote_physp->p_remote_physp) || >> + (p_physp_2->p_remote_physp && >> + p_physp_2->p_remote_physp->p_remote_physp)) { > > Will this handle port moving during discovery? When duplicated guid > detection is passed ports may have old "remotes". I need to think about it. Will answer in a separate mail. >> + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, >> + "Link location change discovered - cache is invalid\n"); > > Here and in other places in this file: OSM_LOG_INFO is pretty overused > (it is likely this file has more OSM_LOG_INFO than all other OpenSM > parts together), Cache should print *one* log message that will tell whether the cached routing is used or not, same as any other routing engine does. Of all those OSM_LOG_INFO messages, only one can be printed at each heavy sweep. Don't think that it's too much, they just spread all over the code. > I think all OSM_LOG_VERBOSE should actually be replaced by OSM_LOG_DEBUG > and all OSM_LOG_INFO should become OSM_LOG_VERBOSE. No objection about the OSM_LOG_VERBOSE, but I do need your answer about the OSM_LOG_INFO. >> + osm_ucast_cache_invalidate(p_cache); >> + goto Exit; >> + } >> +Exit: >> + OSM_LOG_EXIT(p_cache->p_ucast_mgr->p_log); >> +} > > [snip...] 
> >> +static void >> +__cache_restore_ucast_info(osm_ucast_cache_t * p_cache, >> + cache_switch_t * p_cache_sw, >> + osm_switch_t * p_sw) >> +{ >> + if (!p_cache->valid) >> + return; >> + >> + /* when seting unicast info, the cached port >> + should have all the required info */ >> + CL_ASSERT(p_cache_sw->max_lid_ho && p_cache_sw->lft && >> + p_cache_sw->num_hops && p_cache_sw->hops); >> + >> + p_sw->max_lid_ho = p_cache_sw->max_lid_ho; >> + >> + if (p_sw->lft_buf) >> + free(p_sw->lft_buf); >> + p_sw->lft_buf = p_cache_sw->lft; >> + p_cache_sw->lft = NULL; >> + >> + p_sw->num_hops = p_cache_sw->num_hops; >> + p_cache_sw->num_hops = 0; >> + if (p_sw->hops) >> + free(p_sw->hops); >> + p_sw->hops = p_cache_sw->hops; >> + p_cache_sw->hops = NULL; > > This is nice :). > > sw->hops is array of pointers which could be allocated by routing > engine so in generic case we will need to free all sub-buffers first. > As far as I can see this function will be used for freshly discovered > switches only, if so it looks correct for me. Just to be sure... This function is used only for freshly discovered switches that were found in cache. > [snip...] > >> +void >> +osm_ucast_cache_validate(osm_ucast_cache_t * p_cache) >> +{ >> + cache_switch_t * p_cache_sw; >> + cache_switch_t * p_remote_cache_sw; >> + unsigned port_num; >> + unsigned max_ports; >> + uint8_t remote_node_type; >> + uint16_t lid_ho; >> + uint16_t remote_lid_ho; >> + osm_switch_t * p_sw; >> + osm_switch_t * p_remote_sw; >> + osm_node_t * p_node; >> + osm_physp_t * p_physp; >> + osm_physp_t * p_remote_physp; >> + osm_port_t * p_remote_port; >> + cl_qmap_t * p_node_guid_tbl; >> + >> + OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); >> + if (!p_cache->valid) >> + goto Exit; >> + >> + /* >> + * Scan all the physical switch ports in the subnet. >> + * If the port need_update flag is on, check whether >> + * it's just some node/port reset or a cached topology >> + * change. Otherwise the cache is invalid. 
>> + */ >> + p_node_guid_tbl = &p_cache->p_ucast_mgr->p_subn->node_guid_tbl; > > Then it should be: > > p_sw_tbl = &p_cache->p_ucast_mgr->p_subn->sw_guid_tbl... > > Will send the patch... Thanks >> + for (p_node = (osm_node_t *) cl_qmap_head(p_node_guid_tbl); >> + p_node != (osm_node_t *) cl_qmap_end(p_node_guid_tbl); >> + p_node = (osm_node_t *) cl_qmap_next(&p_node->map_item)) { >> + >> + if (osm_node_get_type(p_node) != IB_NODE_TYPE_SWITCH) >> + continue; >> + >> + lid_ho = cl_ntoh16(osm_node_get_base_lid(p_node,0)); >> + p_cache_sw = __cache_get_sw(p_cache, lid_ho); >> + >> + p_sw = p_node->sw; >> + max_ports = osm_node_get_num_physp(p_node); >> + >> + /* skip port 0 */ >> + for (port_num = 1; port_num < max_ports; port_num++) { >> + >> + p_physp = osm_node_get_physp_ptr(p_node, port_num); >> + >> + if (!p_physp || !p_physp->p_remote_physp || >> + !osm_physp_link_exists(p_physp, p_physp->p_remote_physp)) >> + /* no valid link */ >> + continue; >> + >> + /* >> + * While scanning all the physical ports in the subnet, >> + * mark corresponding leaf switches in the cache. 
>> + */ >> + if (p_cache_sw && >> + !p_cache_sw->dropped && >> + !__cache_sw_is_leaf(p_cache_sw) && >> + p_physp->p_remote_physp->p_node && >> + osm_node_get_type( >> + p_physp->p_remote_physp->p_node) != >> + IB_NODE_TYPE_SWITCH) >> + __cache_sw_set_leaf(p_cache_sw); >> + >> + if (!p_physp->need_update) >> + continue; >> + >> + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE, >> + "Checking switch lid %u, port %u\n", >> + lid_ho, port_num); >> + >> + p_remote_physp = osm_physp_get_remote(p_physp); >> + remote_node_type = osm_node_get_type(p_remote_physp->p_node); >> + >> + if (remote_node_type == IB_NODE_TYPE_SWITCH) >> + remote_lid_ho = cl_ntoh16(osm_node_get_base_lid( >> + p_remote_physp->p_node, 0)); >> + else >> + remote_lid_ho = cl_ntoh16(osm_node_get_base_lid( >> + p_remote_physp->p_node, >> + osm_physp_get_port_num(p_remote_physp))); >> + >> + if (!p_cache_sw || >> + port_num >= p_cache_sw->num_ports || >> + !p_cache_sw->ports[port_num].remote_lid_ho) { >> + /* >> + * There is some uncached change on the port. >> + * In general, the reasons might be as follows: >> + * - switch reset >> + * - port reset (or port down/up) >> + * - quick connection location change >> + * - new link (or new switch) >> + * >> + * First two reasons allow cache usage, while >> + * the last two reasons should invalidate cache. >> + * >> + * In case of quick connection location change, >> + * cache would have been invalidated by >> + * osm_ucast_cache_check_new_link() function. >> + * >> + * In case of new link between two known nodes, >> + * cache also would have been invalidated by >> + * osm_ucast_cache_check_new_link() function. >> + * >> + * Another reason is cached link between two >> + * known switches went back. In this case the >> + * osm_ucast_cache_check_new_link() function would >> + * clear both sides of the link from the cache >> + * during the discovery process, so effectively >> + * this would be equivalent to port reset. 
>> + * >> + * So three possible reasons remain: >> + * - switch reset >> + * - port reset (or port down/up) >> + * - link of a new switch >> + * >> + * To validate cache, we need to check only the >> + * third reason - link of a new node/switch: >> + * - If this is the local switch that is new, >> + * then it should have (p_sw->need_update == 2). >> + * - If the remote node is switch and it's new, >> + * then it also should have >> + * (p_sw->need_update == 2). >> + * - If the remote node is CA/RTR and it's new, >> + * then its port should have is_new flag on. >> + */ >> + if (p_sw->need_update == 2) { >> + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, >> + "New switch found (lid %u) - " >> + "cache is invalid\n", >> + lid_ho); >> + osm_ucast_cache_invalidate(p_cache); >> + goto Exit; >> + } >> + >> + if (remote_node_type == IB_NODE_TYPE_SWITCH) { >> + >> + p_remote_sw = p_remote_physp->p_node->sw; >> + if (p_remote_sw->need_update == 2) { >> + /* this could also be case of >> + switch coming back with an >> + additional link that it >> + didn't have before */ >> + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, >> + "New switch/link found (lid %u) - " >> + "cache is invalid\n", >> + remote_lid_ho); >> + osm_ucast_cache_invalidate(p_cache); >> + goto Exit; >> + } > > Maybe not related to cache directly, but anyway related to "fast" switch > reset. When it happened we need to resend whole LFTs (recalculated or > from cache) to this switch. Is it handled? Was it handled? Nope. I have actually seen this problem recently. I think it should be handled in osm_ucast_mgr_set_fwd_table(). Right now the function doesn't check the sw->need_update flag. > [snip...] 
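The three-way check spelled out in the quoted comment block of osm_ucast_cache_validate() can be condensed into a single predicate. This is only a sketch of that decision, not code from the patch: the function name, its flattened parameters and the NEED_UPDATE_NEW macro are hypothetical, and the value 2 is taken from the quoted comment.

```c
#include <assert.h>

/* Value the quoted comment associates with a newly discovered switch. */
#define NEED_UPDATE_NEW 2

/*
 * Decide whether an uncached change on a switch port must invalidate
 * the cache. Returns 1 when the change is a link of a new node/switch,
 * 0 when it looks like a plain switch or port reset.
 */
static int uncached_change_invalidates(unsigned local_need_update,
				       int remote_is_switch,
				       unsigned remote_need_update,
				       int remote_port_is_new)
{
	/* the local switch itself is new */
	if (local_need_update == NEED_UPDATE_NEW)
		return 1;

	/* the remote switch is new (or came back with an extra link) */
	if (remote_is_switch)
		return remote_need_update == NEED_UPDATE_NEW;

	/* the remote CA/RTR is new: its port carries the is_new flag */
	return remote_port_is_new != 0;
}
```

Anything for which the predicate returns 0 is treated as equivalent to a switch or port reset, which per the comment still allows the cached routing to be used.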
> >> +void >> +osm_ucast_cache_check_new_link(osm_ucast_cache_t * p_cache, >> + osm_node_t * p_node_1, >> + uint8_t port_num_1, >> + osm_node_t * p_node_2, >> + uint8_t port_num_2) >> +{ >> + uint16_t lid_ho_1; >> + uint16_t lid_ho_2; >> + >> + OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); >> + >> + if (!p_cache->valid) >> + goto Exit; >> + >> + __cache_check_link_change(p_cache, >> + osm_node_get_physp_ptr(p_node_1, port_num_1), >> + osm_node_get_physp_ptr(p_node_2, port_num_2)); >> + >> + if (!p_cache->valid) >> + goto Exit; >> + >> + if (osm_node_get_type(p_node_1) != IB_NODE_TYPE_SWITCH && >> + osm_node_get_type(p_node_2) != IB_NODE_TYPE_SWITCH) { >> + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, >> + "Found CA/RTR-2-CA/RTR link - cache is invalid\n"); >> + osm_ucast_cache_invalidate(p_cache); >> + goto Exit; >> + } > > Here and in other places, should we care about back-to-back connections? > Maybe we need just to disable cache at all there is no switches in a > fabric... Hmm, come to think of it, in order to have a valid cache there should be a successful execution of osm_ucast_mgr_process() with routing engine, so if there's no switches in the subnet, the cache won't be valid. On the other hand, cache should take care of the case when SM port is disconnected and then connected with back-2-back link. I'll review the CA/RTR-2-CA/RTR connections in the cache - I'm sure I can remove some code in this area. > [snip...] > >> +void >> +osm_ucast_cache_add_link(osm_ucast_cache_t * p_cache, >> + osm_node_t * p_node_1, >> + uint8_t port_num_1, >> + osm_node_t * p_node_2, >> + uint8_t port_num_2) > > I looked at places where this function is used and think it would be > simpler to use prototype like: > > osm_ucast_cache_add_link(osm_ucast_cache_t * p_cache, > osm_physp_t *physp1, osm_physp_t *physp2) > > Will send the patch. 
Thanks >> +{ >> + uint16_t lid_ho_1; >> + uint16_t lid_ho_2; >> + >> + OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); >> + >> + if (!p_cache->valid) >> + goto Exit; >> + >> + if (osm_node_get_type(p_node_1) != IB_NODE_TYPE_SWITCH && >> + osm_node_get_type(p_node_2) != IB_NODE_TYPE_SWITCH) { >> + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, >> + "Dropping CA-2-CA link - cache invalid\n"); >> + osm_ucast_cache_invalidate(p_cache); >> + goto Exit; >> + } > > back-to-back again... > >> + >> + if (((osm_node_get_type(p_node_1) == IB_NODE_TYPE_SWITCH) && >> + (!osm_node_get_physp_ptr(p_node_1, 0) || >> + !osm_physp_is_valid(osm_node_get_physp_ptr(p_node_1, 0)))) || >> + ((osm_node_get_type(p_node_2) == IB_NODE_TYPE_SWITCH) && >> + (!osm_node_get_physp_ptr(p_node_2, 0) || >> + !osm_physp_is_valid(osm_node_get_physp_ptr(p_node_2, 0))))) { >> + /* we're caching a link when one of the nodes >> + has already been dropped and cached */ > > osm_node_get_physp_ptr() already checks port validity and return NULL if > it is not. Patch... Thanks > [snip...] 
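To illustrate the review comment above: if the lookup helper already returns NULL for an invalid port, the separate validity test on port 0 is redundant and the condition collapses to one NULL check per node. The sketch below uses stand-in types and a stand-in get_physp_ptr() that merely models the behaviour described by the reviewer; it is not the OpenSM implementation, and switch_port0_dropped() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types; the real osm_physp_t and osm_node_t carry far more state. */
typedef struct physp {
	int valid;
} physp_t;

typedef struct node {
	int is_switch;
	unsigned num_physp;
	physp_t *physp;
} node_t;

/*
 * Stand-in for the lookup helper, modeling the behaviour described in
 * the review: NULL is returned for an out-of-range or invalid port.
 */
static physp_t *get_physp_ptr(node_t *p_node, unsigned port_num)
{
	if (port_num >= p_node->num_physp || !p_node->physp[port_num].valid)
		return NULL;
	return &p_node->physp[port_num];
}

/*
 * A switch whose port 0 can no longer be looked up has already been
 * dropped; with the helper above a single NULL check is enough.
 */
static int switch_port0_dropped(node_t *p_node)
{
	return p_node->is_switch && !get_physp_ptr(p_node, 0);
}
```

With such a helper, the four-clause condition in osm_ucast_cache_add_link() would reduce to switch_port0_dropped(p_node_1) || switch_port0_dropped(p_node_2).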
> >> +void >> +osm_ucast_cache_add_node(osm_ucast_cache_t * p_cache, >> + osm_node_t * p_node) >> +{ >> + uint16_t lid_ho; >> + uint8_t max_ports; >> + uint8_t port_num; >> + osm_physp_t * p_physp; >> + osm_node_t * p_remote_node; >> + cache_switch_t * p_cache_sw; >> + >> + OSM_LOG_ENTER(p_cache->p_ucast_mgr->p_log); >> + >> + if (!p_cache->valid) >> + goto Exit; >> + >> + if (osm_node_get_type(p_node) == IB_NODE_TYPE_SWITCH) { >> + >> + lid_ho = cl_ntoh16(osm_node_get_base_lid(p_node,0)); >> + >> + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_VERBOSE, >> + "Caching dropped switch lid %u\n", lid_ho); >> + >> + if (!p_node->sw) { >> + /* something is wrong - forget about cache */ >> + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_ERROR, >> + "ERR AD02: no switch info for node lid %u -" >> + " clearing cache\n", lid_ho); >> + osm_ucast_cache_invalidate(p_cache); >> + goto Exit; >> + } >> + >> + /* unlink (add to cache) all the ports of this switch */ >> + max_ports = osm_node_get_num_physp(p_node); >> + for (port_num = 1; port_num < max_ports; port_num++) { >> + >> + p_physp = osm_node_get_physp_ptr(p_node, port_num); >> + if (!p_physp || !p_physp->p_node || >> + !p_physp->p_remote_physp || >> + !p_physp->p_remote_physp->p_node) > > Can p_physp->p_node be NULL? Fixed (here and in some other place). >> + continue; >> + >> + osm_ucast_cache_add_link(p_cache, p_node, port_num, >> + p_physp->p_remote_physp->p_node, >> + p_physp->p_remote_physp->port_num); >> + } >> + >> + /* >> + * All the ports have been dropped (cached). >> + * If one of the ports was connected to CA/RTR, >> + * then the cached switch would be marked as leaf. >> + * If it isn't, then the dropped switch isn't a leaf, >> + * and cache can't handle it. 
>> + */ >> + >> + p_cache_sw = __cache_get_sw(p_cache, lid_ho); >> + CL_ASSERT(p_cache_sw); >> + >> + if (!__cache_sw_is_leaf(p_cache_sw)) { >> + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, >> + "Dropped non-leaf switch (lid %u) - " >> + "cache is invalid\n", lid_ho); >> + osm_ucast_cache_invalidate(p_cache); >> + goto Exit; >> + } >> + >> + p_cache_sw->dropped = TRUE; >> + >> + if (!p_node->sw->num_hops || !p_node->sw->hops) { >> + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, >> + "No LID matrices for switch lid %u - " >> + "cache is invalid\n", lid_ho); >> + osm_ucast_cache_invalidate(p_cache); >> + goto Exit; >> + } >> + >> + /* lid matrices */ >> + >> + p_cache_sw->num_hops = p_node->sw->num_hops; >> + p_node->sw->num_hops = 0; >> + p_cache_sw->hops = p_node->sw->hops; >> + p_node->sw->hops = NULL; >> + >> + /* linear forwarding table */ >> + >> + p_cache_sw->lft = p_node->sw->lft_buf; >> + p_node->sw->lft_buf = NULL; >> + p_cache_sw->max_lid_ho = p_node->sw->max_lid_ho; >> + } >> + else { >> + /* dropping CA/RTR: add to cache all the ports of this switch */ >> + max_ports = osm_node_get_num_physp(p_node); >> + for (port_num = 0; port_num < max_ports; port_num++) { > > Any reason to start from port 0 and not 1? Nope, just copy-pasted from drop_mgr... Fixed. >> + >> + p_physp = osm_node_get_physp_ptr(p_node, port_num); >> + if (!p_physp || !p_physp->p_node || >> + !p_physp->p_remote_physp || >> + !p_physp->p_remote_physp->p_node) >> + continue; >> + >> + p_remote_node = p_physp->p_remote_physp->p_node; >> + if (osm_node_get_type(p_remote_node) != >> + IB_NODE_TYPE_SWITCH) { >> + /* CA/RTR to CA/RTR connection */ >> + OSM_LOG(p_cache->p_ucast_mgr->p_log, OSM_LOG_INFO, >> + "Dropping CA/RTR to CA/RTR connection - " >> + "cache is invalid\n"); >> + osm_ucast_cache_invalidate(p_cache); >> + goto Exit; >> + } > > back-to-back connection? Ditto. -- Yevgeny > Will send the patches. 
> > Sasha > From kliteyn at dev.mellanox.co.il Sat Oct 11 16:57:29 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 12 Oct 2008 01:57:29 +0200 Subject: [ofa-general] Re: [PATCH 5/6] opensm/Unicast Routing Cache: integrate cache into opensm In-Reply-To: <20081009171000.GE4912@sashak.voltaire.com> References: <48E969FE.7050507@dev.mellanox.co.il> <20081009171000.GE4912@sashak.voltaire.com> Message-ID: <48F13D69.5090606@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > Hi Yevgeny, > > On 03:29 Mon 06 Oct , Yevgeny Kliteynik wrote: > > [snip...] > >> @@ -818,27 +826,37 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * const p_mgr) >> /* >> If there are no switches in the subnet, we are done. >> */ >> - if (cl_qmap_count(p_sw_guid_tbl) == 0 || >> - ucast_mgr_setup_all_switches(p_mgr->p_subn) < 0) >> + if (cl_qmap_count(p_sw_guid_tbl) == 0) >> goto Exit; >> >> p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_NONE; >> - while (p_routing_eng) { >> - if (!ucast_mgr_route(p_routing_eng, p_osm)) >> - break; >> - p_routing_eng = p_routing_eng->next; >> - } >> + if (p_mgr->p_subn->opt.use_ucast_cache && >> + osm_ucast_cache_is_valid(p_mgr->p_cache)) { >> + OSM_LOG(p_mgr->p_log, OSM_LOG_INFO, >> + "Configuring switch tables using cached routing\n"); >> + osm_ucast_cache_apply(p_mgr->p_cache); >> >> - if (p_osm->routing_engine_used == OSM_ROUTING_ENGINE_TYPE_NONE) { >> - /* If configured routing algorithm failed, use default MinHop */ >> - osm_ucast_mgr_build_lid_matrices(p_mgr); >> - ucast_mgr_build_lfts(p_mgr); >> - p_osm->routing_engine_used = OSM_ROUTING_ENGINE_TYPE_MINHOP; >> - } >> + } else { > > I think this will break some routing engines (such LASH) and logging > because p_osm->routing_engine_used is leaved as > OSM_ROUTING_ENGINE_TYPE_NONE. 
> > And since we have cache validation calls in do_sweep() anyway: > > + if (sm->p_subn->opt.use_ucast_cache) > + osm_ucast_cache_validate(sm->ucast_mgr.p_cache); > > Isn't it would be better to not touch osm_ucast_mgr_process() at all and > instead to replace lines above by: > > if (!sm->p_subn->opt.use_ucast_cache || > osm_ucast_cache_process(sm->ucast_mgr.p_cache)) > > This also saves couple of public calls in ucast cache. I think then the > patch could look like below. Agreed? Yes, this is a good idea. Thanks for the patch. -- Yevgeny From kliteyn at dev.mellanox.co.il Sat Oct 11 16:59:40 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 12 Oct 2008 01:59:40 +0200 Subject: [ofa-general] Re: [PATCH 0/6] opensm: Unicast Routing Cache In-Reply-To: <20081010082827.GX4912@sashak.voltaire.com> References: <48E96928.8030200@dev.mellanox.co.il> <20081009171103.GF4912@sashak.voltaire.com> <48EEB00E.7000209@dev.mellanox.co.il> <20081010082827.GX4912@sashak.voltaire.com> Message-ID: <48F13DEC.2030109@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > Hi Yevgeny, > > On 03:29 Fri 10 Oct , Yevgeny Kliteynik wrote: >> Thanks for the review and the patches. Didn't manage to address >> all your comments yet - will do it tomorrow. >> One question though: how to deal with the incremental patches that >> you sent me? Should I apply them to my branch and then issue one >> V2 patch instead of the old one, or will you apply the original >> patch, followed by all the incremental (yours and mine)? > > It is up to you. You can merge all in single V2 (guess it is simpler) I'll send a v2 patches for 3/6 and 5/6 when I'm done fixing all the stuff. -- Yevgeny > or leave it unchanged and I will apply later. Except integration patch > others are not critical IMHO. 
> > Sasha > From vlad at lists.openfabrics.org Sun Oct 12 03:16:36 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 12 Oct 2008 03:16:36 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20081012-0200 daily build status Message-ID: <20081012101636.89297E60C87@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 
with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From hal.rosenstock at gmail.com Sun Oct 12 04:38:49 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sun, 12 Oct 2008 07:38:49 -0400 Subject: [ofa-general] ***SPAM*** Re: [PATCH] opensm: update Release Notes for OpenSM version 3.2 In-Reply-To: <20081011215113.GN6947@sashak.voltaire.com> References: <20081011215113.GN6947@sashak.voltaire.com> Message-ID: On Sat, Oct 11, 2008 at 5:51 PM, Sasha Khapyorsky wrote: > > Update Release Notes for OpenSM version 3.2 > > Signed-off-by: Sasha Khapyorsky > --- > opensm/doc/opensm_release_notes-3.2.txt | 294 ++++++++++++++++++------------- > 1 files changed, 175 insertions(+), 119 deletions(-) > > diff --git a/opensm/doc/opensm_release_notes-3.2.txt b/opensm/doc/opensm_release_notes-3.2.txt > index 7728849..68178c4 100644 > --- a/opensm/doc/opensm_release_notes-3.2.txt > +++ b/opensm/doc/opensm_release_notes-3.2.txt > @@ -17,104 +17,161 @@ This document includes the following sections: > dependencies) > 1.1 Major New Features \> +* Solicited Node Multicast addresses consolidation IPv6 Solicited Node Multicast address consolidation > + When this mode is used (enabled with --consolidate_ipv6_snm_req option) > + OpenSM will map all IPv6 Solicited Node Multicast address join requests > + into a single Multicast group with address ff10:601b::1:ff00:0. In this > + way limited MLID space is saved. The feature is very useful with large > + (~> 1024 nodes) clusters. Isn't there truth in adversing ? This should also say: This IBA noncompliant feature may be problematic in heterogeneous subnets (due to rate and MTU differences). 
-- Hal From hal.rosenstock at gmail.com Sun Oct 12 07:35:36 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sun, 12 Oct 2008 10:35:36 -0400 Subject: [ofa-general] ***SPAM*** Re: [PATCH] opensm: update Release Notes for OpenSM version 3.2 In-Reply-To: References: <20081011215113.GN6947@sashak.voltaire.com> Message-ID: On Sun, Oct 12, 2008 at 7:38 AM, Hal Rosenstock wrote: > On Sat, Oct 11, 2008 at 5:51 PM, Sasha Khapyorsky wrote: >> >> Update Release Notes for OpenSM version 3.2 >> >> Signed-off-by: Sasha Khapyorsky >> --- >> opensm/doc/opensm_release_notes-3.2.txt | 294 ++++++++++++++++++------------- >> 1 files changed, 175 insertions(+), 119 deletions(-) >> >> diff --git a/opensm/doc/opensm_release_notes-3.2.txt b/opensm/doc/opensm_release_notes-3.2.txt >> index 7728849..68178c4 100644 >> --- a/opensm/doc/opensm_release_notes-3.2.txt >> +++ b/opensm/doc/opensm_release_notes-3.2.txt >> @@ -17,104 +17,161 @@ This document includes the following sections: >> dependencies) > > > >> 1.1 Major New Features > \> +* Solicited Node Multicast addresses consolidation > > IPv6 Solicited Node Multicast address consolidation > >> + When this mode is used (enabled with --consolidate_ipv6_snm_req option) >> + OpenSM will map all IPv6 Solicited Node Multicast address join requests >> + into a single Multicast group with address ff10:601b::1:ff00:0. In this >> + way limited MLID space is saved. The feature is very useful with large >> + (~> 1024 nodes) clusters. > > Isn't there truth in adversing ? This should also say: > > This IBA noncompliant feature may be problematic in heterogeneous > subnets (due to rate and MTU differences). Also: +* Support separate SA and SM keys as clarified in IBA 1.2.1 Not just separate keys but: The default key changes should be highlighted as well as incompatibility with previous OpenSM and saquery versions. 
-- Hal > -- Hal > > > From hal.rosenstock at gmail.com Sun Oct 12 11:50:44 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Sun, 12 Oct 2008 14:50:44 -0400 Subject: [ofa-general] ***SPAM*** Re: [PATCH] opensm: update Release Notes for OpenSM version 3.2 In-Reply-To: References: <20081011215113.GN6947@sashak.voltaire.com> Message-ID: On Sun, Oct 12, 2008 at 10:35 AM, Hal Rosenstock wrote: > On Sun, Oct 12, 2008 at 7:38 AM, Hal Rosenstock > wrote: >> On Sat, Oct 11, 2008 at 5:51 PM, Sasha Khapyorsky wrote: >>> >>> Update Release Notes for OpenSM version 3.2 >>> >>> Signed-off-by: Sasha Khapyorsky >>> --- >>> opensm/doc/opensm_release_notes-3.2.txt | 294 ++++++++++++++++++------------- >>> 1 files changed, 175 insertions(+), 119 deletions(-) >>> >>> diff --git a/opensm/doc/opensm_release_notes-3.2.txt b/opensm/doc/opensm_release_notes-3.2.txt >>> index 7728849..68178c4 100644 >>> --- a/opensm/doc/opensm_release_notes-3.2.txt >>> +++ b/opensm/doc/opensm_release_notes-3.2.txt >>> @@ -17,104 +17,161 @@ This document includes the following sections: >>> dependencies) >> >> >> >>> 1.1 Major New Features >> \> +* Solicited Node Multicast addresses consolidation >> >> IPv6 Solicited Node Multicast address consolidation >> >>> + When this mode is used (enabled with --consolidate_ipv6_snm_req option) >>> + OpenSM will map all IPv6 Solicited Node Multicast address join requests >>> + into a single Multicast group with address ff10:601b::1:ff00:0. In this >>> + way limited MLID space is saved. The feature is very useful with large >>> + (~> 1024 nodes) clusters. >> >> Isn't there truth in adversing ? This should also say: >> >> This IBA noncompliant feature may be problematic in heterogeneous >> subnets (due to rate and MTU differences). > > Also: > > +* Support separate SA and SM keys as clarified in IBA 1.2.1 > > Not just separate keys but: > The default key changes should be highlighted as well as > incompatibility with previous OpenSM and saquery versions. 
Should this go as a note in the Qualification section too ? Also, in that section, the note on ConnectX and QoS should be updated as the 2_5_000 release is out so that should say 2_5_000 or later. -- Hal > > -- Hal > >> -- Hal >> >> >> > From tziporet at dev.mellanox.co.il Sun Oct 12 14:07:10 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 12 Oct 2008 23:07:10 +0200 Subject: [ofa-general] ***SPAM*** PPC64 autobuild machine In-Reply-To: References: Message-ID: <48F266FE.3070104@mellanox.co.il> Toru Nishimura wrote: > Hi, OpenFabrics guys, > > I'm gathering info about PPC and OFED combination. My > understanding is that OFED autobuild system is generating > PPC64 binaries, in possibly self (native) build way. > What type of PPC64 machine is used for that? Machine > model name and configuration are wanted. Thanks in > advance. > OFED is not distributed in binaries, just in SRPMs. Mellanox provides binary RPMs, see http://www.mellanox.com/products/ofed.php Tziporet From tziporet at dev.mellanox.co.il Sun Oct 12 14:23:40 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 12 Oct 2008 23:23:40 +0200 Subject: [ofa-general] ***SPAM*** PPC64 platform and InfiniBand HW In-Reply-To: <2763777062A04D64A0CEBA827536F4B1@tpad2> References: <2763777062A04D64A0CEBA827536F4B1@tpad2> Message-ID: <48F26ADC.1090201@mellanox.co.il> Toru Nishimura wrote: > - I can see PPC64 is one of supported architecture. What > are PPC64 platform in use, for development and in-field > applications? We need proven / known-to-work PPC64 > platform(s) as a solid foundation to start the entire project. > Specific model names are wanted. We test ConnectX with IBM PPC blades - JS22 and QS21 systems Both Redhat and SLES OSes are working well. > > - What IB HW are in use for PPC64 combination? We are > consulted to use Mellanox LX PCIe card. Is the product > mature enough for PPC64 platforms? Mellanox LX PCIe card is in GA quality. 
However we have not tested it with PPC systems since we don't have appropriate HW systems Tziporet From tziporet at dev.mellanox.co.il Sun Oct 12 14:25:48 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 12 Oct 2008 23:25:48 +0200 Subject: [ofa-general] [GIT PULL] please pull infiniband.git In-Reply-To: References: Message-ID: <48F26B5C.2070609@mellanox.co.il> Roland Dreier wrote: > This will get the first batch of 2.6.28 merges -- pretty much all > low-level hardware driver and IP-over-IB fixes, along with a few little > things elsewhere, as the dirstat shows: > Can you say which of the new features are planned for 2.6.28 too? Thanks Tziporet From rdreier at cisco.com Sun Oct 12 14:37:06 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 12 Oct 2008 14:37:06 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git In-Reply-To: <48F26B5C.2070609@mellanox.co.il> (Tziporet Koren's message of "Sun, 12 Oct 2008 23:25:48 +0200") References: <48F26B5C.2070609@mellanox.co.il> Message-ID: > Can you say which of the new features are planned for 2.6.28 too? OK... I didn't get around to saying what's in my tree but I will send a summary of what was there. - R. From rpearson at systemfabricworks.com Sun Oct 12 15:21:00 2008 From: rpearson at systemfabricworks.com (Robert Pearson) Date: Sun, 12 Oct 2008 17:21:00 -0500 Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer overflow In-Reply-To: <20081009190549.GQ4912@sashak.voltaire.com> References: <20081008012149.GK7563@sashak.voltaire.com> <20081009190549.GQ4912@sashak.voltaire.com> Message-ID: <005901c92cb8$cf13dd80$6d3b9880$@com> How does lash know how many VLs are available? Especially, with ibsim. Is there a way to have lash report the number of VLs required independent of the type of switch used?
-----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Sasha Khapyorsky Sent: Thursday, October 09, 2008 2:06 PM To: Hal Rosenstock Cc: OpenIB Subject: Re: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer overflow On 07:04 Wed 08 Oct , Hal Rosenstock wrote: > > Minor simplification as it seems like this could just be: > > if (++lanes_needed > p_lash->vl_min) > goto Error_Not_Enough_Lanes; Works for me. Thanks! Sasha _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From rpearson at systemfabricworks.com Sun Oct 12 16:55:49 2008 From: rpearson at systemfabricworks.com (Robert Pearson) Date: Sun, 12 Oct 2008 18:55:49 -0500 Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer overflow In-Reply-To: <005901c92cb8$cf13dd80$6d3b9880$@com> References: <20081008012149.GK7563@sashak.voltaire.com> <20081009190549.GQ4912@sashak.voltaire.com> <005901c92cb8$cf13dd80$6d3b9880$@com> Message-ID: <000001c92cc6$13b90890$3b2b19b0$@com> I spent a little time looking at osm_ucast_lash.c and ibsim. It looks like ibsim reports vl_cap = 4 and op_vl = 1 by default for a switch. Osm_ucast_lash.c computes the minimum over all switches of op_vl as extracted from the port info mads starting from (5 which would correspond to VL 0-14 operational). It then uses the encoded value as though it was an integer instead of an encoding of an integer which seems wrong. I am not yet sure when the SM is supposed to set the op_vl field away from 1. If later then you are using the wrong value and should be comparing to the decoded value of vl_cap instead. 
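For reference, a minimal sketch of the decoding being discussed here, not taken from osm_ucast_lash.c. It assumes the PortInfo OperationalVLs field uses the usual IBA encoding (1 = VL0, 2 = VL0-1, 3 = VL0-3, 4 = VL0-7, 5 = VL0-14). The names decode_op_vls() and min_data_vls() are hypothetical helpers made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not an OpenSM function.
 * Converts an encoded OperationalVLs value (assumed range 1..5)
 * into a count of data VLs. Returns 0 for anything outside 1..5.
 */
static unsigned decode_op_vls(uint8_t op_vls)
{
	unsigned n;

	if (op_vls < 1 || op_vls > 5)
		return 0;
	n = 1u << (op_vls - 1);	/* 1, 2, 4, 8, 16 */
	return n > 15 ? 15 : n;	/* encoding 5 means VL0-14, i.e. 15 lanes */
}

/*
 * Hypothetical helper: minimum number of data VLs over a set of
 * switch ports, computed on decoded counts rather than on the raw
 * encodings. Entries that do not decode are skipped.
 */
static unsigned min_data_vls(const uint8_t *op_vls, size_t count)
{
	unsigned vl_min = 15;
	size_t i;

	for (i = 0; i < count; i++) {
		unsigned n = decode_op_vls(op_vls[i]);

		if (n != 0 && n < vl_min)
			vl_min = n;
	}
	return vl_min;
}
```

With encoded values 4, 3 and 5 the decoded counts are 8, 4 and 15, so the minimum is 4 lanes; that decoded number, not the raw encoding, is what a lane requirement would be compared against.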
-----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Robert Pearson Sent: Sunday, October 12, 2008 5:21 PM To: 'Sasha Khapyorsky'; 'Hal Rosenstock' Cc: 'OpenIB' Subject: RE: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer overflow How does lash know how many VLs are available? Especially, with ibsim. Is there a way to have lash report the number of VLs required independent of the type of switch used? -----Original Message----- From: general-bounces at lists.openfabrics.org [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Sasha Khapyorsky Sent: Thursday, October 09, 2008 2:06 PM To: Hal Rosenstock Cc: OpenIB Subject: Re: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer overflow On 07:04 Wed 08 Oct , Hal Rosenstock wrote: > > Minor simplification as it seems like this could just be: > > if (++lanes_needed > p_lash->vl_min) > goto Error_Not_Enough_Lanes; Works for me. Thanks! Sasha _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general _______________________________________________ general mailing list general at lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From hnguyen at linux.vnet.ibm.com Mon Oct 13 00:34:09 2008 From: hnguyen at linux.vnet.ibm.com (Hoang-Nam Nguyen) Date: Mon, 13 Oct 2008 09:34:09 +0200 Subject: [ofa-general] Re: [PATCH 1/1] IB/ehca: Disallow creating UC QP with SRQ In-Reply-To: References: <200810011306.31544.hnguyen@linux.vnet.ibm.com> Message-ID: <200810130934.09592.hnguyen@linux.vnet.ibm.com> Hi Roland, That looks good to me. Thanks for all help. 
Nam On Friday 10 October 2008 23:41, Roland Dreier wrote: > thanks, applied -- it didn't apply to the latest tree, because of the > flush CQE changes, so I merged it manually as below -- let me know if > this is wrong: > > commit 0540bbbe455e123a1692d26205ad1a29983883b0 > Author: Hoang-Nam Nguyen > Date: Fri Oct 10 14:40:39 2008 -0700 > > IB/ehca: Don't allow creating UC QP with SRQ > > This patch prevents a UC QP to be created attached to an SRQ, since > current firmware does not support this feature. > > Signed-off-by: Michael Faath > Signed-off-by: Roland Dreier > > diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c > index 4dbe287..40b578d 100644 > --- a/drivers/infiniband/hw/ehca/ehca_qp.c > +++ b/drivers/infiniband/hw/ehca/ehca_qp.c > @@ -502,6 +502,12 @@ static struct ehca_qp *internal_create_qp( > if (init_attr->srq) { > my_srq = container_of(init_attr->srq, struct ehca_qp, ib_srq); > > + if (qp_type == IB_QPT_UC) { > + ehca_err(pd->device, "UC with SRQ not supported"); > + atomic_dec(&shca->num_qps); > + return ERR_PTR(-EINVAL); > + } > + > has_srq = 1; > parms.ext_type = EQPT_SRQBASE; > parms.srq_qpn = my_srq->real_qp_num; > From anuj01 at gmail.com Mon Oct 13 03:06:01 2008 From: anuj01 at gmail.com (=?UTF-8?B?4KSF4KSo4KWB4KSc?=) Date: Mon, 13 Oct 2008 15:36:01 +0530 Subject: [ofa-general] ***SPAM*** cleanup resources using uverbs Message-ID: Hi I have used user space server and client programs for simple data transfer (rdma write, send and receive) by using uverbs (libibverbs). is it required to cleanup all the resources (pd, qp, cq, mr etc. ) allocated with in these programs explicitly by using uverbs (ibv_dealloc_pd, ibv_dereg_mr etc.) ? Or all the resources will be cleaned automatically when the programs exit? -- Anuj Aggarwal .''`. : :Ⓐ : # apt-get install hakuna-matata `. `'` `- -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From vlad at lists.openfabrics.org Mon Oct 13 03:17:17 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 13 Oct 2008 03:17:17 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20081013-0200 daily build status Message-ID: <20081013101717.9FF40E60B25@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with 
linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:

From ossrosch at linux.vnet.ibm.com Mon Oct 13 04:10:32 2008
From: ossrosch at linux.vnet.ibm.com (Stefan Roscher)
Date: Mon, 13 Oct 2008 13:10:32 +0200
Subject: [ofa-general] [PATCH]IB/ehca:reject dynamic memory add/remove
Message-ID: <200810131310.34413.ossrosch@linux.vnet.ibm.com>

Since the ehca device driver does not support dynamic memory add and remove
operations, the driver must explicitly reject such requests in order to prevent
unpredictable behaviors related to memory regions already occupied and being
used by InfiniBand applications.
The solution is to add a memory notifier to the ehca device driver and if a request
for dynamic memory add or remove comes in, ehca will always reject it.

Signed-off-by: Stefan Roscher
---

diff -Nurp linux-2.6.27-rc6-7/drivers/infiniband/hw/ehca/ehca_main.c linux-2.6.27-rc6-7.new/drivers/infiniband/hw/ehca/ehca_main.c
--- linux-2.6.27-rc6-7/drivers/infiniband/hw/ehca/ehca_main.c 2008-09-16 18:19:27.000000000 +0200
+++ linux-2.6.27-rc6-7.new/drivers/infiniband/hw/ehca/ehca_main.c 2008-10-03 13:52:50.000000000 +0200
@@ -44,6 +44,8 @@
 #include
 #endif
+#include
+#include
 #include "ehca_classes.h"
 #include "ehca_iverbs.h"
 #include "ehca_mrmw.h"
@@ -964,6 +966,41 @@ void ehca_poll_eqs(unsigned long data)
 	spin_unlock(&shca_list_lock);
 }
+static int ehca_mem_notifier(struct notifier_block *nb,
+			     unsigned long action, void *data)
+{
+	static unsigned long ehca_dmem_warn_time;
+
+	switch (action) {
+	case MEM_CANCEL_OFFLINE:
+	case MEM_CANCEL_ONLINE:
+	case MEM_ONLINE:
+	case MEM_OFFLINE:
+		return NOTIFY_OK;
+	case MEM_GOING_ONLINE:
+	case MEM_GOING_OFFLINE:
+		/* only ok if no hca is attached to the lpar */
+		spin_lock(&shca_list_lock);
+		if (list_empty(&shca_list)) {
+			spin_unlock(&shca_list_lock);
+			return NOTIFY_OK;
+		} else {
+			spin_unlock(&shca_list_lock);
+			if (printk_timed_ratelimit(&ehca_dmem_warn_time,
+						   30 * 1000))
+				ehca_gen_err("DMEM operations are not allowed"
+					     "as long as an ehca adapter is"
+					     "attached to the LPAR");
+			return NOTIFY_BAD;
+		}
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block ehca_mem_nb = {
+	.notifier_call = ehca_mem_notifier,
+};
+
 static int __init ehca_module_init(void)
 {
 	int ret;
@@ -991,6 +1028,12 @@ static int __init ehca_module_init(void)
 		goto module_init2;
 	}
+	ret = register_memory_notifier(&ehca_mem_nb);
+	if (ret) {
+		ehca_gen_err("Failed registering memory add/remove notifier");
+		goto module_init3;
+	}
+
 	if (ehca_poll_all_eqs != 1) {
 		ehca_gen_err("WARNING!!!");
 		ehca_gen_err("It is possible to lose interrupts.");
@@ -1003,6 +1046,9 @@ static int __init ehca_module_init(void)
 	return 0;
+module_init3:
+	ibmebus_unregister_driver(&ehca_driver);
+
 module_init2:
 	ehca_destroy_slab_caches();
@@ -1018,6 +1064,8 @@ static void __exit ehca_module_exit(void
 	ibmebus_unregister_driver(&ehca_driver);
+	unregister_memory_notifier(&ehca_mem_nb);
+
 	ehca_destroy_slab_caches();
 	ehca_destroy_comp_pool();

From hal.rosenstock at gmail.com Mon Oct 13 09:53:59 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 13 Oct 2008 12:53:59 -0400
Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer overflow
In-Reply-To: <005901c92cb8$cf13dd80$6d3b9880$@com>
References: <20081008012149.GK7563@sashak.voltaire.com> <20081009190549.GQ4912@sashak.voltaire.com> <005901c92cb8$cf13dd80$6d3b9880$@com>
Message-ID:

On Sun, Oct 12, 2008 at 6:21 PM, Robert Pearson wrote:
> How does lash know how many VLs are available?

From PortInfo:OpVLs

> Especially, with ibsim.

Same.

> Is there a way to have lash report the number of VLs required independent of
> the type of switch used?

Not sure what you mean by this.

-- Hal > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Sasha Khapyorsky > Sent: Thursday, October 09, 2008 2:06 PM > To: Hal Rosenstock > Cc: OpenIB > Subject: Re: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer > overflow > > On 07:04 Wed 08 Oct , Hal Rosenstock wrote: >> >> Minor simplification as it seems like this could just be: >> >> if (++lanes_needed > p_lash->vl_min) >> goto Error_Not_Enough_Lanes; > > Works for me. Thanks! > > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From hal.rosenstock at gmail.com Mon Oct 13 09:57:36 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 13 Oct 2008 12:57:36 -0400 Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer overflow In-Reply-To: <000001c92cc6$13b90890$3b2b19b0$@com> References: <20081008012149.GK7563@sashak.voltaire.com> <20081009190549.GQ4912@sashak.voltaire.com> <005901c92cb8$cf13dd80$6d3b9880$@com> <000001c92cc6$13b90890$3b2b19b0$@com> Message-ID: On Sun, Oct 12, 2008 at 7:55 PM, Robert Pearson wrote: > I spent a little time looking at osm_ucast_lash.c and ibsim. > > It looks like ibsim reports vl_cap = 4 and op_vl = 1 by default for a > switch. Yes, ibsim only has one canned template for this. I think it is mainly that ibnetdiscover output format doesn't include this information currently and something needs to be assumed. > Osm_ucast_lash.c computes the minimum over all switches of op_vl as > extracted from the port info mads starting from (5 which would correspond to > > It then uses the encoded value as though it was an integer instead of an > encoding of an integer which seems wrong. 
osm_ucast_lash.c::discover_network_properties fixes this up properly: vl_min = 1 << (vl_min - 1); if (vl_min > 15) vl_min = 15; > I am not yet sure when the SM is supposed to set the op_vl field away from > 1. If later then you are using the wrong value and should be comparing to > the decoded value of vl_cap instead. I don't understand what you mean here. vl_cap would never be right to be used. -- Hal > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Robert Pearson > Sent: Sunday, October 12, 2008 5:21 PM > To: 'Sasha Khapyorsky'; 'Hal Rosenstock' > Cc: 'OpenIB' > Subject: RE: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer > overflow > > How does lash know how many VLs are available? Especially, with ibsim. > Is there a way to have lash report the number of VLs required independent of > the type of switch used? > > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Sasha Khapyorsky > Sent: Thursday, October 09, 2008 2:06 PM > To: Hal Rosenstock > Cc: OpenIB > Subject: Re: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer > overflow > > On 07:04 Wed 08 Oct , Hal Rosenstock wrote: >> >> Minor simplification as it seems like this could just be: >> >> if (++lanes_needed > p_lash->vl_min) >> goto Error_Not_Enough_Lanes; > > Works for me. Thanks! 
> > Sasha > From hal.rosenstock at gmail.com Mon Oct 13 10:04:34 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 13 Oct 2008 13:04:34 -0400 Subject: [ofa-general] ***SPAM*** OpenSM build issue In-Reply-To: <48F331A7.6060200@obsidianresearch.com> References: <48F331A7.6060200@obsidianresearch.com> Message-ID: Sasha, With your latest master, I now get: if gcc -DHAVE_CONFIG_H -I. -I. -I../include -I../include/opensm -I./../include -I./../../libibcommon/include -I./../../libibumad/include -I/usr/local/include -Wall -g -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -g -O2 -MT osm_event_plugin.o -MD -MP -MF ".deps/osm_event_plugin.Tpo" -c -o osm_event_plugin.o osm_event_plugin.c; \ then mv -f ".deps/osm_event_plugin.Tpo" ".deps/osm_event_plugin.Po"; else rm -f ".deps/osm_event_plugin.Tpo"; exit 1; fi In file included from ../include/opensm/osm_base.h:47, from ../include/opensm/osm_subnet.h:52, from ../include/opensm/osm_console_io.h:43, from ../include/opensm/osm_opensm.h:52, from osm_event_plugin.c:49: ../include/config.h:135:1: warning: "_OSM_CONFIG_H_" redefined In file included from ../include/opensm/osm_event_plugin.h:41, from osm_event_plugin.c:48: ../include/opensm/osm_config.h:12:1: warning: this is the location of the previous definition Is something stale in my environment or is this a current build issue ?
-- Hal From dave at linux.vnet.ibm.com Mon Oct 13 10:09:26 2008 From: dave at linux.vnet.ibm.com (Dave Hansen) Date: Mon, 13 Oct 2008 10:09:26 -0700 Subject: [ofa-general] Re: [PATCH]IB/ehca:reject dynamic memory add/remove In-Reply-To: <200810131310.34413.ossrosch@linux.vnet.ibm.com> References: <200810131310.34413.ossrosch@linux.vnet.ibm.com> Message-ID: <1223917766.29877.17.camel@nimitz> On Mon, 2008-10-13 at 13:10 +0200, Stefan Roscher wrote: > Since the ehca device driver does not support dynamic memory add and remove > operations, the driver must explicitly reject such requests in order to prevent > unpredictable behaviors related to memory regions already occupied and being > used by InfiniBand applications. > The solution is to add a memory notifier to the ehca device driver and if a request > for dynamic memory add or remove comes in, ehca will always reject it. Why doesn't the driver support it? This seems like an awfully extreme action to take. Do you have plans to support this in the driver soon? -- Dave From rpearson at systemfabricworks.com Mon Oct 13 11:45:44 2008 From: rpearson at systemfabricworks.com (Robert Pearson) Date: Mon, 13 Oct 2008 13:45:44 -0500 Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer overflow In-Reply-To: References: <20081008012149.GK7563@sashak.voltaire.com> <20081009190549.GQ4912@sashak.voltaire.com> <005901c92cb8$cf13dd80$6d3b9880$@com> <000001c92cc6$13b90890$3b2b19b0$@com> Message-ID: <00cf01c92d63$e6ce45e0$b46ad1a0$@com> Thanks Hal, I grabbed the head of tree and recompiled with some print statements thrown in. As you say the '<<' fixes the decoding correctly and ibsim is reporting 4 not 1 so life is good and I see no problems in ibsim or opensm. I was looking at the binary init values for port info for switches and it looked like the op_vl was set to 1 but when I ran opensm over it I saw 4 which is the correct answer. The last point is moot in this case. 
I artificially relaxed the limit on the number of VLs in my case and found that I needed 12 VLs to route my problem with lash so it is not going to work since the switches only support 8 data vls. Regards, Bob -----Original Message----- From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] Sent: Monday, October 13, 2008 11:58 AM To: Robert Pearson Cc: Sasha Khapyorsky; OpenIB Subject: Re: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer overflow On Sun, Oct 12, 2008 at 7:55 PM, Robert Pearson wrote: > I spent a little time looking at osm_ucast_lash.c and ibsim. > > It looks like ibsim reports vl_cap = 4 and op_vl = 1 by default for a > switch. Yes, ibsim only has one canned template for this. I think it is mainly that ibnetdiscover output format doesn't include this information currently and something needs to be assumed. > Osm_ucast_lash.c computes the minimum over all switches of op_vl as > extracted from the port info mads starting from (5 which would correspond to > > It then uses the encoded value as though it was an integer instead of an > encoding of an integer which seems wrong. osm_ucast_lash.c::discover_network_properties fixes this up properly: vl_min = 1 << (vl_min - 1); if (vl_min > 15) vl_min = 15; > I am not yet sure when the SM is supposed to set the op_vl field away from > 1. If later then you are using the wrong value and should be comparing to > the decoded value of vl_cap instead. I don't understand what you mean here. vl_cap would never be right to be used. -- Hal > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Robert Pearson > Sent: Sunday, October 12, 2008 5:21 PM > To: 'Sasha Khapyorsky'; 'Hal Rosenstock' > Cc: 'OpenIB' > Subject: RE: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer > overflow > > How does lash know how many VLs are available? Especially, with ibsim. 
> Is there a way to have lash report the number of VLs required independent of > the type of switch used? > > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Sasha Khapyorsky > Sent: Thursday, October 09, 2008 2:06 PM > To: Hal Rosenstock > Cc: OpenIB > Subject: Re: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer > overflow > > On 07:04 Wed 08 Oct , Hal Rosenstock wrote: >> >> Minor simplification as it seems like this could just be: >> >> if (++lanes_needed > p_lash->vl_min) >> goto Error_Not_Enough_Lanes; > > Works for me. Thanks! > > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From dotanba at gmail.com Mon Oct 13 19:31:04 2008 From: dotanba at gmail.com (Dotan Barak) Date: Tue, 14 Oct 2008 04:31:04 +0200 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** cleanup resources using uverbs In-Reply-To: References: Message-ID: <48F40468.8000500@gmail.com> अनुज wrote: > Hi > > I have used user space server and client programs for simple data > transfer (rdma write, send and receive) by using uverbs (libibverbs). > > is it required to cleanup all the resources (pd, qp, cq, mr etc. ) > allocated with in these programs explicitly by using uverbs > (ibv_dealloc_pd, ibv_dereg_mr etc.) ? > > Or all the resources will be cleaned automatically when the programs exit?
> One doesn't have to clean up the resources at user level; upon process termination, all of the resources are cleaned up automatically (at kernel level, it is a completely different story ...) Dotan > > > -- > Anuj Aggarwal > > .''`. > : :Ⓐ : # apt-get install hakuna-matata > `. `'` > `- > ------------------------------------------------------------------------ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Tue Oct 14 01:05:03 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Oct 2008 10:05:03 +0200 Subject: [ofa-general] Re: OpenSM build issue In-Reply-To: References: <48F331A7.6060200@obsidianresearch.com> Message-ID: <20081014080503.GE5528@sashak.voltaire.com> Hi Hal, On 13:04 Mon 13 Oct , Hal Rosenstock wrote: > > With your latest master, I now get: > > if gcc -DHAVE_CONFIG_H -I. -I.
-I../include -I../include/opensm > -I./../include -I./../../libibcommon/include > -I./../../libibumad/include -I/usr/local/include -Wall -g > -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -g -O2 -MT osm_event_plugin.o -MD > -MP -MF ".deps/osm_event_plugin.Tpo" -c -o osm_event_plugin.o > osm_event_plugin.c; \ > then mv -f ".deps/osm_event_plugin.Tpo" > ".deps/osm_event_plugin.Po"; else rm -f ".deps/osm_event_plugin.Tpo"; > exit 1; fi > In file included from ../include/opensm/osm_base.h:47, > from ../include/opensm/osm_subnet.h:52, > from ../include/opensm/osm_console_io.h:43, > from ../include/opensm/osm_opensm.h:52, > from osm_event_plugin.c:49: > ../include/config.h:135:1: warning: "_OSM_CONFIG_H_" redefined > In file included from ../include/opensm/osm_event_plugin.h:41, > from osm_event_plugin.c:48: > ../include/opensm/osm_config.h:12:1: warning: this is the location of > the previous definition Are there any other warnings, or only this one (I don't have any)? What is your './configure' command line? > Is something stale in my environment or is this a current build issue ? I don't think it is your environment.
Try this:

diff --git a/opensm/opensm/osm_event_plugin.c b/opensm/opensm/osm_event_plugin.c
index 177740a..c6999f5 100644
--- a/opensm/opensm/osm_event_plugin.c
+++ b/opensm/opensm/osm_event_plugin.c
@@ -43,6 +43,10 @@
  *
  *********/
 
+#if HAVE_CONFIG_H
+# include
+#endif /* HAVE_CONFIG_H */
+
 #include
 #include
 #include

Sasha From sashak at voltaire.com Tue Oct 14 01:22:18 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Oct 2008 10:22:18 +0200 Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer overflow In-Reply-To: References: <20081008012149.GK7563@sashak.voltaire.com> <20081009190549.GQ4912@sashak.voltaire.com> <005901c92cb8$cf13dd80$6d3b9880$@com> Message-ID: <20081014082218.GF5528@sashak.voltaire.com> On 12:53 Mon 13 Oct , Hal Rosenstock wrote: > > > Is there a way to have lash report the number of VLs required independent of > > the type of switch used? > > Not sure what you mean by this. Maybe the question is how to know how many VLs LASH requires for a given topology? Sasha From sashak at voltaire.com Tue Oct 14 02:21:50 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Oct 2008 11:21:50 +0200 Subject: [ofa-general] Re: [PATCH] opensm: update Release Notes for OpenSM version 3.2 In-Reply-To: References: <20081011215113.GN6947@sashak.voltaire.com> Message-ID: <20081014092150.GG5528@sashak.voltaire.com> Hi Hal, On 07:38 Sun 12 Oct , Hal Rosenstock wrote: > > > + When this mode is used (enabled with --consolidate_ipv6_snm_req option) > > + OpenSM will map all IPv6 Solicited Node Multicast address join requests > > + into a single Multicast group with address ff10:601b::1:ff00:0. In this > > + way limited MLID space is saved. The feature is very useful with large > > + (~> 1024 nodes) clusters. > > Isn't there truth in advertising ? This should also say: > > This IBA noncompliant feature may be problematic in heterogeneous > subnets (due to rate and MTU differences).
The potential problem in heterogeneous subnets is not much different from IPv4 - 1x ports will not be able to join. Is that the case you meant? If so, IPv6 SNM becomes a much less critical issue compared to IPv4 not working. Sasha From sashak at voltaire.com Tue Oct 14 02:25:24 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Oct 2008 11:25:24 +0200 Subject: [ofa-general] Re: [PATCH] opensm: update Release Notes for OpenSM version 3.2 In-Reply-To: References: <20081011215113.GN6947@sashak.voltaire.com> Message-ID: <20081014092524.GH5528@sashak.voltaire.com> On 10:35 Sun 12 Oct , Hal Rosenstock wrote: > > +* Support separate SA and SM keys as clarified in IBA 1.2.1 > > Not just separate keys but: > The default key changes should be highlighted as well as > incompatibility with previous OpenSM and saquery versions. It is stated in "Bug Fixes" section. Sasha From sashak at voltaire.com Tue Oct 14 02:30:14 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Oct 2008 11:30:14 +0200 Subject: [ofa-general] Re: [PATCH] opensm: update Release Notes for OpenSM version 3.2 In-Reply-To: References: <20081011215113.GN6947@sashak.voltaire.com> Message-ID: <20081014093014.GI5528@sashak.voltaire.com> On 14:50 Sun 12 Oct , Hal Rosenstock wrote: > > > > +* Support separate SA and SM keys as clarified in IBA 1.2.1 > > > > Not just separate keys but: > > The default key changes should be highlighted as well as > > incompatibility with previous OpenSM and saquery versions. > > Should this go as a note in the Qualification section too ? Works for me, feel free to patch. > Also, in that section, the note on ConnectX and QoS should be updated > as the 2_5_000 release is out so that should say 2_5_000 or later. Ok.
Sasha From vlad at lists.openfabrics.org Tue Oct 14 03:15:54 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 14 Oct 2008 03:15:54 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20081014-0200 daily build status Message-ID: <20081014101554.56297E60C8D@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with 
linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From dave at linux.vnet.ibm.com Tue Oct 14 05:29:13 2008 From: dave at linux.vnet.ibm.com (Dave Hansen) Date: Tue, 14 Oct 2008 05:29:13 -0700 Subject: [ofa-general] Re: [PATCH]IB/ehca:reject dynamic memory add/remove In-Reply-To: <200810141423.48111.stefan.roscher@vnet.de.ibm.com> References: <200810131310.34413.ossrosch@linux.vnet.ibm.com> <1223917766.29877.17.camel@nimitz> <200810141423.48111.stefan.roscher@vnet.de.ibm.com> Message-ID: <1223987353.29877.40.camel@nimitz> On Tue, 2008-10-14 at 14:23 +0200, Stefan Roscher wrote: > On Monday 13 October 2008 07:09:26 pm Dave Hansen wrote: > > On Mon, 2008-10-13 at 13:10 +0200, Stefan Roscher wrote: > > > Since the ehca device driver does not support dynamic memory add and remove > > > operations, the driver must explicitly reject such requests in order to prevent > > > unpredictable behaviors related to memory regions already occupied and being > > > used by InfiniBand applications. > > > The solution is to add a memory notifier to the ehca device driver and if a request > > > for dynamic memory add or remove comes in, ehca will always reject it. > > > > Why doesn't the driver support it? > > > > This seems like an awfully extreme action to take. Do you have plans to > > support this in the driver soon? > > > There is currently a slight incompatibility how openfabrics uses MRs > and how System p does DMEM add/remove, which basically disables this > support. > If you want to talk to the firmware developpers, I can give you the right contacts. I wish I knew what an 'MR' is. :( Could you be a bit more specific so we can get a better changelog? Perhaps if we understand the situation better, we can come up with a better solution. 
Does this have anything in common with the problems with 16GB pages? -- Dave From rpearson at systemfabricworks.com Tue Oct 14 06:58:36 2008 From: rpearson at systemfabricworks.com (Robert Pearson) Date: Tue, 14 Oct 2008 08:58:36 -0500 Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer overflow In-Reply-To: <20081014082218.GF5528@sashak.voltaire.com> References: <20081008012149.GK7563@sashak.voltaire.com> <20081009190549.GQ4912@sashak.voltaire.com> <005901c92cb8$cf13dd80$6d3b9880$@com> <20081014082218.GF5528@sashak.voltaire.com> Message-ID: <002701c92e04$f6079a10$e216ce30$@com> Hi Sasha, The idea would be to let lash complete the analysis and report how many VLs would be required even if that is more than the number available. It would require allocating more memory than it does now. In my case I did this by hacking opensm to set the number of VLs to a large number (64) and letting it run. Bob -----Original Message----- From: Sasha Khapyorsky [mailto:sashak at voltaire.com] Sent: Tuesday, October 14, 2008 3:22 AM To: Hal Rosenstock Cc: Robert Pearson; OpenIB Subject: Re: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer overflow On 12:53 Mon 13 Oct , Hal Rosenstock wrote: > > > Is there a way to have lash report the number of VLs required independent of > > the type of switch used? > > Not sure what you mean by this. May be it is how to know how many VLs does LASH require for a given topology? 
Sasha From dave at linux.vnet.ibm.com Tue Oct 14 08:13:32 2008 From: dave at linux.vnet.ibm.com (Dave Hansen) Date: Tue, 14 Oct 2008 08:13:32 -0700 Subject: [ofa-general] Re: [PATCH]IB/ehca:reject dynamic memory add/remove In-Reply-To: <200810141423.48111.stefan.roscher@vnet.de.ibm.com> References: <200810131310.34413.ossrosch@linux.vnet.ibm.com> <1223917766.29877.17.camel@nimitz> <200810141423.48111.stefan.roscher@vnet.de.ibm.com> Message-ID: <1223997212.29877.98.camel@nimitz> On Tue, 2008-10-14 at 14:23 +0200, Stefan Roscher wrote: > On Monday 13 October 2008 07:09:26 pm Dave Hansen wrote: > > On Mon, 2008-10-13 at 13:10 +0200, Stefan Roscher wrote: > > > Since the ehca device driver does not support dynamic memory add and remove > > > operations, the driver must explicitly reject such requests in order to prevent > > > unpredictable behaviors related to memory regions already occupied and being > > > used by InfiniBand applications. > > > The solution is to add a memory notifier to the ehca device driver and if a request > > > for dynamic memory add or remove comes in, ehca will always reject it. > > > > Why doesn't the driver support it? > > > > This seems like an awfully extreme action to take. Do you have plans to > > support this in the driver soon? > > > There is currently a slight incompatibility how openfabrics uses MRs and how System p does DMEM add/remove, > which basically disables this support. > If you want to talk to the firmware developpers, I can give you the right contacts. OK, Stefan and Christoph have very patiently explained the whole situation to me. The ehca driver needs to register any memory to which it might write with the hypervisor (which then talks to the hardware). For normal apps, it does get_user_pages() on the userspace memory and tells the hypervisor which pages it got. But, there are in-kernel users of the hardware as well like NFS and the IP stack. 
These might potentially write anywhere in memory since, for instance, an skbuf can be allocated anywhere. Due to limitations in the Infiniband software stack, all these users must all share the same "L_KEY", which is basically the identifier of the individual Infiniband "user". So, ehca registers all of the partition's memory with the hypervisor when it is loaded to prepare for these in-kernel users. (I think of this as mmap("/dev/mem") from a device to kernel memory.) The size of this table is restricted by the starting size of the physical memory allocated to the partition, so we can't oversize it and just fill it in later as memory is added (hypervisor limitation). We also can't resize it at runtime because of other hypervisor limitations. The only way to change it is basically to shut the adapter down, which Infiniband wouldn't deal well with since it doesn't have any retransmitting (Infiniband limitation). We could restrict the kernel area to which the ehca driver could write. We would then just bounce buffer things in and out of it. But, that'd be a latency and complexity nightmare. We could probably also modify each of the existing in-kernel users (NFS, etc...) to check to see whether the memory they're about to touch has been registered with the hypervisor. They could only bounce in cases where it hadn't. We could probably also detect these in-kernel users and only deny hotplugging in case one of them is actually active. But, for now, we take the cowardly approach and simply disable memory hotplug. You can still hotplug to the system, you just need to un-hotplug the ehca adapter from the partition, first. This will, of course be well documented in the already huge IBM manual. :) Back to the patch... Could we be a bit more explicit that a user can go to the HMC (the IBM control console) and remove the adapter? I'm just trying to think of the poor user looking at dmesg. The dude/dudette doing this is going to be sitting at the HMC. 
Can we get a helpful message to pop up for them? Will they even see dmesg output? -- Dave From haven.hash at isilon.com Tue Oct 14 10:34:02 2008 From: haven.hash at isilon.com (Haven Hash) Date: Tue, 14 Oct 2008 10:34:02 -0700 Subject: [ofa-general] [PATCH][TRIVIAL]mad.c: Need parens to kmalloc correct amount of memory Message-ID: <1224005642.5997.8.camel@hhash-dev> I assume this has never been a problem because the malloc will probably word align the allocation, but maybe it was desired? Potential patch attached. Signed-off-by: Haven Hash Acked-by: Hal Rosenstock Haven Hash haven.hash at isilon.com- -------------- next part -------------- A non-text attachment was scrubbed... Name: mad.c.diff Type: text/x-patch Size: 703 bytes Desc: not available URL: From arlin.r.davis at intel.com Tue Oct 14 11:09:21 2008 From: arlin.r.davis at intel.com (Arlin Davis) Date: Tue, 14 Oct 2008 11:09:21 -0700 Subject: [ofa-general] [PATCH][v2.0] dat: look for dat.conf in multiple locations, including sysconfdir Message-ID: <000001c92e27$fb854ff0$4797070a@amr.corp.intel.com> Current static registration (SR) assumes DAT_OVERRIDE or /etc/dat.conf. Change SR to include sysconfdir. SR file access in the following order: - DAT_OVERRIDE - sysconfdir - /etc if DAT_OVERRIDE is set, assume administration override and do not failover to other locations.
Signed-off-by: Arlin Davis
---
 Makefile.am               |    4 ++--
 dat/udat/udat_sr_parser.c |   45 ++++++++++++++++++++++++++++++++-------------
 2 files changed, 34 insertions(+), 15 deletions(-)

diff --git a/Makefile.am b/Makefile.am
index 4929f83..bfc93f7 100755
--- a/Makefile.am
+++ b/Makefile.am
@@ -22,9 +22,9 @@ XPROGRAMS_SCM =
 endif
 
 if DEBUG
-AM_CFLAGS = -g -Wall -D_GNU_SOURCE -DDAPL_DBG
+AM_CFLAGS = -g -Wall -D_GNU_SOURCE -DDAPL_DBG -DDAT_CONF="\"$(sysconfdir)/dat.conf\""
 else
-AM_CFLAGS = -g -Wall -D_GNU_SOURCE
+AM_CFLAGS = -g -Wall -D_GNU_SOURCE -DDAT_CONF="\"$(sysconfdir)/dat.conf\""
 endif
 
 datlibdir = $(libdir)
diff --git a/dat/udat/udat_sr_parser.c b/dat/udat/udat_sr_parser.c
index 644e1c9..5b3f318 100644
--- a/dat/udat/udat_sr_parser.c
+++ b/dat/udat/udat_sr_parser.c
@@ -279,27 +279,44 @@ static DAT_SR_STACK_NODE *g_token_stack = NULL;
  * Function: dat_sr_load
  ***********************************************************************/
 
+
 DAT_RETURN
 dat_sr_load (void)
 {
     char *sr_path;
     DAT_OS_FILE *sr_file;
 
-    sr_path = dat_os_getenv (DAT_SR_CONF_ENV);
-    if ( sr_path == NULL )
-    {
-        sr_path = DAT_SR_CONF_DEFAULT;
-    }
+    sr_path = dat_os_getenv(DAT_SR_CONF_ENV);
 
-    dat_os_dbg_print (DAT_OS_DBG_TYPE_SR,
-                      "DAT Registry: static registry file <%s> \n", sr_path);
-
-    sr_file = dat_os_fopen (sr_path);
-    if ( sr_file == NULL )
+    /* environment override */
+    if ((sr_path != NULL) && ((sr_file = dat_os_fopen(sr_path)) == NULL))
     {
-        goto bail;
+        goto bail;
+    }
+
+    if (sr_path == NULL) {
+
+#ifdef DAT_CONF
+        sr_path = DAT_CONF;
+#else
+        sr_path = DAT_SR_CONF_DEFAULT;
+#endif
+        sr_file = dat_os_fopen (sr_path);
+        if (sr_file == NULL)
+        {
+#ifdef DAT_CONF
+            /* try default after sysconfdir fails */
+            sr_path = DAT_SR_CONF_DEFAULT;
+            sr_file = dat_os_fopen(sr_path);
+            if (sr_file == NULL)
+#endif
+                goto bail;
+        }
     }
 
+    dat_os_dbg_print(DAT_OS_DBG_TYPE_GENERIC,
+                     "DAT Registry: using config file %s\n", sr_path);
+
     for (;;)
     {
         if ( DAT_SUCCESS == dat_sr_parse_eof (sr_file) )
@@ -312,6 +329,8 @@
dat_sr_load (void)
         }
         else
         {
+            dat_os_dbg_print(DAT_OS_DBG_TYPE_ERROR,
+                             "ERROR: parsing dat.conf\n");
             goto cleanup;
         }
     }
@@ -324,8 +343,8 @@ dat_sr_load (void)
 cleanup:
     dat_os_fclose(sr_file);
 bail:
-    dat_os_dbg_print(DAT_OS_DBG_TYPE_ERROR,
-                     "ERROR: unable to parse static registry file, dat.conf\n");
+    dat_os_dbg_print(DAT_OS_DBG_TYPE_ERROR,
+                     "DAT Registry: dat.conf, bad filename - %s\n", sr_path);
     return DAT_INTERNAL_ERROR;
 
 }
-- 
1.5.2.5

From dledford at redhat.com Tue Oct 14 11:26:00 2008 From: dledford at redhat.com (Doug Ledford) Date: Tue, 14 Oct 2008 14:26:00 -0400 Subject: [ofa-general] Re: [PATCH][v2.0] dat: look for dat.conf in multiple locations, including sysconfdir In-Reply-To: <000001c92e27$fb854ff0$4797070a@amr.corp.intel.com> References: <000001c92e27$fb854ff0$4797070a@amr.corp.intel.com> Message-ID: <1224008760.4843.33.camel@firewall.xsintricity.com> On Tue, 2008-10-14 at 11:09 -0700, Arlin Davis wrote: > > Current static registration (SR) assumes DAT_OVERRIDE or /etc/dat.conf. > Change SR to include sysconfdir. SR file access in the following order: > > - DAT_OVERRIDE > - sysconfdir > - /etc > > if DAT_OVERRIDE is set, assume administration override > and do not failover to other locations. > > Signed-off-by: Arlin Davis I just had to patch up our dapl to work around this issue, so I certainly concur on fixing this up. This approach seems workable to me, although I think you need the same thing in the dapl-1.2 stream as well. Acked-by: Doug Ledford -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part URL: From dledford at redhat.com Tue Oct 14 11:31:55 2008 From: dledford at redhat.com (Doug Ledford) Date: Tue, 14 Oct 2008 14:31:55 -0400 Subject: [ofa-general] Re: [PATCH][v2.0] dat: look for dat.conf in multiple locations, including sysconfdir In-Reply-To: <000001c92e27$fb854ff0$4797070a@amr.corp.intel.com> References: <000001c92e27$fb854ff0$4797070a@amr.corp.intel.com> Message-ID: <1224009115.4843.36.camel@firewall.xsintricity.com> On Tue, 2008-10-14 at 11:09 -0700, Arlin Davis wrote: > @@ -324,8 +343,8 @@ dat_sr_load (void) > cleanup: > dat_os_fclose(sr_file); > bail: > - dat_os_dbg_print(DAT_OS_DBG_TYPE_ERROR, > - "ERROR: unable to parse static registry file, dat.conf\n"); > + dat_os_dbg_print(DAT_OS_DBG_TYPE_ERROR, > + "DAT Registry: dat.conf, bad filename - %s\n", sr_path); > return DAT_INTERNAL_ERROR; > > } One comment...since this is where we get when we fail, and we try multiple versions of sr_path, this error message is always going to show just the last path we tried, and won't reflect all the paths we tried. This might confuse people if they pass in a different location via DAT_OVERRIDE and it isn't found, but we also try the default dat locations and then they get this message. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part URL: From halr at obsidianresearch.com Tue Oct 14 13:11:02 2008 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Tue, 14 Oct 2008 14:11:02 -0600 Subject: [ofa-general] [PATCH] OpenSM release notes: Clarify OpenSM compatibility due to change in default SM/SA keys Message-ID: <48F4FCD6.1000300@obsidianresearch.com> Sasha, Please see attached for some tweaks to the latest release notes. -- Hal -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: patch-osmrn1 URL: From hal.rosenstock at gmail.com Tue Oct 14 13:13:26 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 14 Oct 2008 16:13:26 -0400 Subject: [ofa-general] Re: [PATCH] opensm: update Release Notes for OpenSM version 3.2 In-Reply-To: <20081014092150.GG5528@sashak.voltaire.com> References: <20081011215113.GN6947@sashak.voltaire.com> <20081014092150.GG5528@sashak.voltaire.com> Message-ID: Sasha, On Tue, Oct 14, 2008 at 5:21 AM, Sasha Khapyorsky wrote: > Hi Hal, > > On 07:38 Sun 12 Oct , Hal Rosenstock wrote: >> >> > + When this mode is used (enabled with --consolidate_ipv6_snm_req option) >> > + OpenSM will map all IPv6 Solicited Node Multicast address join requests >> > + into a single Multicast group with address ff10:601b::1:ff00:0. In this >> > + way limited MLID space is saved. The feature is very useful with large >> > + (~> 1024 nodes) clusters. >> >> Isn't there truth in advertising ? This should also say: >> >> This IBA noncompliant feature may be problematic in heterogeneous >> subnets (due to rate and MTU differences). > > The potential problem in heterogeneous subnets is not much different > from IPv4 - 1x ports will not be able to join. Is it the case you meant? > > If yes, IPv6 SNM becomes much less critical issue comparing to not > working IPv4. Isn't it at best equal rather than much less critical ? 
The reason I say at best equal is that the first joiner's parameters are used for all SNM groups, which could "lock out" the majority of users. That's not the case with IPv4 where the group parameters are configured and only those which don't match can't participate. There's also a difference in that one is IBA compliant and the other is not. So this doesn't warrant mentioning as far as you're concerned ? -- Hal > Sasha From rdreier at cisco.com Tue Oct 14 14:08:31 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 14 Oct 2008 14:08:31 -0700 Subject: [ofa-general] [PATCH][TRIVIAL]mad.c: Need parens to kmalloc correct amount of memory In-Reply-To: <1224005642.5997.8.camel@hhash-dev> (Haven Hash's message of "Tue, 14 Oct 2008 10:34:02 -0700") References: <1224005642.5997.8.camel@hhash-dev> Message-ID: thanks a lot for sending this with the Signed-off-by line... But actually, looking at the code it seems that if we're going to touch this area anyway, we might as well use krealloc() too. So I'll queue the following to merge:

IB/mad: Use krealloc() to resize snoop table

Use krealloc() instead of kmalloc() followed by memcpy() when resizing the MAD module's snoop table. Also put parentheses around the new table size to avoid calculating the wrong size to allocate, which fixes a bug pointed out by Haven Hash.

Signed-off-by: Roland Dreier
---
 drivers/infiniband/core/mad.c |   14 +++++---------
 1 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 49c45fe..5c54fc2 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -406,19 +406,15 @@ static int register_snoop_agent(struct ib_mad_qp_info *qp_info,
 
 	if (i == qp_info->snoop_table_size) {
 		/* Grow table. */
-		new_snoop_table = kmalloc(sizeof mad_snoop_priv *
-					  qp_info->snoop_table_size + 1,
-					  GFP_ATOMIC);
+		new_snoop_table = krealloc(qp_info->snoop_table,
+					   sizeof mad_snoop_priv *
+					   (qp_info->snoop_table_size + 1),
+					   GFP_ATOMIC);
 		if (!new_snoop_table) {
 			i = -ENOMEM;
 			goto out;
 		}
-		if (qp_info->snoop_table) {
-			memcpy(new_snoop_table, qp_info->snoop_table,
-			       sizeof mad_snoop_priv *
-			       qp_info->snoop_table_size);
-			kfree(qp_info->snoop_table);
-		}
+
 		qp_info->snoop_table = new_snoop_table;
 		qp_info->snoop_table_size++;
 	}
-- 
1.6.0.1

From andy.grover at oracle.com Tue Oct 14 16:10:40 2008 From: andy.grover at oracle.com (Andy Grover) Date: Tue, 14 Oct 2008 16:10:40 -0700 Subject: [ofa-general] Please pull RDS Message-ID: <48F526F0.3060007@oracle.com> Hi Vlad, This git tree has Jon Mason's initial iWarp rdma support. It also removes rds's seldom-used tcp transport, and moves everything to infiniband/ulp. Thanks -- Andy pull: www.openfabrics.org:/pub/scm/~agrover/ofed_1_4/linux-2.6.git code-drop/20081014 Andy Grover (4): RDS: fixup compilation errors & warnings RDS: Remove TCP transport support RDS: Move files from net/rds to drivers/infiniband/ulp/rds RDS: Alter build files for new RDS location Jon Mason (1): RDS: iWARP RDMA enablement From yossi.openib at gmail.com Wed Oct 15 01:17:49 2008 From: yossi.openib at gmail.com (Yossi Etigin) Date: Wed, 15 Oct 2008 10:17:49 +0200 Subject: [ofa-general] ***SPAM*** Re: [PATCH v2] ipoib: fix hang while bringing down uninitialized interface In-Reply-To: References: <48CEA6DC.9000904@gmail.com> Message-ID: <48F5A72D.8060004@gmail.com> Looks OK to me. Roland Dreier wrote: > > Fix bug #1172: If a pkey for an interface is not found during > > initialization, then poll_timer is left uninitialized. When the > > device is brought down, ipoib tries to del_timer_sync() it. This > > call hangs in an infinite loop in lock_timer_base(), because > > timer_base is NULL. We should check whether the timer was really > > initialized.
> > Sorry for being so slow to get to this. > > But does it work to just make sure the timer is always initialized? > Seems cleaner that way, and it makes the code an insignificant bit > smaller as a bonus. ie does the patch below fix things too? > > drivers/infiniband/ulp/ipoib/ipoib_ib.c | 7 +++---- > 1 files changed, 3 insertions(+), 4 deletions(-) > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > index 0e748ae..28eb6f0 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c > @@ -685,10 +685,6 @@ int ipoib_ib_dev_open(struct net_device *dev) > queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, > round_jiffies_relative(HZ)); > > - init_timer(&priv->poll_timer); > - priv->poll_timer.function = ipoib_ib_tx_timer_func; > - priv->poll_timer.data = (unsigned long)dev; > - > set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags); > > return 0; > @@ -906,6 +902,9 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) > return -ENODEV; > } > > + setup_timer(&priv->poll_timer, ipoib_ib_tx_timer_func, > + (unsigned long) dev); > + > if (dev->flags & IFF_UP) { > if (ipoib_ib_dev_open(dev)) { > ipoib_transport_dev_cleanup(dev); > From vlad at dev.mellanox.co.il Wed Oct 15 01:53:43 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 15 Oct 2008 10:53:43 +0200 Subject: [ofa-general] Re: [rds-devel] Please pull RDS In-Reply-To: <48F526F0.3060007@oracle.com> References: <48F526F0.3060007@oracle.com> Message-ID: <48F5AF97.3040805@dev.mellanox.co.il> Andy Grover wrote: > Hi Vlad, > > This git tree has Jon Mason's initial iWarp rdma support. It also > removes rds's seldom-used tcp transport, and moves everything to > infiniband/ulp. 
> > Thanks -- Andy > > pull: www.openfabrics.org:/pub/scm/~agrover/ofed_1_4/linux-2.6.git > code-drop/20081014 > > Andy Grover (4): > RDS: fixup compilation errors & warnings > RDS: Remove TCP transport support > RDS: Move files from net/rds to drivers/infiniband/ulp/rds > RDS: Alter build files for new RDS location > > Jon Mason (1): > RDS: iWARP RDMA enablement > > _______________________________________________ > rds-devel mailing list > rds-devel at oss.oracle.com > http://oss.oracle.com/mailman/listinfo/rds-devel > Done, Following your commits I did the following changes in the ofed git tree: RDS: include vmalloc.h to fix compilation issue on ppc64 and ia64 kernel_patches/backports: updated RDS backport patches due to moved files from net/rds to drivers/infiniband/ulp/rds ofed_scripts: updated scripts due to: moved files from net/rds to drivers/infiniband/ulp/rds RDS: removed TCP transport support Regards, Vladimir From vlad at lists.openfabrics.org Wed Oct 15 03:20:02 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 15 Oct 2008 03:20:02 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20081015-0200 daily build status Message-ID: <20081015102002.AC883E60A8A@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 
Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From sashak at voltaire.com Wed Oct 15 03:33:50 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 15 Oct 2008 12:33:50 +0200 Subject: [ofa-general] Re: [PATCH] opensm: update Release Notes for OpenSM version 3.2 In-Reply-To: References: <20081011215113.GN6947@sashak.voltaire.com> <20081014092150.GG5528@sashak.voltaire.com> Message-ID: <20081015103350.GK5528@sashak.voltaire.com> Hal, On 16:13 Tue 14 Oct , Hal Rosenstock wrote: > > Isn't it at best equal rather than much less critical ? The reason I > say at best equal is that it's the first joiner's parameters are used > for all SNM groups which could "lock out" the majority of users. > That's not the case with IPv4 where the group parameters are > configured and only those which don't match can't participate. 
Aren't mcmember record parameters (MTU and Rate) for IPv6 SNM join copied from IPv4's mcmember record? Sasha From sashak at voltaire.com Wed Oct 15 03:54:15 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 15 Oct 2008 12:54:15 +0200 Subject: [ofa-general] Re: [PATCH] OpenSM release notes: Clarify OpenSM compatibility due to change in default SM/SA keys In-Reply-To: <48F4FCD6.1000300@obsidianresearch.com> References: <48F4FCD6.1000300@obsidianresearch.com> Message-ID: <20081015105415.GN5528@sashak.voltaire.com> On 14:11 Tue 14 Oct , Hal Rosenstock wrote: > OpenSM release notes: Clarify OpenSM compatibility due to change in default > SM/SA keys > Indicate IBA noncompliance of consolidate_ipv6_snm feature > Also, some cosmetic changes > > Signed-off-by: Hal Rosenstock Applied with small change (as indicated below). Thanks. [snip...] > -6 Qualification > ----------------- > +6 Qualified Software Stacks and Devices > +--------------------------------------- > + > +OpenSM Compatibility > +-------------------- > +Note that OpenSM version 3.2.1 and earlier used a value of 1 I added "by default" here ^^^. Sasha From hal.rosenstock at gmail.com Wed Oct 15 04:28:10 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 15 Oct 2008 07:28:10 -0400 Subject: [ofa-general] Re: [PATCH] OpenSM release notes: Clarify OpenSM compatibility due to change in default SM/SA keys In-Reply-To: <20081015105415.GN5528@sashak.voltaire.com> References: <48F4FCD6.1000300@obsidianresearch.com> <20081015105415.GN5528@sashak.voltaire.com> Message-ID: On Wed, Oct 15, 2008 at 6:54 AM, Sasha Khapyorsky wrote: > On 14:11 Tue 14 Oct , Hal Rosenstock wrote: >> OpenSM release notes: Clarify OpenSM compatibility due to change in default >> SM/SA keys >> Indicate IBA noncompliance of consolidate_ipv6_snm feature >> Also, some cosmetic changes >> >> Signed-off-by: Hal Rosenstock > > Applied with small change (as indicated below). Thanks. > > [snip...] 
> >> -6 Qualification >> ----------------- >> +6 Qualified Software Stacks and Devices >> +--------------------------------------- >> + >> +OpenSM Compatibility >> +-------------------- >> +Note that OpenSM version 3.2.1 and earlier used a value of 1 > > I added "by default" here ^^^. It says "default SM_Key" on the next line. Now one of them should be removed. -- Hal > > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Wed Oct 15 04:34:59 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 15 Oct 2008 13:34:59 +0200 Subject: [ofa-general] Re: [PATCH] OpenSM release notes: Clarify OpenSM compatibility due to change in default SM/SA keys In-Reply-To: References: <48F4FCD6.1000300@obsidianresearch.com> <20081015105415.GN5528@sashak.voltaire.com> Message-ID: <20081015113459.GQ5528@sashak.voltaire.com> On 07:28 Wed 15 Oct , Hal Rosenstock wrote: > > It says "default SM_Key" on the next line. Now one of them should be removed. Ok, removed mine. Sasha From hal.rosenstock at gmail.com Wed Oct 15 04:39:09 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 15 Oct 2008 07:39:09 -0400 Subject: [ofa-general] ***SPAM*** Re: [PATCH] opensm: update Release Notes for OpenSM version 3.2 In-Reply-To: <20081015103350.GK5528@sashak.voltaire.com> References: <20081011215113.GN6947@sashak.voltaire.com> <20081014092150.GG5528@sashak.voltaire.com> <20081015103350.GK5528@sashak.voltaire.com> Message-ID: Sasha, On Wed, Oct 15, 2008 at 6:33 AM, Sasha Khapyorsky wrote: > Hal, > > On 16:13 Tue 14 Oct , Hal Rosenstock wrote: >> >> Isn't it at best equal rather than much less critical ? 
The reason I >> say at best equal is that it's the first joiner's parameters are used >> for all SNM groups which could "lock out" the majority of users. >> That's not the case with IPv4 where the group parameters are >> configured and only those which don't match can't participate. > > Aren't mcmember record parameters (MTU and Rate) for IPv6 SNM join > copied from IPv4's mcmember record? from the broadcast group (as all multicast groups do whether IPv6 or IPv4). So FWIW, it's the same issue as with IPv4 in that sense in terms of mismatched ports not being able to join (no worse, no better). -- Hal > Sasha From sashak at voltaire.com Wed Oct 15 04:49:08 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 15 Oct 2008 13:49:08 +0200 Subject: [ofa-general] Re: [PATCH] opensm: update Release Notes for OpenSM version 3.2 In-Reply-To: References: <20081011215113.GN6947@sashak.voltaire.com> <20081014092150.GG5528@sashak.voltaire.com> <20081015103350.GK5528@sashak.voltaire.com> Message-ID: <20081015114908.GR5528@sashak.voltaire.com> On 07:39 Wed 15 Oct , Hal Rosenstock wrote: > > > > Aren't mcmember record parameters (MTU and Rate) for IPv6 SNM join > > copied from IPv4's mcmember record? > > from the broadcast group (as all multicast groups do whether IPv6 or > IPv4). So FWIW, it's the same issue as with IPv4 in that sense in > terms of mismatched ports not being able to join (no worse, no > better). Ok. Then if this "random" bad port was not able to join the broadcast group, how it will create IPv6 SNM group for whole fabric? 
Sasha From hal.rosenstock at gmail.com Wed Oct 15 05:34:42 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 15 Oct 2008 08:34:42 -0400 Subject: [ofa-general] ***SPAM*** Re: [PATCH] opensm: update Release Notes for OpenSM version 3.2 In-Reply-To: <20081015114908.GR5528@sashak.voltaire.com> References: <20081011215113.GN6947@sashak.voltaire.com> <20081014092150.GG5528@sashak.voltaire.com> <20081015103350.GK5528@sashak.voltaire.com> <20081015114908.GR5528@sashak.voltaire.com> Message-ID: On Wed, Oct 15, 2008 at 7:49 AM, Sasha Khapyorsky wrote: > On 07:39 Wed 15 Oct , Hal Rosenstock wrote: >> > >> > Aren't mcmember record parameters (MTU and Rate) for IPv6 SNM join >> > copied from IPv4's mcmember record? >> >> from the broadcast group (as all multicast groups do whether IPv6 or >> IPv4). So FWIW, it's the same issue as with IPv4 in that sense in >> terms of mismatched ports not being able to join (no worse, no >> better). > > Ok. Then if this "random" bad port was not able to join the broadcast > group, how it will create IPv6 SNM group for whole fabric? Any group creation attempts by a mismatched port should fail. -- Hal > Sasha > From vlad at dev.mellanox.co.il Wed Oct 15 05:37:15 2008 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Wed, 15 Oct 2008 14:37:15 +0200 Subject: [ofa-general] Re: [rds-devel] Please pull RDS In-Reply-To: <48F5AF97.3040805@dev.mellanox.co.il> References: <48F526F0.3060007@oracle.com> <48F5AF97.3040805@dev.mellanox.co.il> Message-ID: <48F5E3FB.1020900@dev.mellanox.co.il> Vladimir Sokolovsky wrote: > Andy Grover wrote: >> Hi Vlad, >> >> This git tree has Jon Mason's initial iWarp rdma support. It also >> removes rds's seldom-used tcp transport, and moves everything to >> infiniband/ulp. 
>> >> Thanks -- Andy >> >> pull: www.openfabrics.org:/pub/scm/~agrover/ofed_1_4/linux-2.6.git >> code-drop/20081014 >> >> Andy Grover (4): >> RDS: fixup compilation errors & warnings >> RDS: Remove TCP transport support >> RDS: Move files from net/rds to drivers/infiniband/ulp/rds >> RDS: Alter build files for new RDS location >> >> Jon Mason (1): >> RDS: iWARP RDMA enablement >> >> _______________________________________________ >> rds-devel mailing list >> rds-devel at oss.oracle.com >> http://oss.oracle.com/mailman/listinfo/rds-devel >> > > Done, > Following your commits I did the following changes in the ofed git tree: > > RDS: include vmalloc.h to fix compilation issue on ppc64 and ia64 > > kernel_patches/backports: updated RDS backport patches due to > moved files from net/rds to drivers/infiniband/ulp/rds > > ofed_scripts: updated scripts due to: > moved files from net/rds to drivers/infiniband/ulp/rds > RDS: removed TCP transport support > > > Regards, > Vladimir Hi Andy, I tried to run RDS tests over Mellanox HCAs after the changes above and all tests failed. We are planning to release OFED-1.4-rc3 tomorrow. So, I have to revert all the commits above till the RDS issues will be solved. Please remember to check compilation on the OFA server after changes that you do to the ofed_kernel git tree. For details, see https://wiki.openfabrics.org/tiki-index.php?page=HOWTO+Build+OFA+kernel+package Regards, Vladimir From halr at obsidianresearch.com Wed Oct 15 07:23:04 2008 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Wed, 15 Oct 2008 08:23:04 -0600 Subject: [ofa-general] [PATCH] OpenSM release notes: Indicate InfiniScale-IV support Message-ID: <48F5FCC8.8070601@obsidianresearch.com> Sasha, Another update to the release notes. -- Hal -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: patch-osmrn2
URL:

From tziporet at mellanox.co.il Wed Oct 15 08:23:29 2008
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Wed, 15 Oct 2008 17:23:29 +0200
Subject: [ofa-general] OFED meeting agenda for today (Oct 15)
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EADB76E52@mtlexch01.mtl.com>

Agenda for OFED meeting today on OFED 1.4 status toward RC3:

1. OFED 1.4 status:
   - Moved to kernel 2.6.27
   - MPI: Open MPI 1.2.8 ready, MVAPICH2 and MVAPICH updated.
   - uDAPL: compat-dapl-1.2.11 and dapl-2.0.14 released
   - NFS-RDMA critical bug with NFS_RDMA on SLES 10 SP2 - reported it was fixed
     Suggestion: NFS-RDMA will not be installed by default.
     Reason: This is a new component that was not tested enough
   - OSM: Cached routing - almost done
   - RC3 schedule: Going to build it tomorrow morning and release it by end of Thursday
   - RDS: We see many changes in RDS that break functionality - I suggest delay them for
2. GA date: Need to decide if 29-Oct is still a valid day
3. Bugs review:
   1128 blo Othe stefan.roscher at de.ibm.com release IPoIB-CM QP resources in flushing CQE context
   1278 blo RHEL perkinjo at cse.ohio-state.edu Mvapich2 fails to compile when uDAPL option is specified
   1242 cri RHEL yannick.cote at qlogic.com kernel panic while running mpi2007 against ofed1.4 -- ib_...
   1257 cri All eli at mellanox.co.il Severe performance penalty for PCIe strict ordering
   1262 cri Othe andy.grover at oracle.com congestion hang with RDS
   1164 maj SLES yosefe at voltaire.com iperf over IPoIB fails for 100 tcp connections
   1221 maj SLES Jeffrey.C.Becker at nasa.gov SLES10 sp2: remote logins via ssh fail due to rpcbind and...
   1248 maj SLES monis at voltaire.com Bonding - after reboot the host stucks while raising the ...
4.
Open discussion Tziporet From arlin.r.davis at intel.com Wed Oct 15 09:44:52 2008 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Wed, 15 Oct 2008 09:44:52 -0700 Subject: [ofa-general] [PATCH][dat2.0][v2] Current static registration (SR) assumes DAT_OVERRIDE or /etc/dat.conf. Message-ID: Rev 2 patch with Doug's suggestions. Change SR to include sysconfdir. SR file access in the following order: - DAT_OVERRIDE - sysconfdir - /etc if DAT_OVERRIDE is set, assume administration override and do not failover to other locations. Add debug messages for each failure and retries. Signed-off-by: Arlin Davis Acked-by: Doug Ledford --- Makefile.am | 4 +- dat/udat/udat_sr_parser.c | 65 +++++++++++++++++++++++++++++++++++--------- 2 files changed, 53 insertions(+), 16 deletions(-) diff --git a/Makefile.am b/Makefile.am index 4929f83..bfc93f7 100755 --- a/Makefile.am +++ b/Makefile.am @@ -22,9 +22,9 @@ XPROGRAMS_SCM = endif if DEBUG -AM_CFLAGS = -g -Wall -D_GNU_SOURCE -DDAPL_DBG +AM_CFLAGS = -g -Wall -D_GNU_SOURCE -DDAPL_DBG -DDAT_CONF="\"$(sysconfdir)/dat.conf\"" else -AM_CFLAGS = -g -Wall -D_GNU_SOURCE +AM_CFLAGS = -g -Wall -D_GNU_SOURCE -DDAT_CONF="\"$(sysconfdir)/dat.conf\"" endif datlibdir = $(libdir) diff --git a/dat/udat/udat_sr_parser.c b/dat/udat/udat_sr_parser.c index 644e1c9..ddd96bf 100644 --- a/dat/udat/udat_sr_parser.c +++ b/dat/udat/udat_sr_parser.c @@ -285,21 +285,52 @@ dat_sr_load (void) char *sr_path; DAT_OS_FILE *sr_file; - sr_path = dat_os_getenv (DAT_SR_CONF_ENV); - if ( sr_path == NULL ) - { - sr_path = DAT_SR_CONF_DEFAULT; - } - - dat_os_dbg_print (DAT_OS_DBG_TYPE_SR, - "DAT Registry: static registry file <%s> \n", sr_path); + sr_path = dat_os_getenv(DAT_SR_CONF_ENV); - sr_file = dat_os_fopen (sr_path); - if ( sr_file == NULL ) + /* environment override */ + if ((sr_path != NULL) && ((sr_file = dat_os_fopen(sr_path)) == NULL)) { - goto bail; + dat_os_dbg_print(DAT_OS_DBG_TYPE_ERROR, + "DAT Registry: DAT_OVERRIDE, " + "bad filename - %s, aborting\n", 
+ sr_path); + goto bail; + } + + if (sr_path == NULL) { + +#ifdef DAT_CONF + sr_path = DAT_CONF; +#else + sr_path = DAT_SR_CONF_DEFAULT; +#endif + sr_file = dat_os_fopen (sr_path); + if (sr_file == NULL) + { +#ifdef DAT_CONF + dat_os_dbg_print(DAT_OS_DBG_TYPE_ERROR, + "DAT Registry: sysconfdir, " + "bad filename - %s, retry default at %s\n", + sr_path, DAT_SR_CONF_DEFAULT); + /* try default after sysconfdir fails */ + sr_path = DAT_SR_CONF_DEFAULT; + sr_file = dat_os_fopen(sr_path); + if (sr_file == NULL) { +#endif + dat_os_dbg_print(DAT_OS_DBG_TYPE_ERROR, + "DAT Registry: default, " + "bad filename - %s, aborting\n", + sr_path); + goto bail; +#ifdef DAT_CONF + } +#endif + } } + dat_os_dbg_print(DAT_OS_DBG_TYPE_GENERIC, + "DAT Registry: using config file %s\n", sr_path); + for (;;) { if ( DAT_SUCCESS == dat_sr_parse_eof (sr_file) ) @@ -312,20 +343,26 @@ dat_sr_load (void) } else { + dat_os_dbg_print(DAT_OS_DBG_TYPE_ERROR, + "DAT Registry: ERROR parsing - %s\n", + sr_path); goto cleanup; } } - if (0 != dat_os_fclose (sr_file)) + if (0 != dat_os_fclose (sr_file)) + { + dat_os_dbg_print(DAT_OS_DBG_TYPE_ERROR, + "DAT Registry: ERROR closing - %s\n", + sr_path); goto bail; + } return DAT_SUCCESS; cleanup: dat_os_fclose(sr_file); bail: - dat_os_dbg_print(DAT_OS_DBG_TYPE_ERROR, - "ERROR: unable to parse static registry file, dat.conf\n"); return DAT_INTERNAL_ERROR; } -- 1.5.2.5 From arlin.r.davis at intel.com Wed Oct 15 09:45:16 2008 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Wed, 15 Oct 2008 09:45:16 -0700 Subject: [ofa-general] [PATCH][dat1.2] Current static registration (SR) assumes DAT_OVERRIDE or /etc/dat.conf. Message-ID: Sysconfdir dat.conf patch for dat 1.2 package. Change SR to include sysconfdir. SR file access in the following order: - DAT_OVERRIDE - sysconfdir - /etc if DAT_OVERRIDE is set, assume administration override and do not failover to other locations. Add debug messages for each failure and retries. 
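[Annotation, not part of the original patch description] The lookup order described above can be sketched as a small standalone C function. This is only an illustration under stated assumptions: sr_open_conf() and its parameters are hypothetical names, plain fopen()/getenv() stand in for the dat_os_* wrappers, and the environment variable is assumed to be named DAT_OVERRIDE as in the description. It shows the one behaviour that is easy to miss: a set but unusable override aborts instead of falling through to the other locations.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not taken from the patch: open the static registry
 * file using the order
 *   1. $DAT_OVERRIDE  (no fallback if it is set but cannot be opened)
 *   2. sysconf_path   (for example "$(sysconfdir)/dat.conf")
 *   3. default_path   (for example "/etc/dat.conf")
 * On success *chosen points at the path that was opened; on failure
 * *chosen is left untouched and NULL is returned.
 */
FILE *sr_open_conf(const char *sysconf_path, const char *default_path,
                   const char **chosen)
{
    const char *path = getenv("DAT_OVERRIDE");
    FILE *f;

    if (path != NULL) {
        /* Administrator override: use it or give up. */
        f = fopen(path, "r");
        if (f == NULL) {
            fprintf(stderr, "DAT_OVERRIDE: cannot open %s, aborting\n", path);
            return NULL;
        }
        *chosen = path;
        return f;
    }

    /* No override: try the configured sysconfdir location first. */
    path = sysconf_path;
    f = fopen(path, "r");
    if (f == NULL) {
        fprintf(stderr, "sysconfdir: cannot open %s, retrying %s\n",
                path, default_path);
        path = default_path;
        f = fopen(path, "r");
    }
    if (f == NULL) {
        fprintf(stderr, "default: cannot open %s, aborting\n", path);
        return NULL;
    }
    *chosen = path;
    return f;
}
```

A caller would pass the two candidate paths, check the result for NULL, and log *chosen once before parsing, which mirrors the single "using config file" debug message the patch adds.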
Signed-off-by: Arlin Davis Acked-by: Doug Ledford --- Makefile.am | 4 +- dat/udat/udat_sr_parser.c | 65 +++++++++++++++++++++++++++++++++++--------- 2 files changed, 53 insertions(+), 16 deletions(-) diff --git a/Makefile.am b/Makefile.am index 5f39ece..dbf1611 100644 --- a/Makefile.am +++ b/Makefile.am @@ -12,9 +12,9 @@ OSFLAGS += -DREDHAT_EL5 endif if DEBUG -AM_CFLAGS = -g -Wall -D_GNU_SOURCE -DDAPL_DBG +AM_CFLAGS = -g -Wall -D_GNU_SOURCE -DDAPL_DBG -DDAT_CONF="\"$(sysconfdir)/dat.conf\"" else -AM_CFLAGS = -g -Wall -D_GNU_SOURCE +AM_CFLAGS = -g -Wall -D_GNU_SOURCE -DDAT_CONF="\"$(sysconfdir)/dat.conf\"" endif datlibdir = $(libdir) diff --git a/dat/udat/udat_sr_parser.c b/dat/udat/udat_sr_parser.c index aca4da5..f8f8a3b 100644 --- a/dat/udat/udat_sr_parser.c +++ b/dat/udat/udat_sr_parser.c @@ -281,21 +281,52 @@ dat_sr_load (void) char *sr_path; DAT_OS_FILE *sr_file; - sr_path = dat_os_getenv (DAT_SR_CONF_ENV); - if ( sr_path == NULL ) - { - sr_path = DAT_SR_CONF_DEFAULT; - } + sr_path = dat_os_getenv(DAT_SR_CONF_ENV); - dat_os_dbg_print (DAT_OS_DBG_TYPE_SR, - "DAT Registry: static registry file <%s> \n", sr_path); - - sr_file = dat_os_fopen (sr_path); - if ( sr_file == NULL ) + /* environment override */ + if ((sr_path != NULL) && ((sr_file = dat_os_fopen(sr_path)) == NULL)) { - goto bail; + dat_os_dbg_print(DAT_OS_DBG_TYPE_ERROR, + "DAT Registry: DAT_OVERRIDE, " + "bad filename - %s, aborting\n", + sr_path); + goto bail; + } + + if (sr_path == NULL) { + +#ifdef DAT_CONF + sr_path = DAT_CONF; +#else + sr_path = DAT_SR_CONF_DEFAULT; +#endif + sr_file = dat_os_fopen (sr_path); + if (sr_file == NULL) + { +#ifdef DAT_CONF + dat_os_dbg_print(DAT_OS_DBG_TYPE_ERROR, + "DAT Registry: sysconfdir, " + "bad filename - %s, retry default at %s\n", + sr_path, DAT_SR_CONF_DEFAULT); + /* try default after sysconfdir fails */ + sr_path = DAT_SR_CONF_DEFAULT; + sr_file = dat_os_fopen(sr_path); + if (sr_file == NULL) { +#endif + dat_os_dbg_print(DAT_OS_DBG_TYPE_ERROR, + "DAT 
Registry: default, " + "bad filename - %s, aborting\n", + sr_path); + goto bail; +#ifdef DAT_CONF + } +#endif + } } + dat_os_dbg_print(DAT_OS_DBG_TYPE_GENERIC, + "DAT Registry: using config file %s\n", sr_path); + for (;;) { if ( DAT_SUCCESS == dat_sr_parse_eof (sr_file) ) @@ -308,20 +339,26 @@ dat_sr_load (void) } else { + dat_os_dbg_print(DAT_OS_DBG_TYPE_ERROR, + "DAT Registry: ERROR parsing - %s\n", + sr_path); goto cleanup; } } - if (0 != dat_os_fclose (sr_file)) + if (0 != dat_os_fclose (sr_file)) + { + dat_os_dbg_print(DAT_OS_DBG_TYPE_ERROR, + "DAT Registry: ERROR closing - %s\n", + sr_path); goto bail; + } return DAT_SUCCESS; cleanup: dat_os_fclose(sr_file); bail: - dat_os_dbg_print(DAT_OS_DBG_TYPE_ERROR, - "ERROR: unable to parse static registry file, dat.conf\n"); return DAT_INTERNAL_ERROR; } -- 1.5.2.5 From andy.grover at oracle.com Wed Oct 15 11:15:20 2008 From: andy.grover at oracle.com (Andy Grover) Date: Wed, 15 Oct 2008 11:15:20 -0700 Subject: [ofa-general] Re: [rds-devel] Please pull RDS In-Reply-To: <48F5E3FB.1020900@dev.mellanox.co.il> References: <48F526F0.3060007@oracle.com> <48F5AF97.3040805@dev.mellanox.co.il> <48F5E3FB.1020900@dev.mellanox.co.il> Message-ID: <48F63338.5070002@oracle.com> Really sorry to waste your time, I'll resubmit once debugged and 1.4.1 opens up. Regards -- Andy Vladimir Sokolovsky wrote: > Vladimir Sokolovsky wrote: >> Andy Grover wrote: >>> Hi Vlad, >>> >>> This git tree has Jon Mason's initial iWarp rdma support. It also >>> removes rds's seldom-used tcp transport, and moves everything to >>> infiniband/ulp. 
>>> >>> Thanks -- Andy >>> >>> pull: www.openfabrics.org:/pub/scm/~agrover/ofed_1_4/linux-2.6.git >>> code-drop/20081014 >>> >>> Andy Grover (4): >>> RDS: fixup compilation errors & warnings >>> RDS: Remove TCP transport support >>> RDS: Move files from net/rds to drivers/infiniband/ulp/rds >>> RDS: Alter build files for new RDS location >>> >>> Jon Mason (1): >>> RDS: iWARP RDMA enablement >>> >>> _______________________________________________ >>> rds-devel mailing list >>> rds-devel at oss.oracle.com >>> http://oss.oracle.com/mailman/listinfo/rds-devel >>> >> >> Done, >> Following your commits I did the following changes in the ofed git tree: >> >> RDS: include vmalloc.h to fix compilation issue on ppc64 and ia64 >> >> kernel_patches/backports: updated RDS backport patches due to >> moved files from net/rds to drivers/infiniband/ulp/rds >> >> ofed_scripts: updated scripts due to: >> moved files from net/rds to drivers/infiniband/ulp/rds >> RDS: removed TCP transport support >> >> >> Regards, >> Vladimir > > Hi Andy, > I tried to run RDS tests over Mellanox HCAs after the changes above and > all tests failed. > We are planning to release OFED-1.4-rc3 tomorrow. > So, I have to revert all the commits above till the RDS issues will be > solved. > > Please remember to check compilation on the OFA server after changes > that you do to the ofed_kernel git tree. 
> For details, see > https://wiki.openfabrics.org/tiki-index.php?page=HOWTO+Build+OFA+kernel+package > > > Regards, > Vladimir From svenar at simula.no Wed Oct 15 14:21:04 2008 From: svenar at simula.no (Sven-Arne Reinemo) Date: Wed, 15 Oct 2008 23:21:04 +0200 Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer overflow In-Reply-To: <00cf01c92d63$e6ce45e0$b46ad1a0$@com> References: <20081008012149.GK7563@sashak.voltaire.com> <20081009190549.GQ4912@sashak.voltaire.com> <005901c92cb8$cf13dd80$6d3b9880$@com> <000001c92cc6$13b90890$3b2b19b0$@com> <00cf01c92d63$e6ce45e0$b46ad1a0$@com> Message-ID: <1224105664.11246.12.camel@localhost> On Mon, 2008-10-13 at 13:45 -0500, Robert Pearson wrote: > Thanks Hal, > > I grabbed the head of tree and recompiled with some print statements thrown > in. As you say the '<<' fixes the decoding correctly and ibsim is reporting > 4 not 1 so life is good and I see no problems in ibsim or opensm. > > I was looking at the binary init values for port info for switches and it > looked like the op_vl was set to 1 but when I ran opensm over it I saw 4 > which is the correct answer. The last point is moot in this case. > > I artificially relaxed the limit on the number of VLs in my case and found > that I needed 12 VLs to route my problem with lash so it is not going to > work since the switches only support 8 data vls. With regards to the number of layers required, LASH is sensitive to the port numbering used in regular topologies such as meshes and tori. If the cabling is consistent (0=W, 1=E, 2=N, 3=S) LASH requires 1 VL for 2d meshes of any size (equals dimension order routing) and 3 VLs for 2d tori of any size. For 3d tori it requires 5 VLs. This is observed based on simulations with IBMgtSim. I do not know if this is relevant for your problem, but thought I should mention it. 
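[Annotation, not part of the original message] The VL arithmetic discussed in this thread can be sketched in a few lines of C. This is a minimal illustration, not OpenSM code: decode_op_vls() and lash_fits() are hypothetical names, and the mapping assumes the usual PortInfo OperationalVLs encoding (1, 2, 3, 4, 5 meaning 1, 2, 4, 8, 15 VLs), which is what the "1 << (vl_min - 1)" fix-up with a clamp at 15 implements.

```c
/*
 * Hypothetical helpers for illustration only.
 *
 * decode_op_vls(): turn the encoded OperationalVLs value into a VL count.
 * A value of 0 is treated here as "no usable VLs" (an assumption of this
 * sketch), and anything from 5 upwards is clamped to 15.
 */
unsigned decode_op_vls(unsigned enc)
{
    unsigned n;

    if (enc == 0)
        return 0;
    if (enc >= 5)
        return 15;
    n = 1u << (enc - 1);    /* 1 -> 1, 2 -> 2, 3 -> 4, 4 -> 8 */
    return n;
}

/*
 * lash_fits(): nonzero when the number of lanes (layers) LASH needs does
 * not exceed the VLs available for the given encoded OperationalVLs.
 */
int lash_fits(unsigned lanes_needed, unsigned op_vls_enc)
{
    return lanes_needed <= decode_op_vls(op_vls_enc);
}
```

With these definitions, a requirement of 12 lanes does not fit on switches limited to 8 data VLs (encoded value 4), while the 3 or 5 lanes quoted for consistently cabled 2d and 3d tori would.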
Regards, Sven-Arne > > Regards, > > Bob > -----Original Message----- > From: Hal Rosenstock [mailto:hal.rosenstock at gmail.com] > Sent: Monday, October 13, 2008 11:58 AM > To: Robert Pearson > Cc: Sasha Khapyorsky; OpenIB > Subject: Re: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer > overflow > > On Sun, Oct 12, 2008 at 7:55 PM, Robert Pearson > wrote: > > I spent a little time looking at osm_ucast_lash.c and ibsim. > > > > It looks like ibsim reports vl_cap = 4 and op_vl = 1 by default for a > > switch. > > Yes, ibsim only has one canned template for this. I think it is mainly > that ibnetdiscover output format doesn't include this information > currently and something needs to be assumed. > > > Osm_ucast_lash.c computes the minimum over all switches of op_vl as > > extracted from the port info mads starting from (5 which would correspond > to > > > > It then uses the encoded value as though it was an integer instead of an > > encoding of an integer which seems wrong. > > osm_ucast_lash.c::discover_network_properties fixes this up properly: > vl_min = 1 << (vl_min - 1); > if (vl_min > 15) > vl_min = 15; > > > I am not yet sure when the SM is supposed to set the op_vl field away from > > 1. If later then you are using the wrong value and should be comparing to > > the decoded value of vl_cap instead. > > I don't understand what you mean here. vl_cap would never be right to be > used. > > -- Hal > > > -----Original Message----- > > From: general-bounces at lists.openfabrics.org > > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Robert Pearson > > Sent: Sunday, October 12, 2008 5:21 PM > > To: 'Sasha Khapyorsky'; 'Hal Rosenstock' > > Cc: 'OpenIB' > > Subject: RE: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer > > overflow > > > > How does lash know how many VLs are available? Especially, with ibsim. > > Is there a way to have lash report the number of VLs required independent > of > > the type of switch used? 
> > > > -----Original Message----- > > From: general-bounces at lists.openfabrics.org > > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of Sasha > Khapyorsky > > Sent: Thursday, October 09, 2008 2:06 PM > > To: Hal Rosenstock > > Cc: OpenIB > > Subject: Re: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix buffer > > overflow > > > > On 07:04 Wed 08 Oct , Hal Rosenstock wrote: > >> > >> Minor simplification as it seems like this could just be: > >> > >> if (++lanes_needed > p_lash->vl_min) > >> goto Error_Not_Enough_Lanes; > > > > Works for me. Thanks! > > > > Sasha > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From kliteyn at dev.mellanox.co.il Wed Oct 15 14:29:29 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 15 Oct 2008 23:29:29 +0200 Subject: [ofa-general] [PATCH 0/6 v2] opensm: Unicast Routing Cache In-Reply-To: <48F13DEC.2030109@dev.mellanox.co.il> References: <48E96928.8030200@dev.mellanox.co.il> <20081009171103.GF4912@sashak.voltaire.com> <48EEB00E.7000209@dev.mellanox.co.il> <20081010082827.GX4912@sashak.voltaire.com> <48F13DEC.2030109@dev.mellanox.co.il> Message-ID: <48F660B9.5010804@dev.mellanox.co.il> Hi Sasha, I'm sending v2 of the patches: - patch 1/6: move lft_buf from 
ucast_mgr to osm_switch No change whatsoever, just rebased to the new master - patch 2/6: Add "-A" or "--ucast_cache" option to opensm No functional change, changed appearance of the help message to match the recent OpenSM help changes. - patch 3/6: adding osm_ucast_cache.{c,h} files. + All your patches integrated, many small fixes, including all the fixes from the review. + Port array allocation reworked to save many reallocations + Better malloc() handling: checking for returning NULL, no casting after malloc(). + Back-2-back connections - cache is now disabled when there are no switches, but it also takes care of the case when SM port is disconnected and then connected with back-2-back link. Besides adding check for zero switches in fabric, I found only one back-2-back connection check that could be removed - in function osm_ucast_cache_add_node(). - patch 4/6: adding new cache files to makefile. No changes, just rebased. - patch 5/6: integrating unicast cache into the discovery and ucast manager. This patch includes your changes (having ucast_cache_process function that does all the job instead of integrating into osm_ucast_mgr_process). Also, the cache is now a part of the ucast manager struct, which simplifies a code and saves some functions. - patch 6/6: man entry for cached routing. No changes, just rebased. The job that still needs to be done: - Check how the cache handles port moving during discovery. Might be a bug there. - Check how unicast manager handles fast reset of switches. AFAIK SM will now write the LFT there - need to fix it (unrelated to cache, general ucast mgr issue) - Optimize LFT usage - simplify current switch LFT, hold two LFTs (current and cached) only when these LFTs are not identical. -- Yevgeny Yevgeny Kliteynik wrote: > Hi Sasha, > > Sasha Khapyorsky wrote: >> Hi Yevgeny, >> >> On 03:29 Fri 10 Oct , Yevgeny Kliteynik wrote: >>> Thanks for the review and the patches. 
Didn't manage to address >>> all your comments yet - will do it tomorrow. >>> One question though: how to deal with the incremental patches that >>> you sent me? Should I apply them to my branch and then issue one >>> V2 patch instead of the old one, or will you apply the original >>> patch, followed by all the incremental (yours and mine)? >> >> It is up to you. You can merge all in single V2 (guess it is simpler) > > I'll send a v2 patches for 3/6 and 5/6 when I'm done fixing all the stuff. > > -- Yevgeny > >> or leave it unchanged and I will apply later. Except integration patch >> others are not critical IMHO. >> >> Sasha >> > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > From kliteyn at dev.mellanox.co.il Wed Oct 15 14:34:58 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 15 Oct 2008 23:34:58 +0200 Subject: [ofa-general] [PATCH 1/6 v2] opensm/Unicast Routing Cache: move lft_buf from ucast_mgr to osm_switch Message-ID: <48F66202.40908@dev.mellanox.co.il> Instead of having single lft_buf in ucast_mgr, each switch will hold lft_buf which is the LFT that was calculated by the recent routing engine execution. 
Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_switch.h | 7 ++++++- opensm/include/opensm/osm_ucast_mgr.h | 6 +----- opensm/opensm/osm_switch.c | 12 +++++++++++- opensm/opensm/osm_ucast_file.c | 5 +++-- opensm/opensm/osm_ucast_ftree.c | 2 +- opensm/opensm/osm_ucast_lash.c | 8 ++++---- opensm/opensm/osm_ucast_mgr.c | 18 +++++------------- 7 files changed, 31 insertions(+), 27 deletions(-) diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h index 1c0f6e9..3d9a72d 100644 --- a/opensm/include/opensm/osm_switch.h +++ b/opensm/include/opensm/osm_switch.h @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -102,6 +102,7 @@ typedef struct osm_switch { uint8_t **hops; osm_port_profile_t *p_prof; osm_fwd_tbl_t fwd_tbl; + uint8_t *lft_buf; osm_mcast_tbl_t mcast_tbl; uint32_t discovery_count; unsigned need_update; @@ -137,6 +138,10 @@ typedef struct osm_switch { * fwd_tbl * This switch's forwarding table. * +* lft_buf +* This switch's linear forwarding table, as was +* calculated by the last routing engine execution. +* * mcast_tbl * Multicast forwarding table for this switch. * diff --git a/opensm/include/opensm/osm_ucast_mgr.h b/opensm/include/opensm/osm_ucast_mgr.h index 12be97a..27e89e9 100644 --- a/opensm/include/opensm/osm_ucast_mgr.h +++ b/opensm/include/opensm/osm_ucast_mgr.h @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* * This software is available to you under a choice of one of two @@ -97,7 +97,6 @@ typedef struct osm_ucast_mgr { cl_qlist_t port_order_list; boolean_t is_dor; boolean_t some_hop_count_set; - uint8_t *lft_buf; } osm_ucast_mgr_t; /* * FIELDS @@ -129,9 +128,6 @@ typedef struct osm_ucast_mgr { * tables calculation iteration cycle, set to TRUE to indicate * that some hop count changes were done. * -* lft_buf -* LFT buffer - used during LFT calculation/setup. -* * SEE ALSO * Unicast Manager object *********/ diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c index 77ef61e..9bf76e0 100644 --- a/opensm/opensm/osm_switch.c +++ b/opensm/opensm/osm_switch.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -101,6 +101,13 @@ osm_switch_init(IN osm_switch_t * const p_sw, if (status != IB_SUCCESS) goto Exit; + p_sw->lft_buf = malloc(IB_LID_UCAST_END_HO + 1); + if (!p_sw->lft_buf) { + status = IB_INSUFFICIENT_MEMORY; + goto Exit; + } + memset(p_sw->lft_buf, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); + p_sw->p_prof = malloc(sizeof(*p_sw->p_prof) * num_ports); if (p_sw->p_prof == NULL) { status = IB_INSUFFICIENT_MEMORY; @@ -132,6 +139,8 @@ void osm_switch_delete(IN OUT osm_switch_t ** const pp_sw) osm_mcast_tbl_destroy(&p_sw->mcast_tbl); free(p_sw->p_prof); osm_fwd_tbl_destroy(&p_sw->fwd_tbl); + if (p_sw->lft_buf) + free(p_sw->lft_buf); if (p_sw->hops) { for (i = 0; i < p_sw->num_hops; i++) if (p_sw->hops[i]) @@ -537,6 +546,7 @@ osm_switch_prepare_path_rebuild(IN osm_switch_t * p_sw, IN uint16_t max_lids) osm_port_prof_construct(&p_sw->p_prof[i]); osm_switch_clear_hops(p_sw); + memset(p_sw->lft_buf, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); if 
(!p_sw->hops) { hops = malloc((max_lids + 1) * sizeof(hops[0])); diff --git a/opensm/opensm/osm_ucast_file.c b/opensm/opensm/osm_ucast_file.c index cbd65c1..a6edf5d 100644 --- a/opensm/opensm/osm_ucast_file.c +++ b/opensm/opensm/osm_ucast_file.c @@ -1,5 +1,6 @@ /* * Copyright (c) 2006,2007 Voltaire, Inc. All rights reserved. + * Copyright (c) 2008 Mellanox Technologies LTD. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -91,7 +92,7 @@ static void add_path(osm_opensm_t * p_osm, new_lid); } - p_osm->sm.ucast_mgr.lft_buf[new_lid] = port_num; + p_sw->lft_buf[new_lid] = port_num; if (!(p_osm->subn.opt.port_profile_switch_nodes && port_guid && osm_get_switch_by_guid(&p_osm->subn, port_guid))) osm_switch_count_path(p_sw, port_num); @@ -195,7 +196,7 @@ static int do_ucast_file_load(void *context) cl_ntoh64(sw_guid)); continue; } - memset(p_osm->sm.ucast_mgr.lft_buf, OSM_NO_PATH, + memset(p_sw->lft_buf, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); } else if (p_sw && !strncmp(p, "0x", 2)) { p += 2; diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c index 15168b7..35a6a1c 100644 --- a/opensm/opensm/osm_ucast_ftree.c +++ b/opensm/opensm/osm_ucast_ftree.c @@ -1945,7 +1945,7 @@ static void __osm_ftree_set_sw_fwd_table(IN cl_map_item_t * const p_map_item, p_sw->p_osm_sw->max_lid_ho = p_ftree->lft_max_lid_ho; - memcpy(p_ftree->p_osm->sm.ucast_mgr.lft_buf, p_sw->lft_buf, lft_len); + memcpy(p_sw->p_osm_sw->lft_buf, p_sw->lft_buf, lft_len); osm_ucast_mgr_set_fwd_table(&p_ftree->p_osm->sm.ucast_mgr, p_sw->p_osm_sw); } diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index 929c288..c7dbade 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. 
+ * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2007 Simula Research Laboratory. All rights reserved. * Copyright (c) 2007 Silicon Graphics Inc. All rights reserved. @@ -1064,7 +1064,7 @@ static void populate_fwd_tbls(lash_t * p_lash) current_guid = p_sw->p_node->node_info.port_guid; sw = p_sw->priv; - memset(p_osm->sm.ucast_mgr.lft_buf, 0xff, + memset(p_sw->lft_buf, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); for (lid = 1; lid <= max_lid_ho; lid++) { @@ -1076,7 +1076,7 @@ static void populate_fwd_tbls(lash_t * p_lash) if (p_dst_sw == p_sw) { uint8_t egress_port = port->p_node->sw ? 0 : port->p_physp->p_remote_physp->port_num; - p_osm->sm.ucast_mgr.lft_buf[lid] = egress_port; + p_sw->lft_buf[lid] = egress_port; OSM_LOG(p_log, OSM_LOG_VERBOSE, "LASH fwd MY SRC SRC GUID 0x%016" PRIx64 " src lash id (%d), src lid no (%u) src lash port (%d) " @@ -1096,7 +1096,7 @@ static void populate_fwd_tbls(lash_t * p_lash) virtual_physical_port_table [lash_egress_port]; - p_osm->sm.ucast_mgr.lft_buf[lid] = + p_sw->lft_buf[lid] = physical_egress_port; OSM_LOG(p_log, OSM_LOG_VERBOSE, "LASH fwd SRC GUID 0x%016" PRIx64 diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index be8e724..2dc5dd4 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
* * This software is available to you under a choice of one of two @@ -73,10 +73,6 @@ void osm_ucast_mgr_destroy(IN osm_ucast_mgr_t * const p_mgr) CL_ASSERT(p_mgr); OSM_LOG_ENTER(p_mgr->p_log); - - if (p_mgr->lft_buf) - free(p_mgr->lft_buf); - OSM_LOG_EXIT(p_mgr->p_log); } @@ -96,10 +92,6 @@ osm_ucast_mgr_init(IN osm_ucast_mgr_t * const p_mgr, IN osm_sm_t * sm) p_mgr->p_subn = sm->p_subn; p_mgr->p_lock = sm->p_lock; - p_mgr->lft_buf = malloc(IB_LID_UCAST_END_HO + 1); - if (!p_mgr->lft_buf) - return IB_INSUFFICIENT_MEMORY; - OSM_LOG_EXIT(p_mgr->p_log); return (status); } @@ -297,7 +289,7 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr, We have selected the port for this LID. Write it to the forwarding tables. */ - p_mgr->lft_buf[lid_ho] = port; + p_sw->lft_buf[lid_ho] = port; if (!is_ignored_by_port_prof) { struct osm_remote_node *rem_node_used; osm_switch_count_path(p_sw, port); @@ -397,14 +389,14 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr, osm_switch_get_fwd_tbl_block(p_sw, block_id_ho, block); block_id_ho++) { if (!p_sw->need_update && - !memcmp(block, p_mgr->lft_buf + block_id_ho * 64, 64)) + !memcmp(block, p_sw->lft_buf + block_id_ho * 64, 64)) continue; OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, "Writing FT block %u\n", block_id_ho); status = osm_req_set(p_mgr->sm, p_path, - p_mgr->lft_buf + block_id_ho * 64, + p_sw->lft_buf + block_id_ho * 64, sizeof(block), IB_MAD_ATTR_LIN_FWD_TBL, cl_hton32(block_id_ho), @@ -481,7 +473,7 @@ __osm_ucast_mgr_process_tbl(IN cl_map_item_t * const p_map_item, cl_ntoh64(osm_node_get_node_guid(p_sw->p_node))); /* Initialize LIDs in buffer to invalid port number. 
*/ - memset(p_mgr->lft_buf, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); + memset(p_sw->lft_buf, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); if (p_mgr->p_subn->opt.lmc) alloc_ports_priv(p_mgr); -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Wed Oct 15 14:35:09 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 15 Oct 2008 23:35:09 +0200 Subject: [ofa-general] [PATCH 2/6 v2] opensm/Unicast Routing Cache: add -A / --ucast_cache option Message-ID: <48F6620D.2090809@dev.mellanox.co.il> Add "-A" or "--ucast_cache" option to opensm. Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_subnet.h | 6 +++++- opensm/opensm/main.c | 19 +++++++++++++++++-- opensm/opensm/osm_subnet.c | 11 ++++++++++- 3 files changed, 32 insertions(+), 4 deletions(-) diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index 0c7f3b9..1ee6362 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. * @@ -183,6 +183,7 @@ typedef struct osm_subn_opt { boolean_t port_profile_switch_nodes; boolean_t sweep_on_trap; char *routing_engine_names; + boolean_t use_ucast_cache; boolean_t connect_roots; char *lid_matrix_dump_file; char *lfts_file; @@ -361,6 +362,9 @@ typedef struct osm_subn_opt { * up/down routing engine (even if this violates "pure" deadlock * free up/down algorithm) * +* use_ucast_cache +* When TRUE enables unicast routing cache. 
+* * lid_matrix_dump_file * Name of the lid matrix dump file from where switch * lid matrices (min hops tables) will be loaded diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index 81b0a01..89c98fa 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two @@ -180,6 +180,15 @@ static void show_usage(void) " and in this way be IBA compliant. In many cases,\n" " this can violate \"pure\" deadlock free algorithm, so\n" " use it carefully.\n\n"); + printf("--ucast_cache, -A\n" + " This option enables unicast routing cache to prevent\n" + " routing recalculation (which is a heavy task in a\n" + " large cluster) when there was no topology change\n" + " detected during the heavy sweep, or when the topology\n" + " change does not require new routing calculation,\n" + " e.g. 
in case of host reboot.\n" + " This option becomes very handy when the cluster size\n" + " is thousands of nodes.\n\n"); printf("--lid_matrix_file, -M \n" " This option specifies the name of the lid matrix dump file\n" " from where switch lid matrices (min hops tables will be\n" @@ -516,7 +525,7 @@ int main(int argc, char *argv[]) uint32_t val; unsigned config_file_done = 0; const char *const short_option = - "F:c:i:f:ed:D:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:NBIQvVhoryxp:n:q:k:C:"; + "F:c:i:f:ed:D:g:l:L:s:t:a:u:m:X:R:zM:U:S:P:Y:ANBIQvVhoryxp:n:q:k:C:"; /* In the array below, the 2nd parameter specifies the number @@ -553,6 +562,7 @@ int main(int argc, char *argv[]) {"priority", 1, NULL, 'p'}, {"smkey", 1, NULL, 'k'}, {"routing_engine", 1, NULL, 'R'}, + {"ucast_cache", 0, NULL, 'A'}, {"connect_roots", 0, NULL, 'z'}, {"lid_matrix_file", 1, NULL, 'M'}, {"lfts_file", 1, NULL, 'U'}, @@ -832,6 +842,11 @@ int main(int argc, char *argv[]) printf(" Connect roots option is on\n"); break; + case 'A': + opt.use_ucast_cache = TRUE; + printf(" Unicast routing cache option is on\n"); + break; + case 'M': opt.lid_matrix_dump_file = optarg; printf(" Lid matrix dump file is \'%s\'\n", optarg); diff --git a/opensm/opensm/osm_subnet.c b/opensm/opensm/osm_subnet.c index a39ce75..63c111c 100644 --- a/opensm/opensm/osm_subnet.c +++ b/opensm/opensm/osm_subnet.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. + * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved. 
* @@ -442,6 +442,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * const p_opt) p_opt->port_prof_ignore_file = NULL; p_opt->port_profile_switch_nodes = FALSE; p_opt->sweep_on_trap = TRUE; + p_opt->use_ucast_cache = FALSE; p_opt->routing_engine_names = NULL; p_opt->connect_roots = FALSE; p_opt->lid_matrix_dump_file = NULL; @@ -1269,6 +1270,9 @@ int osm_subn_parse_conf_file(char *file_name, osm_subn_opt_t * const p_opts) opts_unpack_boolean("connect_roots", p_key, p_val, &p_opts->connect_roots); + opts_unpack_boolean("use_ucast_cache", + p_key, p_val, &p_opts->use_ucast_cache); + opts_unpack_charp("log_file", p_key, p_val, &p_opts->log_file); opts_unpack_uint32("log_max_size", @@ -1534,6 +1538,11 @@ int osm_subn_write_conf_file(char *file_name, IN osm_subn_opt_t *const p_opts) p_opts->connect_roots ? "TRUE" : "FALSE"); fprintf(opts_file, + "# Use unicast routing cache (use FALSE if unsure)\n" + "use_ucast_cache %s\n\n", + p_opts->use_ucast_cache ? "TRUE" : "FALSE"); + + fprintf(opts_file, "# Lid matrix dump file name\n" "lid_matrix_dump_file %s\n\n", p_opts->lid_matrix_dump_file ? p_opts->lid_matrix_dump_file : null_str); -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Wed Oct 15 15:00:34 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 16 Oct 2008 00:00:34 +0200 Subject: [ofa-general] [PATCH 3/6 v2] opensm/Unicast Routing Cache: add osm_ucast_cache.{c, h} files Message-ID: <48F66802.8080500@dev.mellanox.co.il> Implementation of the Unicast Routing Cache. The Unicast Manager stores all the links and ports that went down in order to avoid routing recalculation whenever possible. There is a map of the switches that went down or have one or more ports that went down. For ports, the only thing that is cached is connectivity (remote side lid and type). If the whole switch went down/disconnected, the ucast cache also stores all its unicast routing information (lft, max_lid, lid matrices).
This information is restored to the switch when it comes back online. When a new link/node is spotted during the discovery process, the ucast cache checks whether it was previously cached or is a brand new link/node, and decides whether a new routing calculation is needed. Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_ucast_cache.h | 249 +++++++ opensm/opensm/osm_ucast_cache.c | 1131 +++++++++++++++++++++++++++++++ 2 files changed, 1380 insertions(+), 0 deletions(-) create mode 100644 opensm/include/opensm/osm_ucast_cache.h create mode 100644 opensm/opensm/osm_ucast_cache.c diff --git a/opensm/include/opensm/osm_ucast_cache.h b/opensm/include/opensm/osm_ucast_cache.h new file mode 100644 index 0000000..ce77b89 --- /dev/null +++ b/opensm/include/opensm/osm_ucast_cache.h @@ -0,0 +1,249 @@ +/* + * Copyright (c) 2008 Mellanox Technologies LTD. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +/* + * Abstract: + * Header file that describes Unicast Cache functions. + * + * Environment: + * Linux User Mode + * + * $Revision: 1.4 $ + */ + +#ifndef _OSM_UCAST_CACHE_H_ +#define _OSM_UCAST_CACHE_H_ + +#include +#include +#include + +#ifdef __cplusplus +# define BEGIN_C_DECLS extern "C" { +# define END_C_DECLS } +#else /* !__cplusplus */ +# define BEGIN_C_DECLS +# define END_C_DECLS +#endif /* __cplusplus */ + +BEGIN_C_DECLS + +struct osm_ucast_mgr; + +/****h* OpenSM/Unicast Manager/Unicast Cache +* NAME +* Unicast Cache +* +* DESCRIPTION +* The Unicast Cache object encapsulates the information +* needed to cache and write unicast routing of the subnet. +* +* The Unicast Cache object is NOT thread safe. +* +* This object should be treated as opaque and should be +* manipulated only through the provided functions. +* +* AUTHOR +* Yevgeny Kliteynik, Mellanox +* +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_invalidate +* NAME +* osm_ucast_cache_invalidate +* +* DESCRIPTION +* The osm_ucast_cache_invalidate function purges the +* unicast cache and marks the cache as invalid. +* +* SYNOPSIS +*/ +void +osm_ucast_cache_invalidate(struct osm_ucast_mgr * p_mgr); +/* +* PARAMETERS +* p_mgr +* [in] Pointer to the ucast mgr object. +* +* RETURN VALUE +* This function does not return any value. +* +* NOTES +* +* SEE ALSO +* Unicast Manager object +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_check_new_link +* NAME +* osm_ucast_cache_check_new_link +* +* DESCRIPTION +* The osm_ucast_cache_check_new_link checks whether +* the newly discovered link still allows us to use +* cached unicast routing. 
+* +* SYNOPSIS +*/ +void +osm_ucast_cache_check_new_link(struct osm_ucast_mgr * p_mgr, + osm_node_t * p_node_1, + uint8_t port_num_1, + osm_node_t * p_node_2, + uint8_t port_num_2); +/* +* PARAMETERS +* p_mgr +* [in] Pointer to the unicast manager object. +* +* physp1 +* [in] Pointer to the first physical port of the link. +* +* physp2 +* [in] Pointer to the second physical port of the link. +* +* RETURN VALUE +* This function does not return any value. +* +* NOTES +* The function checks whether the link was previously +* cached/dropped or is this a completely new link. +* If it decides that the new link makes cached routing +* invalid, the cache is purged and marked as invalid. +* +* SEE ALSO +* Unicast Cache object +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_add_link +* NAME +* osm_ucast_cache_add_link +* +* DESCRIPTION +* The osm_ucast_cache_add_link adds link to the cache. +* +* SYNOPSIS +*/ +void +osm_ucast_cache_add_link(struct osm_ucast_mgr * p_mgr, + osm_physp_t * physp1, + osm_physp_t * physp2); +/* +* PARAMETERS +* p_mgr +* [in] Pointer to the unicast manager object. +* +* physp1 +* [in] Pointer to the first physical port of the link. +* +* physp2 +* [in] Pointer to the second physical port of the link. +* +* RETURN VALUE +* This function does not return any value. +* +* NOTES +* Since the cache operates with ports and not links, +* the function adds two port entries (both sides of the +* link) to the cache. +* If it decides that the dropped link makes cached routing +* invalid, the cache is purged and marked as invalid. +* +* SEE ALSO +* Unicast Manager object +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_add_node +* NAME +* osm_ucast_cache_add_node +* +* DESCRIPTION +* The osm_ucast_cache_add_node adds node and all +* its links to the cache. +* +* SYNOPSIS +*/ +void +osm_ucast_cache_add_node(struct osm_ucast_mgr * p_mgr, + osm_node_t * p_node); +/* +* PARAMETERS +* p_mgr +* [in] Pointer to the unicast manager object. 
+* +* p_node +* [in] Pointer to the node object that should be cached. +* +* RETURN VALUE +* This function does not return any value. +* +* NOTES +* If the function decides that the dropped node makes cached +* routing invalid, the cache is purged and marked as invalid. +* +* SEE ALSO +* Unicast Manager object +*********/ + +/****f* OpenSM: Unicast Cache/osm_ucast_cache_process +* NAME +* osm_ucast_cache_process +* +* DESCRIPTION +* The osm_ucast_cache_process function writes the +* cached unicast routing on the subnet switches. +* +* SYNOPSIS +*/ +int +osm_ucast_cache_process(struct osm_ucast_mgr * p_mgr); +/* +* PARAMETERS +* p_mgr +* [in] Pointer to the unicast manager object. +* +* RETURN VALUE +* This function returns zero on success and non-zero +* value otherwise. +* +* NOTES +* Iterates through all the subnet switches and writes +* the LFTs that were calculated during the last routing +* engine execution to the switches. +* +* SEE ALSO +* Unicast Manager object +*********/ + +END_C_DECLS +#endif /* _OSM_UCAST_CACHE_H_ */ diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c new file mode 100644 index 0000000..a54dbf9 --- /dev/null +++ b/opensm/opensm/osm_ucast_cache.c @@ -0,0 +1,1131 @@ +/* + * Copyright (c) 2008 Mellanox Technologies LTD. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer.
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +/* + * Abstract: + * Implementation of OpenSM Cached Unicast Routing + * + * Environment: + * Linux User Mode + * + */ + +#if HAVE_CONFIG_H +# include +#endif + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define CACHE_SW_PORTS 36 + +typedef struct cache_port { + boolean_t is_leaf; + uint16_t remote_lid_ho; +} cache_port_t; + +typedef struct cache_switch { + cl_map_item_t map_item; + boolean_t dropped; + uint16_t max_lid_ho; + uint8_t num_ports; + cache_port_t * ports; + uint16_t num_hops; + uint8_t ** hops; + uint8_t * lft; +} cache_switch_t; + +/********************************************************************** + **********************************************************************/ + +static uint16_t +__cache_sw_get_base_lid_ho(cache_switch_t * p_sw) +{ + return p_sw->ports[0].remote_lid_ho; +} + +/********************************************************************** + **********************************************************************/ + +static boolean_t +__cache_sw_is_leaf(cache_switch_t * p_sw) +{ + return p_sw->ports[0].is_leaf; +} + +/********************************************************************** + 
**********************************************************************/ + +static void +__cache_sw_set_leaf(cache_switch_t * p_sw) +{ + p_sw->ports[0].is_leaf = TRUE; +} + +/********************************************************************** + **********************************************************************/ + +static cache_switch_t * +__cache_sw_new(uint16_t lid_ho) +{ + cache_switch_t * p_cache_sw = malloc(sizeof(cache_switch_t)); + if (!p_cache_sw) + return NULL; + + memset(p_cache_sw, 0, sizeof(*p_cache_sw)); + + p_cache_sw->ports = + malloc(sizeof(cache_port_t) * (CACHE_SW_PORTS + 1)); + if (!p_cache_sw->ports) { + free(p_cache_sw); + return NULL; + } + + memset(p_cache_sw->ports, 0, sizeof(*p_cache_sw->ports)); + p_cache_sw->num_ports = CACHE_SW_PORTS + 1; + + /* port[0] fields represent this switch details - lid and type */ + p_cache_sw->ports[0].remote_lid_ho = lid_ho; + p_cache_sw->ports[0].is_leaf = FALSE; + + return p_cache_sw; +} + +/********************************************************************** + **********************************************************************/ + +static void +__cache_sw_destroy(cache_switch_t * p_sw) +{ + if (!p_sw) + return; + + if (p_sw->lft) + free(p_sw->lft); + if (p_sw->hops) + free(p_sw->hops); + if (p_sw->ports) + free(p_sw->ports); + free(p_sw); +} + +/********************************************************************** + **********************************************************************/ + +static cache_switch_t * +__cache_get_sw(osm_ucast_mgr_t * p_mgr, uint16_t lid_ho) +{ + cache_switch_t * p_cache_sw = (cache_switch_t *) + cl_qmap_get(&p_mgr->cache_sw_tbl, lid_ho); + if (p_cache_sw == (cache_switch_t *) + cl_qmap_end(&p_mgr->cache_sw_tbl)) + p_cache_sw = NULL; + + return p_cache_sw; +} + +/********************************************************************** + **********************************************************************/ + +static cache_switch_t * 
+__cache_get_or_add_sw(osm_ucast_mgr_t * p_mgr, uint16_t lid_ho) +{ + cache_switch_t * p_cache_sw = __cache_get_sw(p_mgr, lid_ho); + if (!p_cache_sw) { + p_cache_sw = __cache_sw_new(lid_ho); + if (p_cache_sw) + cl_qmap_insert(&p_mgr->cache_sw_tbl, lid_ho, + &p_cache_sw->map_item); + } + return p_cache_sw; +} + +/********************************************************************** + **********************************************************************/ + +static void +__cache_add_port(osm_ucast_mgr_t * p_mgr, + uint16_t lid_ho, + uint8_t port_num, + uint16_t remote_lid_ho, + boolean_t is_ca) +{ + cache_switch_t * p_cache_sw; + + OSM_LOG_ENTER(p_mgr->p_log); + + if (!lid_ho || !remote_lid_ho || !port_num) + goto Exit; + + OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, + "Caching switch port: lid %u [port %u] -> lid %u (%s)\n", + lid_ho, port_num, remote_lid_ho, + (is_ca)? "CA/RTR" : "SW"); + + p_cache_sw = __cache_get_or_add_sw(p_mgr, lid_ho); + if (!p_cache_sw) { + OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, + "ERR AD01: Out of memory - cache is invalid\n"); + osm_ucast_cache_invalidate(p_mgr); + goto Exit; + } + + if (port_num >= p_cache_sw->num_ports) { + /* calculate new size of ports array, rounded + up to a multiple of CACHE_SW_PORTS */ + uint8_t new_size = CACHE_SW_PORTS * + ((port_num + CACHE_SW_PORTS) / CACHE_SW_PORTS); + cache_port_t * ports = + malloc(sizeof(cache_port_t)*(new_size+1)); + if (!ports) { + OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, + "ERR AD02: Out of memory - cache is invalid\n"); + osm_ucast_cache_invalidate(p_mgr); + goto Exit; + } + + memset(ports, 0, sizeof(*ports)); + + if (p_cache_sw->ports) { + memcpy(ports, p_cache_sw->ports, + sizeof(*p_cache_sw->ports)); + free(p_cache_sw->ports); + } + + p_cache_sw->ports = ports; + p_cache_sw->num_ports = new_size + 1; + } + + if (is_ca) + __cache_sw_set_leaf(p_cache_sw); + + if (p_cache_sw->ports[port_num].remote_lid_ho == 0) { + /* cache this link only if it hasn't been already cached */ + 
p_cache_sw->ports[port_num].remote_lid_ho = remote_lid_ho; + p_cache_sw->ports[port_num].is_leaf = is_ca; + } +Exit: + OSM_LOG_EXIT(p_mgr->p_log); +} + +/********************************************************************** + **********************************************************************/ + +static void +__cache_cleanup_switches(osm_ucast_mgr_t * p_mgr) +{ + cache_switch_t * p_sw; + cache_switch_t * p_next_sw; + unsigned port_num; + boolean_t found_port; + + if (!p_mgr->cache_valid) + return; + + p_next_sw = (cache_switch_t *) cl_qmap_head(&p_mgr->cache_sw_tbl); + while (p_next_sw != (cache_switch_t *) cl_qmap_end(&p_mgr->cache_sw_tbl)) { + p_sw = p_next_sw; + p_next_sw = (cache_switch_t *) cl_qmap_next(&p_sw->map_item); + + found_port = FALSE; + for (port_num = 1; port_num < p_sw->num_ports; port_num++) + if (p_sw->ports[port_num].remote_lid_ho) + found_port = TRUE; + + if (!found_port) { + cl_qmap_remove_item(&p_mgr->cache_sw_tbl, &p_sw->map_item); + __cache_sw_destroy(p_sw); + } + } +} + +/********************************************************************** + **********************************************************************/ + +static void +__cache_check_link_change(osm_ucast_mgr_t * p_mgr, + osm_physp_t * p_physp_1, + osm_physp_t * p_physp_2) +{ + OSM_LOG_ENTER(p_mgr->p_log); + CL_ASSERT(p_physp_1 && p_physp_2); + + if (!p_mgr->cache_valid) + goto Exit; + + if (!p_physp_1->p_remote_physp && !p_physp_2->p_remote_physp) + /* both ports were down - new link */ + goto Exit; + + /* unicast cache cannot tolerate any link location change */ + + if ((p_physp_1->p_remote_physp && + p_physp_1->p_remote_physp->p_remote_physp) || + (p_physp_2->p_remote_physp && + p_physp_2->p_remote_physp->p_remote_physp)) { + OSM_LOG(p_mgr->p_log, OSM_LOG_INFO, + "Link location change discovered - cache is invalid\n"); + osm_ucast_cache_invalidate(p_mgr); + goto Exit; + } +Exit: + OSM_LOG_EXIT(p_mgr->p_log); +} + 
+/********************************************************************** + **********************************************************************/ + +static void +__cache_remove_port(osm_ucast_mgr_t * p_mgr, + uint16_t lid_ho, + uint8_t port_num, + uint16_t remote_lid_ho, + boolean_t is_ca) +{ + cache_switch_t * p_cache_sw; + + OSM_LOG_ENTER(p_mgr->p_log); + + if (!p_mgr->cache_valid) + goto Exit; + + p_cache_sw = __cache_get_sw(p_mgr, lid_ho); + if (!p_cache_sw) { + OSM_LOG(p_mgr->p_log, OSM_LOG_INFO, + "Found uncached switch/link (lid %u, port %u) - " + "cache is invalid\n", lid_ho, port_num); + osm_ucast_cache_invalidate(p_mgr); + goto Exit; + } + + if (port_num >= p_cache_sw->num_ports || + !p_cache_sw->ports[port_num].remote_lid_ho) { + OSM_LOG(p_mgr->p_log, OSM_LOG_INFO, + "Found uncached switch link (lid %u, port %u) - " + "cache is invalid\n", lid_ho, port_num); + osm_ucast_cache_invalidate(p_mgr); + goto Exit; + } + + if (p_cache_sw->ports[port_num].remote_lid_ho != remote_lid_ho) { + OSM_LOG(p_mgr->p_log, OSM_LOG_INFO, + "Remote lid change on switch lid %u, port %u " + "(was %u, now %u) - cache is invalid\n", + lid_ho, port_num, + p_cache_sw->ports[port_num].remote_lid_ho, + remote_lid_ho); + osm_ucast_cache_invalidate(p_mgr); + goto Exit; + } + + if ((p_cache_sw->ports[port_num].is_leaf && !is_ca) || + (!p_cache_sw->ports[port_num].is_leaf && is_ca)) { + OSM_LOG(p_mgr->p_log, OSM_LOG_INFO, + "Remote node type change on switch lid %u, port %u - " + "cache is invalid\n", + lid_ho, port_num); + osm_ucast_cache_invalidate(p_mgr); + goto Exit; + } + + OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, + "New link from lid %u, port %u to lid %u - " + "found in cache\n", + lid_ho, port_num, remote_lid_ho); + + /* the new link was cached - clean it from the cache */ + + p_cache_sw->ports[port_num].remote_lid_ho = 0; + p_cache_sw->ports[port_num].is_leaf = FALSE; +Exit: + OSM_LOG_EXIT(p_mgr->p_log); +} /* __cache_remove_port() */ + 
+/**********************************************************************
+ **********************************************************************/
+
+static void
+__cache_restore_ucast_info(osm_ucast_mgr_t * p_mgr,
+			   cache_switch_t * p_cache_sw,
+			   osm_switch_t * p_sw)
+{
+	if (!p_mgr->cache_valid)
+		return;
+
+	/* when setting unicast info, the cached port
+	   should have all the required info */
+	CL_ASSERT(p_cache_sw->max_lid_ho && p_cache_sw->lft &&
+		  p_cache_sw->num_hops && p_cache_sw->hops);
+
+	p_sw->max_lid_ho = p_cache_sw->max_lid_ho;
+
+	if (p_sw->lft_buf)
+		free(p_sw->lft_buf);
+	p_sw->lft_buf = p_cache_sw->lft;
+	p_cache_sw->lft = NULL;
+
+	p_sw->num_hops = p_cache_sw->num_hops;
+	p_cache_sw->num_hops = 0;
+	if (p_sw->hops)
+		free(p_sw->hops);
+	p_sw->hops = p_cache_sw->hops;
+	p_cache_sw->hops = NULL;
+}
+
+/**********************************************************************
+ **********************************************************************/
+
+static void
+__ucast_cache_dump(osm_ucast_mgr_t * p_mgr)
+{
+	cache_switch_t * p_sw;
+	unsigned i;
+
+	OSM_LOG_ENTER(p_mgr->p_log);
+
+	if (!osm_log_is_active(p_mgr->p_log, OSM_LOG_DEBUG))
+		goto Exit;
+
+	OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
+		"Dumping missing nodes/links as logged by unicast cache:\n");
+	for (p_sw = (cache_switch_t *) cl_qmap_head(&p_mgr->cache_sw_tbl);
+	     p_sw != (cache_switch_t *) cl_qmap_end(&p_mgr->cache_sw_tbl);
+	     p_sw = (cache_switch_t *) cl_qmap_next(&p_sw->map_item)) {
+
+		OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
+			"\t Switch lid %u %s%s\n",
+			__cache_sw_get_base_lid_ho(p_sw),
+			(__cache_sw_is_leaf(p_sw))? "[leaf switch] " : "",
+			(p_sw->dropped)? "[whole switch missing]" : "");
+
+		for (i = 1; i < p_sw->num_ports; i++)
+			if (p_sw->ports[i].remote_lid_ho > 0)
+				OSM_LOG(p_mgr->p_log,
+					OSM_LOG_DEBUG,
+					"\t - port %u -> lid %u %s\n",
+					i, p_sw->ports[i].remote_lid_ho,
+					(p_sw->ports[i].is_leaf) ?
+					"[remote node is leaf]" : "");
+	}
+Exit:
+	OSM_LOG_EXIT(p_mgr->p_log);
+}
+
+/**********************************************************************
+ **********************************************************************/
+
+void
+osm_ucast_cache_invalidate(osm_ucast_mgr_t * p_mgr)
+{
+	cache_switch_t * p_sw;
+	cache_switch_t * p_next_sw;
+
+	OSM_LOG_ENTER(p_mgr->p_log);
+	OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
+		"Invalidating unicast cache\n");
+
+	if (!p_mgr->cache_valid)
+		goto Exit;
+
+	p_mgr->cache_valid = FALSE;
+
+	p_next_sw = (cache_switch_t *) cl_qmap_head(&p_mgr->cache_sw_tbl);
+	while (p_next_sw != (cache_switch_t *) cl_qmap_end(&p_mgr->cache_sw_tbl)) {
+		p_sw = p_next_sw;
+		p_next_sw = (cache_switch_t *) cl_qmap_next(&p_sw->map_item);
+		__cache_sw_destroy(p_sw);
+	}
+	cl_qmap_remove_all(&p_mgr->cache_sw_tbl);
+Exit:
+	OSM_LOG_EXIT(p_mgr->p_log);
+}
+
+/**********************************************************************
+ **********************************************************************/
+
+static void
+ucast_cache_validate(osm_ucast_mgr_t * p_mgr)
+{
+	cache_switch_t * p_cache_sw;
+	cache_switch_t * p_remote_cache_sw;
+	unsigned port_num;
+	unsigned max_ports;
+	uint8_t remote_node_type;
+	uint16_t lid_ho;
+	uint16_t remote_lid_ho;
+	osm_switch_t * p_sw;
+	osm_switch_t * p_remote_sw;
+	osm_node_t * p_node;
+	osm_physp_t * p_physp;
+	osm_physp_t * p_remote_physp;
+	osm_port_t * p_remote_port;
+	cl_qmap_t * p_sw_tbl;
+
+	OSM_LOG_ENTER(p_mgr->p_log);
+	if (!p_mgr->cache_valid)
+		goto Exit;
+
+	/* If there are no switches in the subnet, we are done */
+	p_sw_tbl = &p_mgr->p_subn->sw_guid_tbl;
+	if (cl_qmap_count(p_sw_tbl) == 0) {
+		osm_ucast_cache_invalidate(p_mgr);
+		goto Exit;
+	}
+
+	/*
+	 * Scan all the physical switch ports in the subnet.
+	 * If the port need_update flag is on, check whether
+	 * it's just some node/port reset or a cached topology
+	 * change. Otherwise the cache is invalid.
+	 */
+	for (p_sw = (osm_switch_t *) cl_qmap_head(p_sw_tbl);
+	     p_sw != (osm_switch_t *) cl_qmap_end(p_sw_tbl);
+	     p_sw = (osm_switch_t *) cl_qmap_next(&p_sw->map_item)) {
+
+		p_node = p_sw->p_node;
+
+		lid_ho = cl_ntoh16(osm_node_get_base_lid(p_node,0));
+		p_cache_sw = __cache_get_sw(p_mgr, lid_ho);
+
+		max_ports = osm_node_get_num_physp(p_node);
+
+		/* skip port 0 */
+		for (port_num = 1; port_num < max_ports; port_num++) {
+
+			p_physp = osm_node_get_physp_ptr(p_node, port_num);
+
+			if (!p_physp || !p_physp->p_remote_physp ||
+			    !osm_physp_link_exists(p_physp, p_physp->p_remote_physp))
+				/* no valid link */
+				continue;
+
+			/*
+			 * While scanning all the physical ports in the subnet,
+			 * mark corresponding leaf switches in the cache.
+			 */
+			if (p_cache_sw &&
+			    !p_cache_sw->dropped &&
+			    !__cache_sw_is_leaf(p_cache_sw) &&
+			    p_physp->p_remote_physp->p_node &&
+			    osm_node_get_type(
+				p_physp->p_remote_physp->p_node) !=
+					IB_NODE_TYPE_SWITCH)
+				__cache_sw_set_leaf(p_cache_sw);
+
+			if (!p_physp->need_update)
+				continue;
+
+			OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
+				"Checking switch lid %u, port %u\n",
+				lid_ho, port_num);
+
+			p_remote_physp = osm_physp_get_remote(p_physp);
+			remote_node_type = osm_node_get_type(p_remote_physp->p_node);
+
+			if (remote_node_type == IB_NODE_TYPE_SWITCH)
+				remote_lid_ho = cl_ntoh16(osm_node_get_base_lid(
+					p_remote_physp->p_node, 0));
+			else
+				remote_lid_ho = cl_ntoh16(osm_node_get_base_lid(
+					p_remote_physp->p_node,
+					osm_physp_get_port_num(p_remote_physp)));
+
+			if (!p_cache_sw ||
+			    port_num >= p_cache_sw->num_ports ||
+			    !p_cache_sw->ports[port_num].remote_lid_ho) {
+				/*
+				 * There is some uncached change on the port.
+				 * In general, the reasons might be as follows:
+				 *  - switch reset
+				 *  - port reset (or port down/up)
+				 *  - quick connection location change
+				 *  - new link (or new switch)
+				 *
+				 * First two reasons allow cache usage, while
+				 * the last two reasons should invalidate cache.
+				 *
+				 * In case of quick connection location change,
+				 * cache would have been invalidated by
+				 * osm_ucast_cache_check_new_link() function.
+				 *
+				 * In case of new link between two known nodes,
+				 * cache also would have been invalidated by
+				 * osm_ucast_cache_check_new_link() function.
+				 *
+				 * Another reason is cached link between two
+				 * known switches went back. In this case the
+				 * osm_ucast_cache_check_new_link() function would
+				 * clear both sides of the link from the cache
+				 * during the discovery process, so effectively
+				 * this would be equivalent to port reset.
+				 *
+				 * So three possible reasons remain:
+				 *  - switch reset
+				 *  - port reset (or port down/up)
+				 *  - link of a new switch
+				 *
+				 * To validate cache, we need to check only the
+				 * third reason - link of a new node/switch:
+				 *  - If this is the local switch that is new,
+				 *    then it should have (p_sw->need_update == 2).
+				 *  - If the remote node is switch and it's new,
+				 *    then it also should have
+				 *    (p_sw->need_update == 2).
+				 *  - If the remote node is CA/RTR and it's new,
+				 *    then its port should have is_new flag on.
+				 */
+				if (p_sw->need_update == 2) {
+					OSM_LOG(p_mgr->p_log, OSM_LOG_INFO,
+						"New switch found (lid %u) - "
+						"cache is invalid\n",
+						lid_ho);
+					osm_ucast_cache_invalidate(p_mgr);
+					goto Exit;
+				}
+
+				if (remote_node_type == IB_NODE_TYPE_SWITCH) {
+
+					p_remote_sw = p_remote_physp->p_node->sw;
+					if (p_remote_sw->need_update == 2) {
+						/* this could also be case of
+						   switch coming back with an
+						   additional link that it
+						   didn't have before */
+						OSM_LOG(p_mgr->p_log, OSM_LOG_INFO,
+							"New switch/link found (lid %u) - "
+							"cache is invalid\n",
+							remote_lid_ho);
+						osm_ucast_cache_invalidate(p_mgr);
+						goto Exit;
+					}
+				}
+				else {
+					/*
+					 * Remote node is CA/RTR.
+					 * Get p_port of the remote node and
+					 * check its p_port->is_new flag.
+					 */
+					p_remote_port = osm_get_port_by_guid(
+						p_mgr->p_subn,
+						osm_physp_get_port_guid(p_remote_physp));
+					if (p_remote_port->is_new) {
+						OSM_LOG(p_mgr->p_log, OSM_LOG_INFO,
+							"New CA/RTR found (lid %u) - "
+							"cache is invalid\n",
+							remote_lid_ho);
+						osm_ucast_cache_invalidate(p_mgr);
+						goto Exit;
+					}
+				}
+			}
+			else {
+				/*
+				 * The change on the port is cached.
+				 * In general, the reasons might be as follows:
+				 *  - link between two known nodes went back
+				 *  - one or more nodes went back, causing all
+				 *    the links to reappear
+				 *
+				 * If it was link that went back, then this case
+				 * would have been taken care of during the
+				 * discovery by osm_ucast_cache_check_new_link(),
+				 * so it's some node that went back.
+				 */
+				if ((p_cache_sw->ports[port_num].is_leaf &&
+				     remote_node_type == IB_NODE_TYPE_SWITCH) ||
+				    (!p_cache_sw->ports[port_num].is_leaf &&
+				     remote_node_type != IB_NODE_TYPE_SWITCH)) {
+					OSM_LOG(p_mgr->p_log, OSM_LOG_INFO,
+						"Remote node type change on switch lid %u, port %u - "
+						"cache is invalid\n",
+						lid_ho, port_num);
+					osm_ucast_cache_invalidate(p_mgr);
+					goto Exit;
+				}
+
+				if (p_cache_sw->ports[port_num].remote_lid_ho !=
+				    remote_lid_ho) {
+					OSM_LOG(p_mgr->p_log, OSM_LOG_INFO,
+						"Remote lid change on switch lid %u, port %u "
+						"(was %u, now %u) - cache is invalid\n",
+						lid_ho, port_num,
+						p_cache_sw->ports[port_num].remote_lid_ho,
+						remote_lid_ho);
+					osm_ucast_cache_invalidate(p_mgr);
+					goto Exit;
+				}
+
+				/*
+				 * We don't care which node has reappeared
+				 * in the subnet (local or remote).
+				 * What's important is that the cached link
+				 * matches the real fabric's link.
+				 * Just clean it from cache.
+				 */
+
+				p_cache_sw->ports[port_num].remote_lid_ho = 0;
+				p_cache_sw->ports[port_num].is_leaf = FALSE;
+				if (p_cache_sw->dropped) {
+					__cache_restore_ucast_info(
+						p_mgr, p_cache_sw, p_sw);
+					p_cache_sw->dropped = FALSE;
+				}
+
+				OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
+					"Restored link from cache: lid %u, port %u to lid %u\n",
+					lid_ho, port_num, remote_lid_ho);
+			}
+		}
+	}
+
+	/* Remove all the cached switches that
+	   have all their ports restored */
+	__cache_cleanup_switches(p_mgr);
+
+	/*
+	 * Done scanning all the physical switch ports in the subnet.
+	 * Now we need to check the other side:
+	 * Scan all the cached switches and their ports:
+	 *  - If the cached switch is missing in the subnet
+	 *    (dropped flag is on), check that it's a leaf switch.
+	 *    If it's not a leaf, the cache is invalid, because
+	 *    cache can tolerate only leaf switch removal.
+	 *  - If the cached switch exists in fabric, check all
+	 *    its cached ports. These cached ports represent
+	 *    missing link in the fabric.
+	 *    The missing links that can be tolerated are:
+	 *      + link to missing CA/RTR
+	 *      + link to missing leaf switch
+	 */
+	for (p_cache_sw = (cache_switch_t *) cl_qmap_head(&p_mgr->cache_sw_tbl);
+	     p_cache_sw != (cache_switch_t *) cl_qmap_end(&p_mgr->cache_sw_tbl);
+	     p_cache_sw = (cache_switch_t *) cl_qmap_next(&p_cache_sw->map_item)) {
+
+		if (p_cache_sw->dropped) {
+			if (!__cache_sw_is_leaf(p_cache_sw)){
+				OSM_LOG(p_mgr->p_log, OSM_LOG_INFO,
+					"Missing non-leaf switch (lid %u) - "
+					"cache is invalid\n",
+					__cache_sw_get_base_lid_ho(p_cache_sw));
+				osm_ucast_cache_invalidate(p_mgr);
+				goto Exit;
+			}
+
+			OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
+				"Missing leaf switch (lid %u) - "
+				"continuing validation\n",
+				__cache_sw_get_base_lid_ho(p_cache_sw));
+			continue;
+		}
+
+		for (port_num = 1; port_num < p_cache_sw->num_ports; port_num++) {
+			if (!p_cache_sw->ports[port_num].remote_lid_ho)
+				continue;
+
+			if (p_cache_sw->ports[port_num].is_leaf){
+				CL_ASSERT(__cache_sw_is_leaf(p_cache_sw));
+				OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
+					"Switch lid %u, port %u: missing link to CA/RTR - "
+					"continuing validation\n",
+					__cache_sw_get_base_lid_ho(p_cache_sw), port_num);
+				continue;
+			}
+
+			p_remote_cache_sw = __cache_get_sw(p_mgr,
+				p_cache_sw->ports[port_num].remote_lid_ho);
+
+			if (!p_remote_cache_sw || !p_remote_cache_sw->dropped) {
+				OSM_LOG(p_mgr->p_log, OSM_LOG_INFO,
+					"Switch lid %u, port %u: missing link to existing switch - "
+					"cache is invalid\n",
+					__cache_sw_get_base_lid_ho(p_cache_sw), port_num);
+				osm_ucast_cache_invalidate(p_mgr);
+				goto Exit;
+			}
+
+			if (!__cache_sw_is_leaf(p_remote_cache_sw)) {
+				OSM_LOG(p_mgr->p_log, OSM_LOG_INFO,
+					"Switch lid %u, port %u: missing link to non-leaf switch - "
+					"cache is invalid\n",
+					__cache_sw_get_base_lid_ho(p_cache_sw), port_num);
+				osm_ucast_cache_invalidate(p_mgr);
+				goto Exit;
+			}
+
+			/*
+			 * At this point we know that the missing link is to
+			 * a leaf switch. However, one case deserves a special
+			 * treatment. If there was a link between two leaf
+			 * switches, then missing leaf switch might break
+			 * routing. It is possible that there are routes
+			 * that use leaf switches to get from switch to switch
+			 * and not just to get to the CAs behind the leaf switch.
+			 */
+			if (__cache_sw_is_leaf(p_cache_sw) &&
+			    __cache_sw_is_leaf(p_remote_cache_sw)) {
+				OSM_LOG(p_mgr->p_log, OSM_LOG_INFO,
+					"Switch lid %u, port %u: missing leaf-2-leaf link - "
+					"cache is invalid\n",
+					__cache_sw_get_base_lid_ho(p_cache_sw), port_num);
+				osm_ucast_cache_invalidate(p_mgr);
+				goto Exit;
+			}
+
+			OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
+				"Switch lid %u, port %u: missing remote leaf switch - "
+				"continuing validation\n",
+				__cache_sw_get_base_lid_ho(p_cache_sw), port_num);
+		}
+	}
+
+	OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
+		"Unicast cache is valid\n");
+	__ucast_cache_dump(p_mgr);
+Exit:
+	OSM_LOG_EXIT(p_mgr->p_log);
+} /* ucast_cache_validate() */
+
+/**********************************************************************
+ **********************************************************************/
+
+void
+osm_ucast_cache_check_new_link(osm_ucast_mgr_t * p_mgr,
+			       osm_node_t * p_node_1,
+			       uint8_t port_num_1,
+			       osm_node_t * p_node_2,
+			       uint8_t port_num_2)
+{
+	uint16_t lid_ho_1;
+	uint16_t lid_ho_2;
+
+	OSM_LOG_ENTER(p_mgr->p_log);
+
+	if (!p_mgr->cache_valid)
+		goto Exit;
+
+	__cache_check_link_change(p_mgr,
+		osm_node_get_physp_ptr(p_node_1, port_num_1),
+		osm_node_get_physp_ptr(p_node_2, port_num_2));
+
+	if (!p_mgr->cache_valid)
+		goto Exit;
+
+	if (osm_node_get_type(p_node_1) != IB_NODE_TYPE_SWITCH &&
+	    osm_node_get_type(p_node_2) != IB_NODE_TYPE_SWITCH) {
+		OSM_LOG(p_mgr->p_log, OSM_LOG_INFO,
+			"Found CA/RTR-2-CA/RTR link - cache is invalid\n");
+		osm_ucast_cache_invalidate(p_mgr);
+		goto Exit;
+	}
+
+	/* for code simplicity, we want the first node to be switch */
+	if (osm_node_get_type(p_node_1) != IB_NODE_TYPE_SWITCH) {
+		osm_node_t * tmp_node = p_node_1;
+		uint8_t tmp_port_num = port_num_1;
+		p_node_1 = p_node_2;
+		port_num_1 = port_num_2;
+		p_node_2 = tmp_node;
+		port_num_2 = tmp_port_num;
+	}
+
+	lid_ho_1 = cl_ntoh16(osm_node_get_base_lid(p_node_1, 0));
+
+	if (osm_node_get_type(p_node_2) == IB_NODE_TYPE_SWITCH)
+		lid_ho_2 = cl_ntoh16(
+			osm_node_get_base_lid(p_node_2, 0));
+	else
+		lid_ho_2 = cl_ntoh16(
+			osm_node_get_base_lid(p_node_2, port_num_2));
+
+	if (!lid_ho_1 || !lid_ho_2) {
+		/*
+		 * No lid assigned, which means that one of the nodes is new.
+		 * Need to wait for lid manager to process this node.
+		 * The switches and their links will be checked later when
+		 * the whole cache validity will be verified.
+		 */
+		OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
+			"Link port %u <-> %u reveals new node - cache will "
+			"be validated later\n",
+			port_num_1, port_num_2);
+		goto Exit;
+	}
+
+	__cache_remove_port(p_mgr, lid_ho_1, port_num_1, lid_ho_2,
+		(osm_node_get_type(p_node_2) != IB_NODE_TYPE_SWITCH));
+
+	/* if node_2 is a switch, the link should be cleaned from its cache */
+
+	if (osm_node_get_type(p_node_2) == IB_NODE_TYPE_SWITCH)
+		__cache_remove_port(p_mgr, lid_ho_2,
+				    port_num_2, lid_ho_1, FALSE);
+
+Exit:
+	OSM_LOG_EXIT(p_mgr->p_log);
+} /* osm_ucast_cache_check_new_link() */
+
+/**********************************************************************
+ **********************************************************************/
+
+void
+osm_ucast_cache_add_link(osm_ucast_mgr_t * p_mgr,
+			 osm_physp_t * p_physp1,
+			 osm_physp_t * p_physp2)
+{
+	osm_node_t * p_node_1 = p_physp1->p_node,
+		   * p_node_2 = p_physp2->p_node;
+	uint16_t lid_ho_1, lid_ho_2;
+
+	OSM_LOG_ENTER(p_mgr->p_log);
+
+	if (!p_mgr->cache_valid)
+		goto Exit;
+
+	if (osm_node_get_type(p_node_1) != IB_NODE_TYPE_SWITCH &&
+	    osm_node_get_type(p_node_2) != IB_NODE_TYPE_SWITCH) {
+		OSM_LOG(p_mgr->p_log, OSM_LOG_INFO,
+			"Dropping CA-2-CA link - cache invalid\n");
+		osm_ucast_cache_invalidate(p_mgr);
+		goto Exit;
+	}
+
+	if ((osm_node_get_type(p_node_1) == IB_NODE_TYPE_SWITCH &&
+	     !osm_node_get_physp_ptr(p_node_1, 0)) ||
+	    (osm_node_get_type(p_node_2) == IB_NODE_TYPE_SWITCH &&
+	     !osm_node_get_physp_ptr(p_node_2, 0))) {
+		/* we're caching a link when one of the nodes
+		   has already been dropped and cached */
+		OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
+			"Port %u <-> port %u: port0 on one of the nodes "
+			"has already been dropped and cached\n",
+			p_physp1->port_num, p_physp2->port_num);
+		goto Exit;
+	}
+
+	/* One of the nodes is switch. Just for code
+	   simplicity, make sure that it's the first node. */
+
+	if (osm_node_get_type(p_node_1) != IB_NODE_TYPE_SWITCH) {
+		osm_physp_t *tmp = p_physp1;
+		p_physp1 = p_physp2;
+		p_physp2 = tmp;
+		p_node_1 = p_physp1->p_node;
+		p_node_2 = p_physp2->p_node;
+	}
+
+	if (!p_node_1->sw) {
+		/* something is wrong - we'd better not use cache */
+		osm_ucast_cache_invalidate(p_mgr);
+		goto Exit;
+	}
+
+	lid_ho_1 = cl_ntoh16(osm_node_get_base_lid(p_node_1, 0));
+
+	if (osm_node_get_type(p_node_2) == IB_NODE_TYPE_SWITCH) {
+
+		if (!p_node_2->sw) {
+			/* something is wrong - we'd better not use cache */
+			osm_ucast_cache_invalidate(p_mgr);
+			goto Exit;
+		}
+
+		lid_ho_2 = cl_ntoh16(osm_node_get_base_lid(p_node_2, 0));
+
+		/* lost switch-2-switch link - cache both sides */
+		__cache_add_port(p_mgr, lid_ho_1, p_physp1->port_num,
+				 lid_ho_2, FALSE);
+		__cache_add_port(p_mgr, lid_ho_2, p_physp2->port_num,
+				 lid_ho_1, FALSE);
+	}
+	else {
+		lid_ho_2 = cl_ntoh16(osm_physp_get_base_lid(p_physp2));
+
+		/* lost link to CA/RTR - cache only switch side */
+		__cache_add_port(p_mgr, lid_ho_1, p_physp1->port_num,
+				 lid_ho_2, TRUE);
+	}
+
+Exit:
+	OSM_LOG_EXIT(p_mgr->p_log);
+} /* osm_ucast_cache_add_link() */
+
+/**********************************************************************
+ **********************************************************************/
+
+void
+osm_ucast_cache_add_node(osm_ucast_mgr_t * p_mgr,
+			 osm_node_t * p_node)
+{
+	uint16_t lid_ho;
+	uint8_t max_ports;
+	uint8_t port_num;
+	osm_physp_t * p_physp;
+	cache_switch_t * p_cache_sw;
+
+	OSM_LOG_ENTER(p_mgr->p_log);
+
+	if (!p_mgr->cache_valid)
+		goto Exit;
+
+	if (osm_node_get_type(p_node) == IB_NODE_TYPE_SWITCH) {
+
+		lid_ho = cl_ntoh16(osm_node_get_base_lid(p_node,0));
+
+		OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
+			"Caching dropped switch lid %u\n", lid_ho);
+
+		if (!p_node->sw) {
+			/* something is wrong - forget about cache */
+			OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
+				"ERR AD03: no switch info for node lid %u - "
+				"clearing cache\n", lid_ho);
+			osm_ucast_cache_invalidate(p_mgr);
+			goto Exit;
+		}
+
+		/* unlink (add to cache) all the ports of this switch */
+		max_ports = osm_node_get_num_physp(p_node);
+		for (port_num = 1; port_num < max_ports; port_num++) {
+
+			p_physp = osm_node_get_physp_ptr(p_node, port_num);
+			if (!p_physp || !p_physp->p_remote_physp)
+				continue;
+
+			osm_ucast_cache_add_link(p_mgr, p_physp,
+						 p_physp->p_remote_physp);
+		}
+
+		/*
+		 * All the ports have been dropped (cached).
+		 * If one of the ports was connected to CA/RTR,
+		 * then the cached switch would be marked as leaf.
+		 * If it isn't, then the dropped switch isn't a leaf,
+		 * and cache can't handle it.
+		 */
+
+		p_cache_sw = __cache_get_sw(p_mgr, lid_ho);
+		CL_ASSERT(p_cache_sw);
+
+		if (!__cache_sw_is_leaf(p_cache_sw)) {
+			OSM_LOG(p_mgr->p_log, OSM_LOG_INFO,
+				"Dropped non-leaf switch (lid %u) - "
+				"cache is invalid\n", lid_ho);
+			osm_ucast_cache_invalidate(p_mgr);
+			goto Exit;
+		}
+
+		p_cache_sw->dropped = TRUE;
+
+		if (!p_node->sw->num_hops || !p_node->sw->hops) {
+			OSM_LOG(p_mgr->p_log, OSM_LOG_INFO,
+				"No LID matrices for switch lid %u - "
+				"cache is invalid\n", lid_ho);
+			osm_ucast_cache_invalidate(p_mgr);
+			goto Exit;
+		}
+
+		/* lid matrices */
+
+		p_cache_sw->num_hops = p_node->sw->num_hops;
+		p_node->sw->num_hops = 0;
+		p_cache_sw->hops = p_node->sw->hops;
+		p_node->sw->hops = NULL;
+
+		/* linear forwarding table */
+
+		p_cache_sw->lft = p_node->sw->lft_buf;
+		p_node->sw->lft_buf = NULL;
+		p_cache_sw->max_lid_ho = p_node->sw->max_lid_ho;
+	}
+	else {
+		/* dropping CA/RTR: add to cache all the ports of this node */
+		max_ports = osm_node_get_num_physp(p_node);
+		for (port_num = 1; port_num < max_ports; port_num++) {
+
+			p_physp = osm_node_get_physp_ptr(p_node, port_num);
+			if (!p_physp || !p_physp->p_remote_physp)
+				continue;
+
+			CL_ASSERT(osm_node_get_type(
+				p_physp->p_remote_physp->p_node) ==
+				IB_NODE_TYPE_SWITCH);
+
+			osm_ucast_cache_add_link(p_mgr,
+						 p_physp->p_remote_physp,
+						 p_physp);
+		}
+	}
+Exit:
+	OSM_LOG_EXIT(p_mgr->p_log);
+} /* osm_ucast_cache_add_node() */
+
+/**********************************************************************
+ **********************************************************************/
+
+int
+osm_ucast_cache_process(osm_ucast_mgr_t * p_mgr)
+{
+	cl_qmap_t *tbl = &p_mgr->p_subn->sw_guid_tbl;
+	cl_map_item_t *item;
+
+	if (!p_mgr->p_subn->opt.use_ucast_cache)
+		return 1;
+
+	ucast_cache_validate(p_mgr);
+	if (!p_mgr->cache_valid)
+		return 1;
+
+	OSM_LOG(p_mgr->p_log, OSM_LOG_INFO,
+		"Configuring switch tables using cached routing\n");
+
+	for (item = cl_qmap_head(tbl); item != cl_qmap_end(tbl);
+	     item = cl_qmap_next(item))
+		osm_ucast_mgr_set_fwd_table(p_mgr, (osm_switch_t *)item);
+
+	return 0;
+}
+
+/**********************************************************************
+ **********************************************************************/
-- 
1.5.1.4


From kliteyn at dev.mellanox.co.il Wed Oct 15 15:02:00 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 16 Oct 2008 00:02:00 +0200
Subject: [ofa-general] [PATCH 4/6 v2] opensm/Unicast Routing Cache: compile cache files
Message-ID: <48F66858.1040705@dev.mellanox.co.il>

Adding cache files to makefile.

Signed-off-by: Yevgeny Kliteynik
---
 opensm/opensm/Makefile.am |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm/Makefile.am b/opensm/opensm/Makefile.am
index 1b82a86..1d345a5 100644
--- a/opensm/opensm/Makefile.am
+++ b/opensm/opensm/Makefile.am
@@ -54,7 +54,7 @@ opensm_SOURCES = main.c osm_console_io.c osm_console.c osm_db_files.c \
 		 osm_ucast_lash.c osm_ucast_file.c osm_ucast_ftree.c \
 		 osm_vl15intf.c osm_vl_arb_rcv.c \
 		 st.c osm_perfmgr.c osm_perfmgr_db.c \
-		 osm_event_plugin.c osm_dump.c \
+		 osm_event_plugin.c osm_dump.c osm_ucast_cache.c \
 		 osm_qos_parser_y.y osm_qos_parser_l.l osm_qos_policy.c
 
 AM_YFLAGS:= -d
@@ -113,6 +113,7 @@ opensminclude_HEADERS = \
 	$(srcdir)/../include/opensm/osm_subnet.h \
 	$(srcdir)/../include/opensm/osm_switch.h \
 	$(srcdir)/../include/opensm/osm_ucast_mgr.h \
+	$(srcdir)/../include/opensm/osm_ucast_cache.h \
 	$(srcdir)/../include/opensm/osm_vl15intf.h \
 	$(top_builddir)/include/opensm/osm_version.h \
 	$(top_builddir)/include/opensm/osm_config.h
-- 
1.5.1.4


From kliteyn at dev.mellanox.co.il Wed Oct 15 15:19:43 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 16 Oct 2008 00:19:43 +0200
Subject: [ofa-general] [PATCH 5/6 v2] opensm/Unicast Routing Cache: integrate cache into opensm
Message-ID: <48F66C7F.5040209@dev.mellanox.co.il>

Integrating unicast cache into the discovery and ucast manager.

Unicast cache is integrated as part of the osm_ucast_mgr_t object.
The cache is marked valid after the first routing engine execution.
During the discovery or in the drop mgr, whenever some node/port
goes down or is disconnected, it is cached in the unicast cache.
Whenever a new node or port is spotted, the unicast cache checks
whether it is really a new node/port or something that was
previously cached.
As long as fabric changes do not require full routing recalculation,
unicast cache will prevent osm_ucast_mgr_process() execution (which
also prevents routing engine execution).
When the topology change is not something that can be tolerated by
the unicast cache, the cache is invalidated, all its data structures
are freed, new routing is calculated by the routing engine, and the
cache is marked as valid again.

Signed-off-by: Yevgeny Kliteynik
---
 opensm/include/opensm/osm_ucast_mgr.h |    9 +++++++++
 opensm/opensm/osm_drop_mgr.c          |   10 +++++++++-
 opensm/opensm/osm_node_info_rcv.c     |    9 ++++++++-
 opensm/opensm/osm_port_info_rcv.c     |    8 +++++++-
 opensm/opensm/osm_state_mgr.c         |   12 ++++++++++--
 opensm/opensm/osm_ucast_mgr.c         |   10 ++++++++++
 6 files changed, 53 insertions(+), 5 deletions(-)

diff --git a/opensm/include/opensm/osm_ucast_mgr.h b/opensm/include/opensm/osm_ucast_mgr.h
index 27e89e9..13b17b8 100644
--- a/opensm/include/opensm/osm_ucast_mgr.h
+++ b/opensm/include/opensm/osm_ucast_mgr.h
@@ -49,6 +49,7 @@
 #include
 #include
 #include
+#include
 
 #ifdef __cplusplus
 # define BEGIN_C_DECLS extern "C" {
@@ -97,6 +98,8 @@ typedef struct osm_ucast_mgr {
 	cl_qlist_t port_order_list;
 	boolean_t is_dor;
 	boolean_t some_hop_count_set;
+	cl_qmap_t cache_sw_tbl;
+	boolean_t cache_valid;
 } osm_ucast_mgr_t;
 /*
 * FIELDS
@@ -128,6 +131,12 @@ typedef struct osm_ucast_mgr {
 *	tables calculation iteration cycle, set to TRUE to indicate
 *	that some hop count changes were done.
 *
+*	cache_sw_tbl
+*		Cached switches table.
+*
+*	cache_valid
+*		TRUE if the unicast cache is valid.
+*
 * SEE ALSO
 *	Unicast Manager object
 *********/
diff --git a/opensm/opensm/osm_drop_mgr.c b/opensm/opensm/osm_drop_mgr.c
index 8c6e7fb..45fc670 100644
--- a/opensm/opensm/osm_drop_mgr.c
+++ b/opensm/opensm/osm_drop_mgr.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2008 Xsigo Systems Inc. All rights reserved.
  *
@@ -61,6 +61,7 @@
 #include
 #include
 #include
+#include
 
 /**********************************************************************
  **********************************************************************/
@@ -134,6 +135,10 @@ static void drop_mgr_clean_physp(osm_sm_t * sm, IN osm_physp_t * p_physp)
 				 (p_remote_physp->p_node)),
 				p_remote_physp->port_num);
 
+			if (sm->ucast_mgr.cache_valid)
+				osm_ucast_cache_add_link(&sm->ucast_mgr,
+							 p_physp, p_remote_physp);
+
 			osm_physp_unlink(p_physp, p_remote_physp);
 
 		}
@@ -308,6 +313,9 @@ __osm_drop_mgr_process_node(osm_sm_t * sm, IN osm_node_t * p_node)
 		"Unreachable node 0x%016" PRIx64 "\n",
 		cl_ntoh64(osm_node_get_node_guid(p_node)));
 
+	if (sm->ucast_mgr.cache_valid)
+		osm_ucast_cache_add_node(&sm->ucast_mgr, p_node);
+
 	/*
 	   Delete all the logical and physical port objects
 	   associated with this node.
diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
index a37ce0a..ee2f8e8 100644
--- a/opensm/opensm/osm_node_info_rcv.c
+++ b/opensm/opensm/osm_node_info_rcv.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
  * This software is available to you under a choice of one of two
@@ -59,6 +59,7 @@
 #include
 #include
 #include
+#include
 
 static void
 report_duplicated_guid(IN osm_sm_t * sm,
@@ -240,6 +241,12 @@ __osm_ni_rcv_set_links(IN osm_sm_t * sm,
 		cl_ntoh64(osm_node_get_node_guid(p_node)), port_num,
 		cl_ntoh64(p_ni_context->node_guid), p_ni_context->port_num);
 
+	if (sm->ucast_mgr.cache_valid)
+		osm_ucast_cache_check_new_link(&sm->ucast_mgr,
+					       p_node, port_num,
+					       p_neighbor_node,
+					       p_ni_context->port_num);
+
 	osm_node_link(p_node, port_num, p_neighbor_node,
 		      p_ni_context->port_num);
 
diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c
index 73afd8e..efb8830 100644
--- a/opensm/opensm/osm_port_info_rcv.c
+++ b/opensm/opensm/osm_port_info_rcv.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
  * This software is available to you under a choice of one of two
@@ -60,6 +60,7 @@
 #include
 #include
 #include
+#include
 
 /**********************************************************************
  **********************************************************************/
@@ -244,6 +245,11 @@ __osm_pi_rcv_process_switch_port(IN osm_sm_t * sm,
 						      (p_remote_node)),
 					remote_port_num);
 
+				if (sm->ucast_mgr.cache_valid)
+					osm_ucast_cache_add_link(&sm->ucast_mgr,
+								 p_physp,
+								 p_remote_physp);
+
 				osm_node_unlink(p_node, (uint8_t) port_num,
 						p_remote_node,
 						(uint8_t) remote_port_num);
diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index b4eb87b..ce010cb 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 2002-2008 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
  * This software is available to you under a choice of one of two
@@ -1075,6 +1075,10 @@ static void do_sweep(osm_sm_t * sm)
 		/* Re-program the switches fully */
 		sm->p_subn->ignore_existing_lfts = TRUE;
 
+		/* we want to re-route, so cache should be invalidated */
+		if (sm->p_subn->opt.use_ucast_cache)
+			osm_ucast_cache_invalidate(&sm->ucast_mgr);
+
 		osm_ucast_mgr_process(&sm->ucast_mgr);
 
 		/* Reset flag */
@@ -1229,7 +1233,11 @@ _repeat_discovery:
 	/*
 	 * Proceed with unicast forwarding table configuration.
 	 */
-	osm_ucast_mgr_process(&sm->ucast_mgr);
+
+	if (!sm->ucast_mgr.cache_valid ||
+	    osm_ucast_cache_process(&sm->ucast_mgr))
+		osm_ucast_mgr_process(&sm->ucast_mgr);
+
 	if (wait_for_pending_transactions(&sm->p_subn->p_osm->stats))
 		return;
 
diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index 2dc5dd4..3bc3912 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -73,6 +73,10 @@ void osm_ucast_mgr_destroy(IN osm_ucast_mgr_t * const p_mgr)
 	CL_ASSERT(p_mgr);
 
 	OSM_LOG_ENTER(p_mgr->p_log);
+
+	if (p_mgr->cache_valid)
+		osm_ucast_cache_invalidate(p_mgr);
+
 	OSM_LOG_EXIT(p_mgr->p_log);
 }
 
@@ -92,6 +96,9 @@ osm_ucast_mgr_init(IN osm_ucast_mgr_t * const p_mgr, IN osm_sm_t * sm)
 	p_mgr->p_subn = sm->p_subn;
 	p_mgr->p_lock = sm->p_lock;
 
+	if (sm->p_subn->opt.use_ucast_cache)
+		cl_qmap_init(&p_mgr->cache_sw_tbl);
+
 	OSM_LOG_EXIT(p_mgr->p_log);
 	return (status);
 }
@@ -840,6 +847,9 @@ int osm_ucast_mgr_process(IN osm_ucast_mgr_t * const p_mgr)
 		"%s tables configured on all switches\n",
 		osm_routing_engine_type_str(p_osm->routing_engine_used));
 
+	if (p_mgr->p_subn->opt.use_ucast_cache)
+		p_mgr->cache_valid = TRUE;
+
 Exit:
 	CL_PLOCK_RELEASE(p_mgr->p_lock);
 	OSM_LOG_EXIT(p_mgr->p_log);
-- 
1.5.1.4


From kliteyn at dev.mellanox.co.il Wed Oct 15 15:21:20 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 16 Oct 2008 00:21:20 +0200
Subject: [ofa-general] [PATCH 6/6 v2] opensm/Unicast Routing Cache: manpage entry
Message-ID: <48F66CE0.3020903@dev.mellanox.co.il>

Signed-off-by: Yevgeny Kliteynik
---
 opensm/man/opensm.8.in |   14 +++++++++++++-
 1 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/opensm/man/opensm.8.in b/opensm/man/opensm.8.in
index c1ea584..efadf8e 100644
--- a/opensm/man/opensm.8.in
+++ b/opensm/man/opensm.8.in
@@ -10,7 +10,7 @@ opensm \- InfiniBand subnet manager and administration (SM/SA)
 [\-g(uid) ] [\-l(mc) ] [\-p(riority) ]
 [\-smkey ] [\-r(eassign_lids)]
 [\-R | \-\-routing_engine ]
-[\-z | \-\-connect_roots]
+[\-A | \-\-ucast_cache] [\-z | \-\-connect_roots]
 [\-M | \-\-lid_matrix_file ]
 [\-U | \-\-lfts_file ]
 [\-S | \-\-sadb_file ] [\-a | \-\-root_guid_file ]
@@ -122,6 +122,18 @@ separated by commas so that specific ordering of routing algorithms will
 be tried if earlier routing engines fail.
 Supported engines: minhop, updn, file, ftree, lash, dor
 .TP
+\fB\-A\fR, \fB\-\-ucast_cache\fR
+This option enables unicast routing cache and prevents routing
+recalculation (which is a heavy task in a large cluster) when
+there was no topology change detected during the heavy sweep, or
+when the topology change does not require new routing calculation,
+e.g. when one or more CAs/RTRs/leaf switches go down, or one or
+more of these nodes come back after being down.
+A very common case that is handled by the unicast routing cache
+is host reboot, which otherwise would cause two full routing
+recalculations: one when the host goes down, and the other when
+the host comes back online.
+.TP \fB\-z\fR, \fB\-\-connect_roots\fR This option enforces a routing engine (currently up/down only) to make connectivity between root switches and in -- 1.5.1.4 From bohnsack at cdsinc.com Wed Oct 15 15:36:08 2008 From: bohnsack at cdsinc.com (Matthew Bohnsack) Date: Wed, 15 Oct 2008 16:36:08 -0600 Subject: [ofa-general] Questions Concerning a 3D Torus Message-ID: <1224110168.28670.66.camel@localhost.localdomain> Hello, I have a number of questions related to the construction and operation of a 3D torus with Infiniband. We would like to create an IB network arranged as a 3D mesh with wrap-around links. That is, a 3D torus. Each vertex of this torus would be an Infiniscale IV (I4), having single 4x connections to up to 12 host ConnectX-based HCAs and 3-4x connections to its neighbors in each of six dimensions. The smallest network that illustrates the most interesting aspects of this setup is a 3x3x3 torus. I've created a basic illustration of this here. Pick your favorite file format: http://bohnsack.com/3DTorus.svg http://bohnsack.com/3DTorus.pdf In this diagram, each square is a single I4, and each line represents 3 independent 4x connections. To avoid making the diagram too complicated, host connections are only shown for a single switch. You should imagine 12 hosts hanging off of each square. Again, to avoid an unreadable diagram, not all of the Y dimension wrap-around links are shown, but you should consider them present for the purposes of the network I'm describing. Questions: 1) What do you call the topology I'm describing, strictly speaking? It's kind of like each switch chip vertex has a sub-graph connected to all the host HCAs. Perhaps this thing is a "decorated 3D Torus"? 2) I think this network is a little bit different than the 3D tori that have been previously deployed in machines like Red Storm where there is a network switch for one or at maximum a very few compute clients. 
Does the fact that there are on the order of 10 times more hosts hanging off of each switch chip vertex in the network I'm describing matter from a routing perspective? It seems that the routing problem is mostly the same. I.e., algorithms to determine a set of deadlock-free routes on the same basic topology, ignoring the "decorations", are similar. Is this right?

3) Is there good current support for computing deadlock-free routes on the network I'm describing in OFED 1.3.1, 1.4, or other? With which routing algorithm? I tried to find an answer to this by looking through various OFED documentation, but I still have a bit of confusion on how to proceed. Here's the data that I was able to gather. Can someone please help to clarify?

- An OpenSM wiki page says that OpenSM supports "Torus routing":
  https://wiki.openfabrics.org/tiki-index.php?page=OpenSM&highlight=opensm
- However, the latest release notes I can find don't make any explicit mention of tori:
  http://www.openfabrics.org/git/?p=~sashak/management.git;a=blob_plain;f=opensm/doc/opensm_release_notes-3.2.txt;hb=HEAD
- The release notes do mention Dimension Order routing (DOR), and this might work for a 3D torus, but it seems (per the notes) that this algorithm, as implemented, is only considered deadlock-free for meshes (no wrap-around) and hypercubes - no tori. I understand that when you "wrap around" the dimension, the virtual channel used needs to change to avoid deadlock in DOR, and DOR as implemented today doesn't do that.
- Commit b204932d5bd2a88af5ce0989d2dff65d753b3d54 from git://git.openfabrics.org/~sashak/management.git in March of 2007 mentions some degree of success with LASH on 2D tori, but it's considered "unoptimized". Would this work deadlock-free for a 3D torus? What's the implication of "unoptimized" on something like an 8x8x8 torus with lots of hosts at each vertex?
- I didn't see any other mention of a torus or tori in the OpenSM commit logs.
8) Is there an existing utility inside of OFED that can be used to verify that routes generated by the SM are deadlock free? I.e., can I dump routes from OpenSM and then run a utility on them that can identify potentials for deadlock?

9) I'm aware of ibsim, and others I'm collaborating with have done the first level of testing with it on 3D tori. My question is how much more simulation can I do on a "mock" topology with freely-available tools? Am I limited to MAD traffic, or is there a way to simulate a real workload (perhaps MPI traffic)? How about commercial tools?

10) As I previously mentioned, there are three 4x links running between each pair of neighboring switch chips, along each dimension. Our plan is to run these as 3 independent links, but it should be possible to logically aggregate them into a single 12x link. At first glance, a 12x link might be subject to less congestion, but it could also incur store-and-forward penalties, producing unwanted latency as 4x flows are converted to 12x and then back to 4x again. My questions around this topic: Is any additional insight into this issue available (theoretical or empirical)? Should we even worry about testing 12x connections?

11) An artifact that results from the way we intend to connect the I4 switches together is that there's a possibility of having 9-4x connections between every other link in one dimension. E.g., looking in the Z dimension, switch-to-switch "widths" could alternate between 9-4x and 3-4x links. Links in other dimensions would all be limited to 3-4x as before. The end result of this configuration would be that certain groups of two I4s and their connected hosts along one dimension would have a little bit better connectivity than other I4 groupings. This kind of configuration in our setup would be "free" by simply doing some logical configuration. This change may or may not be beneficial depending on our applications and workloads. I'm not so concerned about that issue at present.
My main question today is whether this kind of "heterogeneity" would cause routing issues/complications. Would it? If not, it might be a no-brainer to enable it. What do you think? A diagram of the heterogeneous 3D torus I'm talking about is available here:

http://bohnsack.com/3DTorus-heterogeneous-Z.svg
http://bohnsack.com/3DTorus-heterogeneous-Z.pdf

Thanks in advance for your help,

-Matthew

From vlad at lists.openfabrics.org Thu Oct 16 03:17:34 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Thu, 16 Oct 2008 03:17:34 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20081016-0200 daily build status
Message-ID: <20081016101734.569C5E60DBC@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with
linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From kliteyn at dev.mellanox.co.il Thu Oct 16 04:42:59 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 16 Oct 2008 13:42:59 +0200 Subject: [ofa-general] [PATCH] opensm/doc/current-routing.txt: added ucast cache info Message-ID: <48F728C3.9010602@dev.mellanox.co.il> Added ucast cache info in current-routing.txt Signed-off-by: Yevgeny Kliteynik --- opensm/doc/current-routing.txt | 11 +++++++++++ 1 files changed, 11 insertions(+), 0 deletions(-) diff --git a/opensm/doc/current-routing.txt b/opensm/doc/current-routing.txt index 2bb6c2f..af008bc 100644 --- a/opensm/doc/current-routing.txt +++ b/opensm/doc/current-routing.txt @@ -28,6 +28,17 @@ two switches. This provides deadlock free routes for hypercubes when the fabric is cabled as a hypercube and for meshes when cabled as a mesh (see details below). +OpenSM provides an optional unicast routing cache (enabled by -A or +--ucast_cache options). When enabled, unicast routing cache prevents +routing recalculation (which is a heavy task in a large cluster) when +there was no topology change detected during the heavy sweep, or when +the topology change does not require new routing calculation, e.g. when +one or more CAs/RTRs/leaf switches going down, or one or more of these +nodes coming back after being down. 
+A very common case that is handled by the unicast routing cache is host +reboot, which otherwise would cause two full routing recalculations: one +when the host goes down, and the other when the host comes back online. + OpenSM also supports a file method which can load routes from a table. See modular-routing.txt for more information on this. -- 1.5.1.4 From ogerlitz at voltaire.com Thu Oct 16 06:13:04 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 16 Oct 2008 15:13:04 +0200 (IST) Subject: [ofa-general] [PATCH] IB/ipoib: enhance ethtool support Message-ID: Add a get rx checksum callback, remove the driver own get tso callback as the ethtool kernel code uses a default one if nothing is provided. Signed-off-by: Or Gerlitz Index: linux-2.6.27/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c =================================================================== --- linux-2.6.27.orig/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c +++ linux-2.6.27/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c @@ -42,6 +42,13 @@ static void ipoib_get_drvinfo(struct net strncpy(drvinfo->driver, "ipoib", sizeof(drvinfo->driver) - 1); } +static u32 ipoib_get_rx_csum(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + return test_bit(IPOIB_FLAG_CSUM, &priv->flags) && + !test_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags); +} + static int ipoib_get_coalesce(struct net_device *dev, struct ethtool_coalesce *coal) { @@ -129,7 +136,7 @@ static void ipoib_get_ethtool_stats(stru static const struct ethtool_ops ipoib_ethtool_ops = { .get_drvinfo = ipoib_get_drvinfo, - .get_tso = ethtool_op_get_tso, + .get_rx_csum = ipoib_get_rx_csum, .get_coalesce = ipoib_get_coalesce, .set_coalesce = ipoib_set_coalesce, .get_flags = ethtool_op_get_flags, From ogerlitz at voltaire.com Thu Oct 16 06:13:46 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 16 Oct 2008 15:13:46 +0200 (IST) Subject: [ofa-general] [PATCH] IB/ipoib: fix netdev offload features support for child (VLAN) 
devices Message-ID: Child devices were created without any offload features set, fix this by moving the code that computes the features into generic function which is now called through non-child and child device creation. Signed-off-by: Or Gerlitz Index: linux-2.6.27/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-2.6.27.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ linux-2.6.27/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1185,11 +1185,48 @@ int ipoib_add_pkey_attr(struct net_devic return device_create_file(&dev->dev, &dev_attr_pkey); } +int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca) +{ + struct ib_device_attr *device_attr; + int result = -ENOMEM; + + device_attr = kmalloc(sizeof *device_attr, GFP_KERNEL); + if (!device_attr) { + printk(KERN_WARNING "%s: allocation of %zu bytes failed\n", + hca->name, sizeof *device_attr); + return result; + } + + result = ib_query_device(hca, device_attr); + if (result) { + printk(KERN_WARNING "%s: ib_query_device failed (ret = %d)\n", + hca->name, result); + kfree(device_attr); + return result; + } + priv->hca_caps = device_attr->device_cap_flags; + + kfree(device_attr); + + if (priv->hca_caps & IB_DEVICE_UD_IP_CSUM) { + set_bit(IPOIB_FLAG_CSUM, &priv->flags); + priv->dev->features |= NETIF_F_SG | NETIF_F_IP_CSUM; + } + + if (lro) + priv->dev->features |= NETIF_F_LRO; + + if (priv->dev->features & NETIF_F_SG && priv->hca_caps & IB_DEVICE_UD_TSO) + priv->dev->features |= NETIF_F_TSO; + + return 0; +} + + static struct net_device *ipoib_add_port(const char *format, struct ib_device *hca, u8 port) { struct ipoib_dev_priv *priv; - struct ib_device_attr *device_attr; struct ib_port_attr attr; int result = -ENOMEM; @@ -1218,31 +1255,8 @@ static struct net_device *ipoib_add_port goto device_init_failed; } - device_attr = kmalloc(sizeof *device_attr, GFP_KERNEL); - if (!device_attr) { - printk(KERN_WARNING "%s: allocation of 
%zu bytes failed\n", - hca->name, sizeof *device_attr); + if (ipoib_set_dev_features(priv, hca)) goto device_init_failed; - } - - result = ib_query_device(hca, device_attr); - if (result) { - printk(KERN_WARNING "%s: ib_query_device failed (ret = %d)\n", - hca->name, result); - kfree(device_attr); - goto device_init_failed; - } - priv->hca_caps = device_attr->device_cap_flags; - - kfree(device_attr); - - if (priv->hca_caps & IB_DEVICE_UD_IP_CSUM) { - set_bit(IPOIB_FLAG_CSUM, &priv->flags); - priv->dev->features |= NETIF_F_SG | NETIF_F_IP_CSUM; - } - - if (lro) - priv->dev->features |= NETIF_F_LRO; /* * Set the full membership bit, so that we join the right @@ -1278,9 +1292,6 @@ static struct net_device *ipoib_add_port goto event_failed; } - if (priv->dev->features & NETIF_F_SG && priv->hca_caps & IB_DEVICE_UD_TSO) - priv->dev->features |= NETIF_F_TSO; - result = register_netdev(priv->dev); if (result) { printk(KERN_WARNING "%s: couldn't register ipoib port %d; error %d\n", Index: linux-2.6.27/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- linux-2.6.27.orig/drivers/infiniband/ulp/ipoib/ipoib.h +++ linux-2.6.27/drivers/infiniband/ulp/ipoib/ipoib.h @@ -509,6 +509,7 @@ int ipoib_pkey_dev_delay_open(struct net void ipoib_drain_cq(struct net_device *dev); void ipoib_set_ethtool_ops(struct net_device *dev); +int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca); #ifdef CONFIG_INFINIBAND_IPOIB_CM Index: linux-2.6.27/drivers/infiniband/ulp/ipoib/ipoib_vlan.c =================================================================== --- linux-2.6.27.orig/drivers/infiniband/ulp/ipoib/ipoib_vlan.c +++ linux-2.6.27/drivers/infiniband/ulp/ipoib/ipoib_vlan.c @@ -93,6 +93,9 @@ int ipoib_vlan_add(struct net_device *pd priv->mcast_mtu = priv->admin_mtu = priv->dev->mtu; set_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags); + if (ipoib_set_dev_features(priv, ppriv->ca)) + goto device_init_failed; + priv->pkey = 
pkey; memcpy(priv->dev->dev_addr, ppriv->dev->dev_addr, INFINIBAND_ALEN); From ogerlitz at voltaire.com Thu Oct 16 06:20:50 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 16 Oct 2008 15:20:50 +0200 (IST) Subject: [ofa-general] [PATCH v2] IB/ipoib: fix netdev offload features support for child (VLAN) devices In-Reply-To: References: Message-ID: Child devices were created without any offload features set, fix this by moving the code that computes the features into generic function which is now called through non-child and child device creation. Signed-off-by: Or Gerlitz -- v1 has a bug where the 'result' flag in ipoib_vlan_add may be used uninitialized Index: linux-2.6.27/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- linux-2.6.27.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ linux-2.6.27/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1185,11 +1185,48 @@ int ipoib_add_pkey_attr(struct net_devic return device_create_file(&dev->dev, &dev_attr_pkey); } +int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca) +{ + struct ib_device_attr *device_attr; + int result = -ENOMEM; + + device_attr = kmalloc(sizeof *device_attr, GFP_KERNEL); + if (!device_attr) { + printk(KERN_WARNING "%s: allocation of %zu bytes failed\n", + hca->name, sizeof *device_attr); + return result; + } + + result = ib_query_device(hca, device_attr); + if (result) { + printk(KERN_WARNING "%s: ib_query_device failed (ret = %d)\n", + hca->name, result); + kfree(device_attr); + return result; + } + priv->hca_caps = device_attr->device_cap_flags; + + kfree(device_attr); + + if (priv->hca_caps & IB_DEVICE_UD_IP_CSUM) { + set_bit(IPOIB_FLAG_CSUM, &priv->flags); + priv->dev->features |= NETIF_F_SG | NETIF_F_IP_CSUM; + } + + if (lro) + priv->dev->features |= NETIF_F_LRO; + + if (priv->dev->features & NETIF_F_SG && priv->hca_caps & IB_DEVICE_UD_TSO) + priv->dev->features |= NETIF_F_TSO; + + return 0; +} + + 
static struct net_device *ipoib_add_port(const char *format, struct ib_device *hca, u8 port) { struct ipoib_dev_priv *priv; - struct ib_device_attr *device_attr; struct ib_port_attr attr; int result = -ENOMEM; @@ -1218,31 +1255,8 @@ static struct net_device *ipoib_add_port goto device_init_failed; } - device_attr = kmalloc(sizeof *device_attr, GFP_KERNEL); - if (!device_attr) { - printk(KERN_WARNING "%s: allocation of %zu bytes failed\n", - hca->name, sizeof *device_attr); + if (ipoib_set_dev_features(priv, hca)) goto device_init_failed; - } - - result = ib_query_device(hca, device_attr); - if (result) { - printk(KERN_WARNING "%s: ib_query_device failed (ret = %d)\n", - hca->name, result); - kfree(device_attr); - goto device_init_failed; - } - priv->hca_caps = device_attr->device_cap_flags; - - kfree(device_attr); - - if (priv->hca_caps & IB_DEVICE_UD_IP_CSUM) { - set_bit(IPOIB_FLAG_CSUM, &priv->flags); - priv->dev->features |= NETIF_F_SG | NETIF_F_IP_CSUM; - } - - if (lro) - priv->dev->features |= NETIF_F_LRO; /* * Set the full membership bit, so that we join the right @@ -1278,9 +1292,6 @@ static struct net_device *ipoib_add_port goto event_failed; } - if (priv->dev->features & NETIF_F_SG && priv->hca_caps & IB_DEVICE_UD_TSO) - priv->dev->features |= NETIF_F_TSO; - result = register_netdev(priv->dev); if (result) { printk(KERN_WARNING "%s: couldn't register ipoib port %d; error %d\n", Index: linux-2.6.27/drivers/infiniband/ulp/ipoib/ipoib.h =================================================================== --- linux-2.6.27.orig/drivers/infiniband/ulp/ipoib/ipoib.h +++ linux-2.6.27/drivers/infiniband/ulp/ipoib/ipoib.h @@ -509,6 +509,7 @@ int ipoib_pkey_dev_delay_open(struct net void ipoib_drain_cq(struct net_device *dev); void ipoib_set_ethtool_ops(struct net_device *dev); +int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca); #ifdef CONFIG_INFINIBAND_IPOIB_CM Index: linux-2.6.27/drivers/infiniband/ulp/ipoib/ipoib_vlan.c 
===================================================================
--- linux-2.6.27.orig/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
+++ linux-2.6.27/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
@@ -93,6 +93,10 @@ int ipoib_vlan_add(struct net_devic
 	priv->mcast_mtu = priv->admin_mtu = priv->dev->mtu;
 	set_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags);
 
+	result = ipoib_set_dev_features(priv, ppriv->ca);
+	if (result)
+		goto device_init_failed;
+
 	priv->pkey = pkey;
 
 	memcpy(priv->dev->dev_addr, ppriv->dev->dev_addr, INFINIBAND_ALEN);

From tziporet at mellanox.co.il Thu Oct 16 06:53:15 2008
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Thu, 16 Oct 2008 15:53:15 +0200
Subject: [ofa-general] OFED October 15 2008 meeting summary
In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EADB77090@mtlexch01.mtl.com>
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EADB77482@mtlexch01.mtl.com>

OFED October 15 2008 meeting summary on OFED 1.4 status toward RC3:

Meeting minutes on the web:
http://www.openfabrics.org/txt/documentation/linux/EWG_meeting_minutes/

Meeting Summary:
==============
1. RC3 is delayed - will be released early next week
2. GA date was moved to Nov 10
3. NFS-RDMA will be disabled by default and its quality level will be technical preview
4. RDS development must move to a new branch since the latest changes broke functionality
5. Three issues found in the interop event - bugs should be opened

Details:
======
1. OFED 1.4 status:
- Moved to kernel 2.6.27
- MPI: Open MPI 1.2.8 ready, MVAPICH2 and MVAPICH updated.
- uDAPL: compat-dapl-1.2.11 and dapl-2.0.14 released
- NFS-RDMA: critical bug with NFS-RDMA on SLES 10 SP2 - still being worked on; should be fixed for RC3
  Decision: NFS-RDMA will not be installed by default.
  Jeff will see if this can be a separate RPM
  Reason: This is a new component that was not tested enough
- OSM: Cached routing - should be checked in once the V2 patches are ready (patches were sent yesterday)
- RC3 schedule: Going to build it on Sunday, and release on Monday
- RDS: We see many changes in RDS that break functionality - need to open a stable branch for the release

2. Rupert reported on three issues found in the Interop event at UNH:
- IPoIB - issue with packet loss when the packet size is larger than 8196 bytes
- SRP - issue with one vendor on large table size
- Link is not up if OSM is not running, for some cards
Rupert will make sure that bugs are opened.

3. GA date:
Decided on a new target date: 10 Nov 2008
This means the Logo program at UNH will be done after SC08

4. Bugs review:
1128 blo Othe stefan.roscher at de.ibm.com release IPoIB-CM QP resources in flushing CQE context
1278 blo RHEL perkinjo at cse.ohio-state.edu Mvapich2 fails to compile when uDAPL option is specified
1242 cri RHEL yannick.cote at qlogic.com kernel panic while running mpi2007 against ofed1.4 -- ib_...
  - should be for rc3
1257 cri All eli at mellanox.co.il Severe performance penalty for PCIe strict ordering
  - in progress
1262 cri Othe andy.grover at oracle.com congestion hang with RDS
1164 maj SLES yosefe at voltaire.com iperf over IPoIB fails for 100 tcp connections
  - in progress
1221 maj SLES Jeffrey.C.Becker at nasa.gov SLES10 sp2: remote logins via ssh fail due to rpcbind and...
1248 maj SLES monis at voltaire.com Bonding - after reboot the host stucks while raising the ...
  - not reproduced
1282 cri RHEL amirv at mellanox.co.il Kernel panic during Netperf run

Tziporet

From jimkress_35 at kressworks.com Thu Oct 16 09:57:44 2008
From: jimkress_35 at kressworks.com (Jim Kress)
Date: Thu, 16 Oct 2008 12:57:44 -0400
Subject: [ofa-general] 32 bit mvapich 1.0.1 shared libraries
Message-ID: <1CEE7F87A9234E5398239BE834FA4147@inspiron9100>

When installing OFED 1.3.1, the person doing the install is presented with an option:

--build32 Build 32-bit libraries. Relevant for x86_64 and ppc64 platforms

However, no 32 bit shared libraries are created for mvapich-1.0.1, even though the installation script finishes with "installation successful".

How does one get the 32 bit mvapich-1.0.1 libraries built and where will they be located?

Thanks,

Jim Kress

From murray at tradeworx.com Thu Oct 16 10:21:02 2008
From: murray at tradeworx.com (murray smigel)
Date: Thu, 16 Oct 2008 13:21:02 -0400
Subject: [ofa-general] problems with epoll and ipoib (memory leak and cpu creep)
Message-ID: <48F777FE.7080601@tradeworx.com>

Hi,

We have an application using epoll to listen to a group of UDP multicast broadcasts that come in over a set of ethernet ports built into a Voltaire ISR2004 switch via the IPR module. The IPR ports are assigned ethernet addresses and routing is set up to listen to the multicasts over the appropriate IPR port. There are a large number of multicast groups involved, ~100, distributed over two IPR ports.

When we run the application using poll, things work fine (except for the occasional dropping of packets due to the large set of fds passed to poll). To try to remedy the problem, we switched to epoll. Now, as the program runs, the CPU utilization rises over time towards 100% and the memory usage grows as well.

The same epoll based program runs fine when it is on a machine with physical ethernet ports (eth1 eth2) rather than ipoib mapped IPR ports (ib0.8200, ib0.8600).

Hardware is x86-64 based dual x dual-core Intel processors.
We are running Debian Etch with a vanilla 2.6.26.5 kernel and the ofed stack that is part of the standard kernel distro.

Any analysis or suggestions would be appreciated.

thanks,
murray smigel

From michael.heinz at qlogic.com Thu Oct 16 12:28:33 2008
From: michael.heinz at qlogic.com (Mike Heinz)
Date: Thu, 16 Oct 2008 14:28:33 -0500
Subject: [ofa-general] [PATCH] mvapich2-trunk-3073
Message-ID:

The mpivars.sh and mpivars.csh scripts (created by the RPM spec file) do not set the LD_LIBRARY_PATH, which means that mvapich2 programs may not run or link unless the path is explicitly set.
A similar problem was found with mvapich earlier this year.

[mheinz at homer SRPMS]> diff -ud mvapich2.spec.orig mvapich2.spec
--- mvapich2.spec.orig 2008-10-16 15:19:36.000000000 -0400
+++ mvapich2.spec 2008-10-16 15:22:40.000000000 -0400
@@ -126,6 +126,14 @@
     set path = ( %{_prefix}/bin )
 endif
 
+if ("1" == "\$?LD_LIBRARY_PATH") then
+    if ("\$LD_LIBRARY_PATH" !~ *%{_prefix}/lib) then
+        setenv LD_LIBRARY_PATH %{_prefix}/lib:\${LD_LIBRARY_PATH}
+    endif
+else
+    setenv LD_LIBRARY_PATH %{_prefix}/lib
+endif
+
 if (\$?MANPATH) then
     if ( "\${MANPATH}" !~ *%{_prefix}/man* ) then
         setenv MANPATH %{_prefix}/man:\$MANPATH
@@ -140,6 +148,10 @@
     PATH=%{_prefix}/bin:\${PATH}
 fi
 
+if ! echo \${LD_LIBRARY_PATH} | grep -q %{_prefix}/lib ; then
+    export LD_LIBRARY_PATH=%{_prefix}/lib:\${LD_LIBRARY_PATH}
+fi
+
 if ! echo \${MANPATH} | grep -q %{_prefix}/man ; then
     MANPATH=%{_prefix}/man:\${MANPATH}
 fi

--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania

From mottyg at voltaire.com Thu Oct 16 11:12:10 2008
From: mottyg at voltaire.com (Motty Grosman)
Date: Thu, 16 Oct 2008 20:12:10 +0200
Subject: [ofa-general] US Case 00012515: problems with epoll and ipoib (memory leak and cpu creep) [[ ref:00D38IO.50085NTZf:ref ]]
In-Reply-To: <48F777FE.7080601@tradeworx.com>
References: <48F777FE.7080601@tradeworx.com>
Message-ID: <39C75744D164D948A170E9792AF8E7CA01989A67@exil.voltaire.com>

Hi Murray,

1. As we discussed, please use this link to the pre-release VMA:
   www.voltaire.com/ftp/support-products/Tradewox/libvma-2.1.8-0-RH-x86_64.rpm

2. In general, CPU utilization rises while using VMA. Please find attached the "VMA optimization guide", which aims to help you optimize your system and its performance.
   BTW, when using VMA, make sure the "umcast" flag is enabled. Please see the VMA optimization guide for more details.

Should you have any further questions, please let me know.
Best Regards, Motty Grosman | +978-439-5428 | +978-9955-978(m) System Engineer Voltaire - The Grid Backbone www.voltaire.com -----Original Message----- From: murray smigel [mailto:murray at tradeworx.com] Sent: Thursday, October 16, 2008 13:21 To: general at lists.openfabrics.org; support; Joohan Lee Subject: problems with epoll and ipoib (memory leak and cpu creep) Hi, We have an application using epoll to listen to a group of udp multicast broadcasts that come in over a set of ethernet ports built into a Voltaire ISR2004 switch via the IPR module. The IPR ports are assigned ethernet addresses and routing is set up to listen to the multicasts over the appropriate IPR port. There are a large number of multicast groups involved ~100 distributed over two IPR ports. When we run the application using poll, things work fine (except for the occasional dropping of packets due to the large set of fds passed to poll). To try to remedy the problem, we switched to epoll. Now, as the program runs, the cpu utilization rises over time towards 100% and the memory usage grows as well. The same epoll based program runs fine when it is on a machine with physical ethernet ports (eth1 eth2) rather then ipoib mapped IPR ports (ib0.8200, ib0.8600). Hardware is x86-64 based dualxdual core intel processors. We are running Debian Etch with a vanilla 2.6.26.5 kernel and the ofed stack that is part of the standard kernel distro. Any analysis or suggestions would be appreciated. thanks, murray smigel -------------- next part -------------- A non-text attachment was scrubbed... Name: VMA_User_Manual_DOC-00393-A00.pdf Type: application/octet-stream Size: 548224 bytes Desc: VMA_User_Manual_DOC-00393-A00.pdf URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: VMA Optimizations.doc Type: application/msword Size: 58880 bytes Desc: VMA Optimizations.doc URL: From perkinjo at cse.ohio-state.edu Thu Oct 16 12:52:13 2008 From: perkinjo at cse.ohio-state.edu (Jonathan Perkins) Date: Thu, 16 Oct 2008 15:52:13 -0400 Subject: [ofa-general] [PATCH] mvapich2-trunk-3073 In-Reply-To: References: Message-ID: <20081016195213.GQ3120@cse.ohio-state.edu> Mike: Thanks for the patch. I'll take a look at this and include it in our next srpm. Please note that you can mail mvapich-discuss at cse.ohio-state.edu for issues related to mvapich or mvapich2. On Thu, Oct 16, 2008 at 02:28:33PM -0500, Mike Heinz wrote: > The mpivars.sh and mpivars.csh scripts (created by the RPM spec file) do > not set the LD_LIBRARY_PATH, which means that mvapich2 programs may not > run or link unless the path is explicitly set. A similar problem was > found with mvapich earlier this year. > > [mheinz at homer SRPMS]> diff -ud mvapich2.spec.orig mvapich2.spec > --- mvapich2.spec.orig 2008-10-16 15:19:36.000000000 -0400 > +++ mvapich2.spec 2008-10-16 15:22:40.000000000 -0400 > @@ -126,6 +126,14 @@ > set path = ( %{_prefix}/bin ) > endif > > +if ("1" == "\$?LD_LIBRARY_PATH") then > + if ("\$LD_LIBRARY_PATH" !~ *%{_prefix}/lib) then > + setenv LD_LIBRARY_PATH %{_prefix}/lib:\${LD_LIBRARY_PATH} > + endif > +else > + setenv LD_LIBRARY_PATH %{_prefix}/lib > +endif > + > if (\$?MANPATH) then > if ( "\${MANPATH}" !~ *%{_prefix}/man* ) then > setenv MANPATH %{_prefix}/man:\$MANPATH > @@ -140,6 +148,10 @@ > PATH=%{_prefix}/bin:\${PATH} > fi > > +if ! echo \${LD_LIBRARY_PATH} | grep -q %{_prefix}/lib ; then > + export LD_LIBRARY_PATH=%{_prefix}/lib:\${LD_LIBRARY_PATH} > +fi > + > if ! 
echo \${MANPATH} | grep -q %{_prefix}/man ; then > MANPATH=%{_prefix}/man:\${MANPATH} > fi > > -- > Michael Heinz > Principal Engineer, Qlogic Corporation > King of Prussia, Pennsylvania > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo From vst at vlnb.net Thu Oct 16 12:57:29 2008 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Thu, 16 Oct 2008 23:57:29 +0400 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48ECEA4D.7080504@harr.org> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> Message-ID: <48F79CA9.8090806@vlnb.net> (Sorry for delay. I was busy with related task, without completing which we can't go further.) Cameron Harr wrote: > Vu Pham wrote: >> Cameron Harr wrote: >>> One thing that makes results hard to interpret is that they vary >>> enormously. I've been doing more testing with 3 physical LUNs >>> (instead of two) on the target, srpt_thread=0, and changing between >>> scst_thread=[1,2,3]. With scst_thread=1, I'm fairly low (50K IOPs), >>> while at 2 and three threads, the results are higher, though in all >>> cases, the context switches are low, often less than 1:1. 
>>>
>> Can you test again with srpt_thread=0,1 and scst_threads=1,2,3 in
>> NULLIO mode (with 1,2,3 export NULLIO luns)
> srpt_thread=0:
> scst_t:  |     1     |      2      |      3      |
> ---------|-----------|-------------|-------------|
> 1 LUN*   |    54K    |   54K-75K   |   54K-75K   |
> 2 LUNs*  | 120K-200K | 150K-200K** | 120K-180K** |
> 3 LUNs*  | 170K-195K | 160K-195K   | 130K-170K** |
>
> srpt_thread=1:
> scst_t:  |     1     |      2      |      3      |
> ---------|-----------|-------------|-------------|
> 1 LUN*   |    74K    |     54K     |     55K     |
> 2 LUNs*  | 140K-190K | 130K-200K   | 150K-220K   |
> 3 LUNs*  | 170K-195K | 170K-195K   | 175K-195K   |
>
> * a FIO (benchmark) process was run for each LUN, so when there were 3
> LUNs, there were three FIO processes running simultaneously.

What FIO script do you use? Also, how long does each run take? Usually big variations are due to test runs that are too short. Also, it would be better if you used O_DIRECT mode or the sg interface directly to pass requests to the target. Otherwise, the variations can be due to cache activity on the initiators.

> ** Sometimes the benchmark "zombied" (process doing no work, but process
> can't be killed) after running a certain amount of time. However, it
> wasn't repeatable in a reliable way, so I mark that this particular run
> has zombied before.

That means that there is a bug somewhere. Usually such bugs are found in a few hours of code auditing (the srpt driver is pretty simple) or by using kernel debug facilities (example diff to .config attached). I personally always prefer to put my effort into fixing real things, not inventing various workarounds, like srpt_thread in this case. So I would:

1. Completely remove the srpt thread and all related code. It doesn't do anything that can't be done in SIRQ context (tasklet).

2. Audit the code to check whether it does anything it shouldn't do in SIRQ context, and fix it. This step isn't required, but it usually saves a lot of puzzled debugging time in the future.

3. In srpt_handle_rdma_comp() and srpt_handle_new_iu(), change SCST_CONTEXT_THREAD to SCST_CONTEXT_DIRECT_ATOMIC.
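As an aside on the O_DIRECT suggestion earlier in this reply, here is a minimal sketch (not the FIO job used in these tests) of what a cache-bypassing 512-byte write looks like on the initiator side. With O_DIRECT the buffer address, the transfer length and the file offset generally all have to be aligned to the device's logical block size. The helper names, the 512-byte SECTOR_SIZE and any device path passed in are assumptions made for the example.

```c
#define _GNU_SOURCE          /* exposes O_DIRECT in <fcntl.h> on Linux */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SECTOR_SIZE 512      /* assumed logical block size of the LUN */

/* Allocate a zero-filled buffer of nsectors sectors whose address is
 * aligned to SECTOR_SIZE. Returns NULL on allocation failure. */
void *alloc_sector_buf(size_t nsectors)
{
	void *buf = NULL;

	assert(nsectors > 0);
	if (posix_memalign(&buf, SECTOR_SIZE, nsectors * SECTOR_SIZE) != 0)
		return NULL;
	memset(buf, 0, nsectors * SECTOR_SIZE);
	return buf;
}

/* Return 1 if address, length and offset are all sector aligned. */
int is_direct_io_aligned(const void *ptr, size_t len, off_t off)
{
	return ((uintptr_t)ptr % SECTOR_SIZE == 0) &&
	       (len % SECTOR_SIZE == 0) &&
	       (off % SECTOR_SIZE == 0);
}

/* Write exactly one sector at the given sector index, bypassing the
 * page cache. Returns 0 on success, -1 on any failure. */
int write_sector_direct(const char *dev_path, off_t sector, const void *buf)
{
	off_t off = sector * SECTOR_SIZE;
	ssize_t n;
	int fd;

	if (!is_direct_io_aligned(buf, SECTOR_SIZE, off))
		return -1;

	fd = open(dev_path, O_WRONLY | O_DIRECT);
	if (fd < 0)
		return -1;

	n = pwrite(fd, buf, SECTOR_SIZE, off);
	close(fd);
	return n == SECTOR_SIZE ? 0 : -1;
}
```

Calling write_sector_direct() on a raw device overwrites data there, so it should only ever be pointed at a scratch LUN.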
Then I would run the problematic tests (e.g. a heavy tpc-h workload) on a debug kernel and fix the problems found.

Anyway, Cameron, can you get the latest code from SCST trunk and try with it? It was recently updated. Also please add the case with the changes from (3) above.

> - Note 1: There were a number of outliers (often between 98K and 230K),
> but I tried to capture where the bulk of the activity happened. It's
> still somewhat of a rough guess though. Where the range is large, it
> usually means the results were just really scattered.
>
> Summary: It's hard to draw a good summary due to the variation of
> results. I would say the runs with srpt_thread=1 tended to have fewer
> outliers at the beginning, but as time went on, they scattered as well.
> Running with 2 or 3 threads almost seems to be a toss-up.
>>> Also a little disconcerting is that my average request size on the
>>> target has gotten larger. I'm always writing 512B packets, and when I
>>> run on one initiator, the average reqsz is around 600-800B. When I
>>> add an initiator, the average reqsz basically doubles and is now
>>> around 1200 - 1600B. I'm specifying direct IO in the test and scst is
>>> configured as blockio (and thus direct IO), but it appears something
>>> is cached at some point and seems to be coalesced when another
>>> initiator is involved. Does this seem odd or normal? This shows true
>>> whether the initiators are writing to different partitions on the
>>> same LUN or the same LUN with no partitions.
>> What io scheduler are you running on local storage? Since you are
>> using blockio you should play around with io scheduler's tuned
>> parameters (for example deadline scheduler: front_merges,
>> writes_starved,...) Please see ~/Documentation/block/*.txt
> I'm using CFQ. Months ago, I tried different schedulers with their
> default options and saw basically no difference.
I can try some of that > again; however I don't believe I can tune the schedulers because my back > end doesn't give me a "queue" directory in /sys/block// > > -Cameron > From vst at vlnb.net Thu Oct 16 12:58:48 2008 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Thu, 16 Oct 2008 23:58:48 +0400 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48ED3489.4030905@harr.org> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> Message-ID: <48F79CF8.3010905@vlnb.net> Cameron Harr wrote: > Cameron Harr wrote: >>>> Also a little disconcerting is that my average request size on the >>>> target has gotten larger. I'm always writing 512B packets, and when >>>> I run on one initiator, the average reqsz is around 600-800B. When I >>>> add an initiator, the average reqsz basically doubles and is now >>>> around 1200 - 1600B. I'm specifying direct IO in the test and scst >>>> is configured as blockio (and thus direct IO), but it appears >>>> something is cached at some point and seems to be coalesced when >>>> another initiator is involved. Does this seem odd or normal? This >>>> shows true whether the initiators are writing to different >>>> partitions on the same LUN or the same LUN with no partitions. > > I've been doing some testing trying to determine why my average req sz > is bloated beyond the 512B packets I'm sending. It appears to me to be > caused by heavy utilization of the middleware: SRPT or SCST. 
As I add
> processes on an initiator, the ave req sz goes up, and really jumps when
> I have more than 2 processes (running on 1 or 2 initiators) or if I'm
> writing to the same target LUN. My hunch is that the calculation of the
> ave req sz over a 1s interval is skewed due to some requests having to
> wait for either the IB layer or the SCST layer.
>
> Thinking that perhaps the srpt_thread was a cause, I turned off
> threading there, but that caused the packet sizing to be much more wild
> - never dropping to 512B and growing to as much as 4KB. Using the
> default deadline scheduler as opposed to the default cfq scheduler
> didn't seem to make a difference.

I guess you use regular caching IO? The smallest packet size it can produce is PAGE_SIZE (4K). The target can't change that. You can have smaller packets only with O_DIRECT or the sg interface. But I'm not sure that would be effective performance-wise. I'd recommend using 4K packets and the deadline IO scheduler.

> Cameron
>

From sriram at pnl.gov Thu Oct 16 13:48:38 2008
From: sriram at pnl.gov (Krishnamoorthy, Sriram)
Date: Thu, 16 Oct 2008 13:48:38 -0700
Subject: [ofa-general] Querying the number of open queue pairs
Message-ID:

I am trying to figure out how many more queue pairs can be created on a device at some point (say after MPI has been initialized). ibv_query_device() returns, among other things, the maximum number of queue pairs that can be created. In OFED 1.3, is there a way to query the number of queue pairs that have been created so far on a device?

Thanks,
Sriram.K
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From kliteyn at dev.mellanox.co.il Thu Oct 16 14:36:23 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 16 Oct 2008 23:36:23 +0200 Subject: [ofa-general] [PATCH 1/2] opensm: replace switch's fwd_tbl with simple LFT Message-ID: <48F7B3D7.3070004@dev.mellanox.co.il> Replace the unnecessarily complex switch's forwarding table implementation with a simple LFT that is implemented as plain uint8_t array. Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_port_profile.h | 1 - opensm/include/opensm/osm_router.h | 1 - opensm/include/opensm/osm_switch.h | 138 +++++++----------------------- opensm/opensm/osm_console.c | 5 +- opensm/opensm/osm_lin_fwd_rcv.c | 2 +- opensm/opensm/osm_sa_lft_record.c | 2 +- opensm/opensm/osm_sw_info_rcv.c | 4 +- opensm/opensm/osm_switch.c | 54 +++++++----- opensm/opensm/osm_ucast_file.c | 2 +- opensm/opensm/osm_ucast_lash.c | 1 - opensm/opensm/osm_ucast_mgr.c | 10 ++- 11 files changed, 77 insertions(+), 143 deletions(-) diff --git a/opensm/include/opensm/osm_port_profile.h b/opensm/include/opensm/osm_port_profile.h index 9b33e3a..be1b850 100644 --- a/opensm/include/opensm/osm_port_profile.h +++ b/opensm/include/opensm/osm_port_profile.h @@ -51,7 +51,6 @@ #include #include #include -#include #include #ifdef __cplusplus diff --git a/opensm/include/opensm/osm_router.h b/opensm/include/opensm/osm_router.h index 8cabdf8..4901aca 100644 --- a/opensm/include/opensm/osm_router.h +++ b/opensm/include/opensm/osm_router.h @@ -48,7 +48,6 @@ #include #include #include -#include #include #include diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h index 3d9a72d..0225f9d 100644 --- a/opensm/include/opensm/osm_switch.h +++ b/opensm/include/opensm/osm_switch.h @@ -48,7 +48,6 @@ #include #include #include -#include #include #include @@ -101,7 +100,7 @@ typedef struct osm_switch { uint16_t num_hops; uint8_t **hops; osm_port_profile_t *p_prof; - osm_fwd_tbl_t fwd_tbl; + uint8_t *lft; uint8_t 
*lft_buf; osm_mcast_tbl_t mcast_tbl; uint32_t discovery_count; @@ -135,8 +134,8 @@ typedef struct osm_switch { * p_prof * Pointer to array of Port Profile objects for this switch. * -* fwd_tbl -* This switch's forwarding table. +* lft +* This switch's linear forwarding table. * * lft_buf * This switch's linear forwarding table, as was @@ -275,33 +274,6 @@ osm_switch_get_hop_count(IN const osm_switch_t * const p_sw, * SEE ALSO *********/ -/****f* OpenSM: Switch/osm_switch_get_fwd_tbl_ptr -* NAME -* osm_switch_get_fwd_tbl_ptr -* -* DESCRIPTION -* Returns a pointer to the switch's forwarding table. -* -* SYNOPSIS -*/ -static inline osm_fwd_tbl_t *osm_switch_get_fwd_tbl_ptr(IN const osm_switch_t * - const p_sw) -{ - return ((osm_fwd_tbl_t *) & p_sw->fwd_tbl); -} -/* -* PARAMETERS -* p_sw -* [in] Pointer to a Switch object. -* -* RETURN VALUES -* Returns a pointer to the switch's forwarding table. -* -* NOTES -* -* SEE ALSO -*********/ - /****f* OpenSM: Switch/osm_switch_set_hops * NAME * osm_switch_set_hops @@ -437,7 +409,9 @@ static inline uint8_t osm_switch_get_port_by_lid(IN const osm_switch_t * const p_sw, IN const uint16_t lid_ho) { - return (osm_fwd_tbl_get(&p_sw->fwd_tbl, lid_ho)); + if (lid_ho == 0 || lid_ho > IB_LID_UCAST_END_HO) + return OSM_NO_PATH; + return p_sw->lft[lid_ho]; } /* * PARAMETERS @@ -500,12 +474,13 @@ static inline osm_physp_t *osm_switch_get_route_by_lid(IN const osm_switch_t * const p_sw, IN const ib_net16_t lid) { - uint8_t port_num; + uint8_t port_num = OSM_NO_PATH; CL_ASSERT(p_sw); CL_ASSERT(lid); - port_num = osm_fwd_tbl_get(&p_sw->fwd_tbl, cl_ntoh16(lid)); + port_num = osm_switch_get_port_by_lid(p_sw, cl_ntoh16(lid)); + /* In order to avoid holes in the subnet (usually happens when running UPDN algorithm), i.e. 
cases where port is @@ -572,35 +547,6 @@ osm_switch_sp0_is_lmc_capable(IN const osm_switch_t * const p_sw, * SEE ALSO *********/ -/****f* OpenSM: Switch/osm_switch_get_max_block_id -* NAME -* osm_switch_get_max_block_id -* -* DESCRIPTION -* Returns the maximum block ID (host order) of this switch. -* -* SYNOPSIS -*/ -static inline uint32_t -osm_switch_get_max_block_id(IN const osm_switch_t * const p_sw) -{ - return ((uint32_t) (osm_fwd_tbl_get_size(&p_sw->fwd_tbl) / - osm_fwd_tbl_get_lids_per_block(&p_sw->fwd_tbl))); -} -/* -* PARAMETERS -* p_sw -* [in] Pointer to an osm_switch_t object. -* -* RETURN VALUES -* Returns the maximum block ID (host order) of this switch. -* -* NOTES -* -* SEE ALSO -* Switch object -*********/ - /****f* OpenSM: Switch/osm_switch_get_max_block_id_in_use * NAME * osm_switch_get_max_block_id_in_use @@ -614,9 +560,8 @@ osm_switch_get_max_block_id(IN const osm_switch_t * const p_sw) static inline uint16_t osm_switch_get_max_block_id_in_use(IN const osm_switch_t * const p_sw) { - return (osm_fwd_tbl_get_max_block_id_in_use(&p_sw->fwd_tbl, - cl_ntoh16(p_sw->switch_info. - lin_top))); + return (uint16_t)(cl_ntoh16(p_sw->switch_info.lin_top) / + IB_SMP_DATA_SIZE); } /* * PARAMETERS @@ -632,19 +577,19 @@ osm_switch_get_max_block_id_in_use(IN const osm_switch_t * const p_sw) * Switch object *********/ -/****f* OpenSM: Switch/osm_switch_get_fwd_tbl_block +/****f* OpenSM: Switch/osm_switch_get_lft_block * NAME -* osm_switch_get_fwd_tbl_block +* osm_switch_get_lft_block * * DESCRIPTION -* Retrieve a forwarding table block. +* Retrieve a linear forwarding table block. 
* * SYNOPSIS */ boolean_t -osm_switch_get_fwd_tbl_block(IN const osm_switch_t * const p_sw, - IN const uint32_t block_id, - OUT uint8_t * const p_block); +osm_switch_get_lft_block(IN const osm_switch_t * const p_sw, + IN const uint32_t block_id, + OUT uint8_t * const p_block); /* * PARAMETERS * p_sw @@ -758,22 +703,30 @@ osm_switch_count_path(IN osm_switch_t * const p_sw, IN const uint8_t port) * SEE ALSO *********/ -/****f* OpenSM: Switch/osm_switch_set_ft_block +/****f* OpenSM: Switch/osm_switch_set_lft_block * NAME -* osm_switch_set_ft_block +* osm_switch_set_lft_block * * DESCRIPTION -* Copies in the specified block into the switch's Forwarding Table object. +* Copies in the specified block into +* the switch's Linear Forwarding Table. * * SYNOPSIS */ static inline ib_api_status_t -osm_switch_set_ft_block(IN osm_switch_t * const p_sw, - IN const uint8_t * const p_block, - IN const uint32_t block_num) +osm_switch_set_lft_block(IN osm_switch_t * const p_sw, + IN const uint8_t * const p_block, + IN const uint32_t block_num) { + uint16_t lid_start = + (uint16_t) (block_num * IB_SMP_DATA_SIZE); CL_ASSERT(p_sw); - return (osm_fwd_tbl_set_block(&p_sw->fwd_tbl, p_block, block_num)); + + if (lid_start + IB_SMP_DATA_SIZE > IB_LID_UCAST_END_HO) + return IB_INVALID_PARAMETER; + + memcpy(&p_sw->lft[lid_start], p_block, IB_SMP_DATA_SIZE); + return IB_SUCCESS; } /* * PARAMETERS @@ -1044,33 +997,6 @@ osm_switch_recommend_mcast_path(IN osm_switch_t * const p_sw, * SEE ALSO *********/ -/****f* OpenSM: Switch/osm_switch_get_fwd_tbl_size -* NAME -* osm_switch_get_fwd_tbl_size -* -* DESCRIPTION -* Returns the number of entries available in the forwarding table. -* -* SYNOPSIS -*/ -static inline uint16_t -osm_switch_get_fwd_tbl_size(IN const osm_switch_t * const p_sw) -{ - return (osm_fwd_tbl_get_size(&p_sw->fwd_tbl)); -} -/* -* PARAMETERS -* p_sw -* [in] Pointer to the switch. -* -* RETURN VALUE -* Returns the number of entries available in the forwarding table. 
-* -* NOTES -* -* SEE ALSO -*********/ - /****f* OpenSM: Switch/osm_switch_get_mcast_fwd_tbl_size * NAME * osm_switch_get_mcast_fwd_tbl_size diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c index 9be88c7..b33ac67 100644 --- a/opensm/opensm/osm_console.c +++ b/opensm/opensm/osm_console.c @@ -52,7 +52,6 @@ #include #include #include -#include struct command { char *name; @@ -766,7 +765,7 @@ static void switchbalance_check(osm_opensm_t * p_osm, continue; for (lid_ho = min_lid_ho; lid_ho <= max_lid_ho; lid_ho++) { - port_num = osm_fwd_tbl_get(&(p_sw->fwd_tbl), lid_ho); + port_num = osm_switch_get_port_by_lid(p_sw, lid_ho); if (port_num == OSM_NO_PATH) continue; @@ -916,7 +915,7 @@ static void lidbalance_check(osm_opensm_t * p_osm, boolean_t rem_node_found = FALSE; unsigned int indx = 0; - port_num = osm_fwd_tbl_get(&(p_sw->fwd_tbl), lid_ho); + port_num = osm_switch_get_port_by_lid(p_sw, lid_ho); if (port_num == OSM_NO_PATH) continue; diff --git a/opensm/opensm/osm_lin_fwd_rcv.c b/opensm/opensm/osm_lin_fwd_rcv.c index c0ec72d..00ab760 100644 --- a/opensm/opensm/osm_lin_fwd_rcv.c +++ b/opensm/opensm/osm_lin_fwd_rcv.c @@ -87,7 +87,7 @@ void osm_lft_rcv_process(IN void *context, IN void *data) "LFT received for nonexistent node " "0x%" PRIx64 "\n", cl_ntoh64(node_guid)); } else { - status = osm_switch_set_ft_block(p_sw, p_block, block_num); + status = osm_switch_set_lft_block(p_sw, p_block, block_num); if (status != IB_SUCCESS) { OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0402: " "Setting forwarding table block failed (%s)" diff --git a/opensm/opensm/osm_sa_lft_record.c b/opensm/opensm/osm_sa_lft_record.c index e1fe8d5..dc3187b 100644 --- a/opensm/opensm/osm_sa_lft_record.c +++ b/opensm/opensm/osm_sa_lft_record.c @@ -100,7 +100,7 @@ __osm_lftr_rcv_new_lftr(IN osm_sa_t * sa, p_rec_item->rec.block_num = block; /* copy the lft block */ - osm_switch_get_fwd_tbl_block(p_sw, block, p_rec_item->rec.lft); + osm_switch_get_lft_block(p_sw, block, 
p_rec_item->rec.lft); cl_qlist_insert_tail(p_list, &p_rec_item->list_item); diff --git a/opensm/opensm/osm_sw_info_rcv.c b/opensm/opensm/osm_sw_info_rcv.c index f75be65..ff3132d 100644 --- a/opensm/opensm/osm_sw_info_rcv.c +++ b/opensm/opensm/osm_sw_info_rcv.c @@ -299,8 +299,8 @@ __osm_si_rcv_process_new(IN osm_sm_t * sm, } /* set subnet max unicast lid to the minimum LinearFDBCap of all switches */ - if (p_sw->fwd_tbl.p_lin_tbl->size < sm->p_subn->max_ucast_lid_ho) { - sm->p_subn->max_ucast_lid_ho = p_sw->fwd_tbl.p_lin_tbl->size; + if (cl_ntoh16(p_si->lin_cap) < sm->p_subn->max_ucast_lid_ho) { + sm->p_subn->max_ucast_lid_ho = cl_ntoh16(p_si->lin_cap); OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, "Subnet max unicast lid is 0x%X\n", sm->p_subn->max_ucast_lid_ho); diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c index 9bf76e0..bdfc7d0 100644 --- a/opensm/opensm/osm_switch.c +++ b/opensm/opensm/osm_switch.c @@ -97,9 +97,26 @@ osm_switch_init(IN osm_switch_t * const p_sw, p_sw->num_ports = num_ports; p_sw->need_update = 2; - status = osm_fwd_tbl_init(&p_sw->fwd_tbl, p_si); - if (status != IB_SUCCESS) + /* Initiate the linear forwarding table */ + + if (!p_si->lin_cap) { + /* This switch does not support linear forwarding tables */ + status = IB_UNSUPPORTED; goto Exit; + } + + /* The capacity reported by the switch includes LID 0, + so add 1 to the end of the range here for this assert. 
*/ + CL_ASSERT(cl_ntoh16(p_si->lin_cap) <= IB_LID_UCAST_END_HO + 1); + + p_sw->lft = malloc(IB_LID_UCAST_END_HO + 1); + if (!p_sw->lft) { + status = IB_INSUFFICIENT_MEMORY; + goto Exit; + } + + /* Initialize the table to OSM_NO_PATH, which is "invalid port" */ + memset(p_sw->lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); p_sw->lft_buf = malloc(IB_LID_UCAST_END_HO + 1); if (!p_sw->lft_buf) { @@ -138,7 +155,8 @@ void osm_switch_delete(IN OUT osm_switch_t ** const pp_sw) osm_mcast_tbl_destroy(&p_sw->mcast_tbl); free(p_sw->p_prof); - osm_fwd_tbl_destroy(&p_sw->fwd_tbl); + if (p_sw->lft) + free(p_sw->lft); if (p_sw->lft_buf) free(p_sw->lft_buf); if (p_sw->hops) { @@ -176,44 +194,36 @@ osm_switch_t *osm_switch_new(IN osm_node_t * const p_node, /********************************************************************** **********************************************************************/ boolean_t -osm_switch_get_fwd_tbl_block(IN const osm_switch_t * const p_sw, - IN const uint32_t block_id, - OUT uint8_t * const p_block) +osm_switch_get_lft_block(IN const osm_switch_t * const p_sw, + IN const uint32_t block_id, + OUT uint8_t * const p_block) { uint16_t base_lid_ho; - uint16_t max_lid_ho; - uint16_t lid_ho; uint16_t block_top_lid_ho; - uint32_t lids_per_block; - osm_fwd_tbl_t *p_tbl; boolean_t return_flag = FALSE; CL_ASSERT(p_sw); CL_ASSERT(p_block); - p_tbl = osm_switch_get_fwd_tbl_ptr(p_sw); - max_lid_ho = p_sw->max_lid_ho; - lids_per_block = osm_fwd_tbl_get_lids_per_block(&p_sw->fwd_tbl); - base_lid_ho = (uint16_t) (block_id * lids_per_block); + base_lid_ho = (uint16_t) (block_id * IB_SMP_DATA_SIZE); - if (base_lid_ho <= max_lid_ho) { + if (base_lid_ho <= p_sw->max_lid_ho) { /* Initialize LIDs in block to invalid port number. */ memset(p_block, OSM_NO_PATH, IB_SMP_DATA_SIZE); /* Determine the range of LIDs we can return with this block. 
*/ block_top_lid_ho = - (uint16_t) (base_lid_ho + lids_per_block - 1); - if (block_top_lid_ho > max_lid_ho) - block_top_lid_ho = max_lid_ho; + (uint16_t) (base_lid_ho + IB_SMP_DATA_SIZE - 1); + if (block_top_lid_ho > p_sw->max_lid_ho) + block_top_lid_ho = p_sw->max_lid_ho; /* Configure the forwarding table with the routing information for the specified block of LIDs. */ - for (lid_ho = base_lid_ho; lid_ho <= block_top_lid_ho; lid_ho++) - p_block[lid_ho - base_lid_ho] = - osm_fwd_tbl_get(p_tbl, lid_ho); + memcpy(p_block, &(p_sw->lft[base_lid_ho]), + block_top_lid_ho - base_lid_ho + 1); return_flag = TRUE; } @@ -359,7 +369,7 @@ osm_switch_recommend_path(IN const osm_switch_t * const p_sw, 4. the port has min-hops to the target (avoid loops) */ if (!ignore_existing) { - port_num = osm_fwd_tbl_get(&p_sw->fwd_tbl, lid_ho); + port_num = osm_switch_get_port_by_lid(p_sw, lid_ho); if (port_num != OSM_NO_PATH) { CL_ASSERT(port_num < num_ports); diff --git a/opensm/opensm/osm_ucast_file.c b/opensm/opensm/osm_ucast_file.c index a6edf5d..865ad82 100644 --- a/opensm/opensm/osm_ucast_file.c +++ b/opensm/opensm/osm_ucast_file.c @@ -83,7 +83,7 @@ static void add_path(osm_opensm_t * p_osm, uint8_t old_port; new_lid = port_guid ? 
remap_lid(p_osm, lid, port_guid) : lid; - old_port = osm_fwd_tbl_get(osm_switch_get_fwd_tbl_ptr(p_sw), new_lid); + old_port = osm_switch_get_port_by_lid(p_sw, new_lid); if (old_port != OSM_NO_PATH && old_port != port_num) { OSM_LOG(&p_osm->log, OSM_LOG_VERBOSE, "LID collision is detected on switch " diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index c7dbade..54a2aa3 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -52,7 +52,6 @@ #include #include #include -#include /* //////////////////////////// */ /* Local types */ diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index 3bc3912..cb1f5a2 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -247,7 +247,7 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr, lid_ho, min_lid_ho, max_lid_ho); /* TODO - This should be runtime error, not a CL_ASSERT() */ - CL_ASSERT(max_lid_ho < osm_switch_get_fwd_tbl_size(p_sw)); + CL_ASSERT(max_lid_ho <= IB_LID_UCAST_END_HO); node_guid = osm_node_get_node_guid(p_sw->p_node); @@ -393,17 +393,19 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr, context.lft_context.set_method = TRUE; for (block_id_ho = 0; - osm_switch_get_fwd_tbl_block(p_sw, block_id_ho, block); + osm_switch_get_lft_block(p_sw, block_id_ho, block); block_id_ho++) { if (!p_sw->need_update && - !memcmp(block, p_sw->lft_buf + block_id_ho * 64, 64)) + !memcmp(block, + p_sw->lft_buf + block_id_ho * IB_SMP_DATA_SIZE, + IB_SMP_DATA_SIZE)) continue; OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, "Writing FT block %u\n", block_id_ho); status = osm_req_set(p_mgr->sm, p_path, - p_sw->lft_buf + block_id_ho * 64, + p_sw->lft_buf + block_id_ho * IB_SMP_DATA_SIZE, sizeof(block), IB_MAD_ATTR_LIN_FWD_TBL, cl_hton32(block_id_ho), -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Thu Oct 16 14:37:02 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 16 Oct 2008 23:37:02 +0200 
Subject: [ofa-general] [PATCH 2/2] opensm: replace switch's fwd_tbl with simple LFT - remove obsolete files Message-ID: <48F7B3FE.4000702@dev.mellanox.co.il> Remove all the fwd_tbl files that became obsolete. Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_fwd_tbl.h | 373 ------------------------------ opensm/include/opensm/osm_lin_fwd_tbl.h | 359 ---------------------------- opensm/include/opensm/osm_rand_fwd_tbl.h | 337 --------------------------- opensm/opensm/Makefile.am | 7 +- opensm/opensm/osm_fwd_tbl.c | 100 -------- opensm/opensm/osm_lin_fwd_tbl.c | 88 ------- 6 files changed, 2 insertions(+), 1262 deletions(-) delete mode 100644 opensm/include/opensm/osm_fwd_tbl.h delete mode 100644 opensm/include/opensm/osm_lin_fwd_tbl.h delete mode 100644 opensm/include/opensm/osm_rand_fwd_tbl.h delete mode 100644 opensm/opensm/osm_fwd_tbl.c delete mode 100644 opensm/opensm/osm_lin_fwd_tbl.c diff --git a/opensm/include/opensm/osm_fwd_tbl.h b/opensm/include/opensm/osm_fwd_tbl.h deleted file mode 100644 index 55e853f..0000000 --- a/opensm/include/opensm/osm_fwd_tbl.h +++ /dev/null @@ -1,373 +0,0 @@ -/* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. - * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. 
- * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - */ - -/* - * Abstract: - * Declaration of osm_fwd_tbl_t. - * This object represents a unicast forwarding table. - * This object is part of the OpenSM family of objects. - */ - -#ifndef _OSM_FWD_TBL_H_ -#define _OSM_FWD_TBL_H_ - -#include -#include -#include -#include - -#ifdef __cplusplus -# define BEGIN_C_DECLS extern "C" { -# define END_C_DECLS } -#else /* !__cplusplus */ -# define BEGIN_C_DECLS -# define END_C_DECLS -#endif /* __cplusplus */ - -BEGIN_C_DECLS -/****h* OpenSM/Forwarding Table -* NAME -* Forwarding Table -* -* DESCRIPTION -* The Forwarding Table objects encapsulate the information -* needed by the OpenSM to manage forwarding tables. The OpenSM -* allocates one Forwarding Table object per switch in the -* IBA subnet. -* -* The Forwarding Table objects are not thread safe, thus -* callers must provide serialization. -* -* AUTHOR -* Steve King, Intel -* -*********/ -/****s* OpenSM: Forwarding Table/osm_fwd_tbl_t -* NAME -* osm_fwd_tbl_t -* -* DESCRIPTION -* Forwarding Table structure. This object hides the type -* of fowarding table (linear or random) actually used by -* the switch. -* -* This object should be treated as opaque and should -* be manipulated only through the provided functions. 
-* -* SYNOPSIS -*/ -typedef struct osm_fwd_tbl { - osm_rand_fwd_tbl_t *p_rnd_tbl; - osm_lin_fwd_tbl_t *p_lin_tbl; -} osm_fwd_tbl_t; -/* -* FIELDS -* p_rnd_tbl -* Pointer to the switch's Random Forwarding Table object. -* If the switch does not use a Random Forwarding Table, -* then this pointer is NULL. -* -* p_lin_tbl -* Pointer to the switch's Linear Forwarding Table object. -* If the switch does not use a Linear Forwarding Table, -* then this pointer is NULL. -* -* SEE ALSO -* Forwarding Table object, Random Forwarding Table object. -*********/ - -/****f* OpenSM: Forwarding Table/osm_fwd_tbl_init -* NAME -* osm_fwd_tbl_init -* -* DESCRIPTION -* Initializes a Forwarding Table object. -* -* SYNOPSIS -*/ -ib_api_status_t -osm_fwd_tbl_init(IN osm_fwd_tbl_t * const p_tbl, - IN const ib_switch_info_t * const p_si); -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* p_si -* [in] Pointer to the SwitchInfo attribute of the associated -* switch. -* -* RETURN VALUE -* IB_SUCCESS if the operation is successful. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_fwd_tbl_destroy -* NAME -* osm_fwd_tbl_destroy -* -* DESCRIPTION -* Destroys a Forwarding Table object. -* -* SYNOPSIS -*/ -void osm_fwd_tbl_destroy(IN osm_fwd_tbl_t * const p_tbl); -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* RETURN VALUE -* None. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_fwd_tbl_get -* NAME -* osm_fwd_tbl_get -* -* DESCRIPTION -* Returns the port that routes the specified LID. -* -* SYNOPSIS -*/ -static inline uint8_t -osm_fwd_tbl_get(IN const osm_fwd_tbl_t * const p_tbl, IN uint16_t const lid_ho) -{ - if (p_tbl->p_lin_tbl) - return (osm_lin_fwd_tbl_get(p_tbl->p_lin_tbl, lid_ho)); - else - return (osm_rand_fwd_tbl_get(p_tbl->p_rnd_tbl, lid_ho)); -} - -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. 
-* -* lid_ho -* [in] LID (host order) for which to find the route. -* -* RETURN VALUE -* Returns the port that routes the specified LID. -* IB_INVALID_PORT_NUM if the table does not have a route for this LID. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_fwd_tbl_set -* NAME -* osm_fwd_tbl_set -* -* DESCRIPTION -* Sets the port to route the specified LID. -* -* SYNOPSIS -*/ -static inline void -osm_fwd_tbl_set(IN osm_fwd_tbl_t * const p_tbl, - IN const uint16_t lid_ho, IN const uint8_t port) -{ - CL_ASSERT(p_tbl); - if (p_tbl->p_lin_tbl) - osm_lin_fwd_tbl_set(p_tbl->p_lin_tbl, lid_ho, port); - else - osm_rand_fwd_tbl_set(p_tbl->p_rnd_tbl, lid_ho, port); -} - -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* lid_ho -* [in] LID value (host order) for which to set the route. -* -* port -* [in] Port to route the specified LID value. -* -* RETURN VALUE -* None. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_fwd_tbl_set_block -* NAME -* osm_fwd_tbl_set_block -* -* DESCRIPTION -* Copies the specified block into the Forwarding Table. -* -* SYNOPSIS -*/ -static inline ib_api_status_t -osm_fwd_tbl_set_block(IN osm_fwd_tbl_t * const p_tbl, - IN const uint8_t * const p_block, - IN const uint32_t block_num) -{ - CL_ASSERT(p_tbl); - if (p_tbl->p_lin_tbl) - return (osm_lin_fwd_tbl_set_block(p_tbl->p_lin_tbl, - p_block, block_num)); - else - return (osm_rand_fwd_tbl_set_block(p_tbl->p_rnd_tbl, - p_block, block_num)); -} - -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* RETURN VALUE -* None. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_fwd_tbl_get_size -* NAME -* osm_fwd_tbl_get_size -* -* DESCRIPTION -* Returns the number of entries available in the forwarding table. 
-* -* SYNOPSIS -*/ -static inline uint16_t -osm_fwd_tbl_get_size(IN const osm_fwd_tbl_t * const p_tbl) -{ - CL_ASSERT(p_tbl); - if (p_tbl->p_lin_tbl) - return (osm_lin_fwd_tbl_get_size(p_tbl->p_lin_tbl)); - else - return (osm_rand_fwd_tbl_get_size(p_tbl->p_rnd_tbl)); -} - -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* RETURN VALUE -* Returns the number of entries available in the forwarding table. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_fwd_tbl_get_lids_per_block -* NAME -* osm_fwd_tbl_get_lids_per_block -* -* DESCRIPTION -* Returns the number of LIDs per LID block. -* -* SYNOPSIS -*/ -static inline uint16_t -osm_fwd_tbl_get_lids_per_block(IN const osm_fwd_tbl_t * const p_tbl) -{ - CL_ASSERT(p_tbl); - if (p_tbl->p_lin_tbl) - return (osm_lin_fwd_tbl_get_lids_per_block(p_tbl->p_lin_tbl)); - else - return (osm_rand_fwd_tbl_get_lids_per_block(p_tbl->p_rnd_tbl)); -} - -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* RETURN VALUE -* Returns the number of LIDs per LID block. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_fwd_tbl_get_max_block_id_in_use -* NAME -* osm_fwd_tbl_get_max_block_id_in_use -* -* DESCRIPTION -* Returns the number of LIDs per LID block. -* -* SYNOPSIS -*/ -static inline uint16_t -osm_fwd_tbl_get_max_block_id_in_use(IN const osm_fwd_tbl_t * const p_tbl, - IN const uint16_t lid_top_ho) -{ - CL_ASSERT(p_tbl); - if (p_tbl->p_lin_tbl) - return (osm_lin_fwd_tbl_get_max_block_id_in_use - (p_tbl->p_lin_tbl, lid_top_ho)); - else - return (osm_rand_fwd_tbl_get_max_block_id_in_use - (p_tbl->p_rnd_tbl, lid_top_ho)); -} - -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* RETURN VALUE -* Returns the number of LIDs per LID block. 
-* -* NOTES -* -* SEE ALSO -*********/ - -END_C_DECLS -#endif /* _OSM_FWD_TBL_H_ */ diff --git a/opensm/include/opensm/osm_lin_fwd_tbl.h b/opensm/include/opensm/osm_lin_fwd_tbl.h deleted file mode 100644 index be3a3ee..0000000 --- a/opensm/include/opensm/osm_lin_fwd_tbl.h +++ /dev/null @@ -1,359 +0,0 @@ -/* - * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. - * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - */ - -/* - * Abstract: - * Declaration of osm_lin_fwd_tbl_t. - * This object represents a linear forwarding table. 
- * This object is part of the OpenSM family of objects. - */ - -#ifndef _OSM_LIN_FWD_TBL_H_ -#define _OSM_LIN_FWD_TBL_H_ - -#include -#include -#include - -#ifdef __cplusplus -# define BEGIN_C_DECLS extern "C" { -# define END_C_DECLS } -#else /* !__cplusplus */ -# define BEGIN_C_DECLS -# define END_C_DECLS -#endif /* __cplusplus */ - -BEGIN_C_DECLS -/****h* OpenSM/Linear Forwarding Table -* NAME -* Linear Forwarding Table -* -* DESCRIPTION -* The Linear Forwarding Table objects encapsulate the information -* needed by the OpenSM to manage linear forwarding tables. The OpenSM -* allocates one Linear Forwarding Table object per switch in the -* IBA subnet, if that switch uses a linear table. -* -* The Linear Forwarding Table objects are not thread safe, thus -* callers must provide serialization. -* -* AUTHOR -* Steve King, Intel -* -*********/ -/****s* OpenSM: Forwarding Table/osm_lin_fwd_tbl_t -* NAME -* osm_lin_fwd_tbl_t -* -* DESCRIPTION -* Linear Forwarding Table structure. -* -* Callers may directly access this object. -* -* SYNOPSIS -*/ -typedef struct osm_lin_fwd_tbl { - uint16_t size; - uint8_t port_tbl[1]; -} osm_lin_fwd_tbl_t; -/* -* FIELDS -* Size -* Number of entries in the linear forwarding table. This value -* is taken from the SwitchInfo attribute. -* -* port_tbl -* The array that specifies the port number which routes the -* corresponding LID. Index is by LID. -* -* SEE ALSO -* Forwarding Table object, Random Forwarding Table object. -*********/ - -/****f* OpenSM: Forwarding Table/osm_lin_tbl_new -* NAME -* osm_lin_tbl_new -* -* DESCRIPTION -* This function creates and initializes a Linear Forwarding Table object. -* -* SYNOPSIS -*/ -osm_lin_fwd_tbl_t *osm_lin_tbl_new(IN uint16_t const size); -/* -* PARAMETERS -* size -* [in] Number of entries in the Linear Forwarding Table. -* -* RETURN VALUE -* On success, returns a pointer to a new Linear Forwarding Table object -* of the specified size. -* NULL otherwise. 
-* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_lin_tbl_delete -* NAME -* osm_lin_tbl_delete -* -* DESCRIPTION -* This destroys and deallocates a Linear Forwarding Table object. -* -* SYNOPSIS -*/ -void osm_lin_tbl_delete(IN osm_lin_fwd_tbl_t ** const pp_tbl); -/* -* PARAMETERS -* pp_tbl -* [in] Pointer a Pointer to the Linear Forwarding Table object. -* -* RETURN VALUE -* On success, returns a pointer to a new Linear Forwarding Table object -* of the specified size. -* NULL otherwise. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_lin_fwd_tbl_set -* NAME -* osm_lin_fwd_tbl_set -* -* DESCRIPTION -* Sets the port to route the specified LID. -* -* SYNOPSIS -*/ -static inline void -osm_lin_fwd_tbl_set(IN osm_lin_fwd_tbl_t * const p_tbl, - IN const uint16_t lid_ho, IN const uint8_t port) -{ - CL_ASSERT(lid_ho < p_tbl->size); - if (lid_ho < p_tbl->size) - p_tbl->port_tbl[lid_ho] = port; -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Linear Forwarding Table object. -* -* lid_ho -* [in] LID value (host order) for which to set the route. -* -* port -* [in] Port to route the specified LID value. -* -* RETURN VALUE -* None. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_lin_fwd_tbl_get -* NAME -* osm_lin_fwd_tbl_get -* -* DESCRIPTION -* Returns the port that routes the specified LID. -* -* SYNOPSIS -*/ -static inline uint8_t -osm_lin_fwd_tbl_get(IN const osm_lin_fwd_tbl_t * const p_tbl, - IN const uint16_t lid_ho) -{ - if (lid_ho < p_tbl->size) - return (p_tbl->port_tbl[lid_ho]); - else - return (OSM_NO_PATH); -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Linear Forwarding Table object. -* -* lid_ho -* [in] LID value (host order) for which to get the route. -* -* RETURN VALUE -* Returns the port that routes the specified LID. 
-* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_lin_fwd_tbl_get_size -* NAME -* osm_lin_fwd_tbl_get_size -* -* DESCRIPTION -* Returns the number of entries available in the forwarding table. -* -* SYNOPSIS -*/ -static inline uint16_t -osm_lin_fwd_tbl_get_size(IN const osm_lin_fwd_tbl_t * const p_tbl) -{ - return (p_tbl->size); -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* RETURN VALUE -* Returns the number of entries available in the forwarding table. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_lin_fwd_tbl_get_lids_per_block -* NAME -* osm_lin_fwd_tbl_get_lids_per_block -* -* DESCRIPTION -* Returns the number of LIDs per LID block. -* -* SYNOPSIS -*/ -static inline uint16_t -osm_lin_fwd_tbl_get_lids_per_block(IN const osm_lin_fwd_tbl_t * const p_tbl) -{ - UNUSED_PARAM(p_tbl); - return (64); -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* RETURN VALUE -* Returns the number of LIDs per LID block. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_lin_fwd_tbl_get_max_block_id_in_use -* NAME -* osm_lin_fwd_tbl_get_max_block_id_in_use -* -* DESCRIPTION -* Returns the maximum block ID in actual use by the forwarding table. -* -* SYNOPSIS -*/ -static inline uint16_t -osm_lin_fwd_tbl_get_max_block_id_in_use(IN const osm_lin_fwd_tbl_t * - const p_tbl, - IN const uint16_t lid_top_ho) -{ - return ((uint16_t) (lid_top_ho / - osm_lin_fwd_tbl_get_lids_per_block(p_tbl))); -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* RETURN VALUE -* Returns the maximum block ID in actual use by the forwarding table. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_lin_fwd_tbl_set_block -* NAME -* osm_lin_fwd_tbl_set_block -* -* DESCRIPTION -* Copies the specified block into the Linear Forwarding Table. 
-* -* SYNOPSIS -*/ -static inline ib_api_status_t -osm_lin_fwd_tbl_set_block(IN osm_lin_fwd_tbl_t * const p_tbl, - IN const uint8_t * const p_block, - IN const uint32_t block_num) -{ - uint16_t lid_start; - uint16_t num_lids; - - CL_ASSERT(p_tbl); - CL_ASSERT(p_block); - - num_lids = osm_lin_fwd_tbl_get_lids_per_block(p_tbl); - lid_start = (uint16_t) (block_num * num_lids); - - if (lid_start + num_lids > p_tbl->size) - return (IB_INVALID_PARAMETER); - - memcpy(&p_tbl->port_tbl[lid_start], p_block, num_lids); - return (IB_SUCCESS); -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Linear Forwarding Table object. -* -* p_block -* [in] Pointer to the Forwarding Table block. -* -* block_num -* [in] Block number of this block. -* -* RETURN VALUE -* None. -* -* NOTES -* -* SEE ALSO -*********/ - -END_C_DECLS -#endif /* _OSM_LIN_FWD_TBL_H_ */ diff --git a/opensm/include/opensm/osm_rand_fwd_tbl.h b/opensm/include/opensm/osm_rand_fwd_tbl.h deleted file mode 100644 index 31098b9..0000000 --- a/opensm/include/opensm/osm_rand_fwd_tbl.h +++ /dev/null @@ -1,337 +0,0 @@ -/* - * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. - * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. 
- * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - */ - -/* - * Abstract: - * Declaration of osm_rand_fwd_tbl_t. - * This object represents a random forwarding table. - * This object is part of the OpenSM family of objects. - */ - -#ifndef _OSM_RAND_FWD_TBL_H_ -#define _OSM_RAND_FWD_TBL_H_ - -#include -#include -#include - -#ifdef __cplusplus -# define BEGIN_C_DECLS extern "C" { -# define END_C_DECLS } -#else /* !__cplusplus */ -# define BEGIN_C_DECLS -# define END_C_DECLS -#endif /* __cplusplus */ - -BEGIN_C_DECLS -/****h* OpenSM/Random Forwarding Table -* NAME -* Random Forwarding Table -* -* DESCRIPTION -* The Random Forwarding Table objects encapsulate the information -* needed by the OpenSM to manage random forwarding tables. The OpenSM -* allocates one Random Forwarding Table object per switch in the -* IBA subnet, if that switch uses a random forwarding table. -* -* The Random Forwarding Table objects are not thread safe, thus -* callers must provide serialization. -* -* ** RANDOM FORWARDING TABLES ARE NOT SUPPORTED IN THE CURRENT VERSION ** -* -* AUTHOR -* Steve King, Intel -* -*********/ -/****s* OpenSM: Forwarding Table/osm_rand_fwd_tbl_t -* NAME -* osm_rand_fwd_tbl_t -* -* DESCRIPTION -* Random Forwarding Table structure. -* -* THIS OBJECT IS PLACE HOLDER. 
SUPPORT FOR SWITCHES WITH -* RANDOM FORWARDING TABLES HAS NOT BEEN IMPLEMENTED YET. -* -* SYNOPSIS -*/ -typedef struct osm_rand_fwd_tbl { - /* PLACE HOLDER STRUCTURE ONLY!! */ - uint32_t size; -} osm_rand_fwd_tbl_t; -/* -* FIELDS -* RANDOM FORWARDING TABLES ARE NOT SUPPORTED YET!! -* -* SEE ALSO -* Forwarding Table object, Random Forwarding Table object. -*********/ - -/****f* OpenSM: Forwarding Table/osm_rand_tbl_delete -* NAME -* osm_rand_tbl_delete -* -* DESCRIPTION -* This destroys and deallocates a Random Forwarding Table object. -* -* SYNOPSIS -*/ -static inline void osm_rand_tbl_delete(IN osm_rand_fwd_tbl_t ** const pp_tbl) -{ - /* - TO DO - This is a place holder function only! - */ - free(*pp_tbl); - *pp_tbl = NULL; -} -/* -* PARAMETERS -* pp_tbl -* [in] Pointer a Pointer to the Random Forwarding Table object. -* -* RETURN VALUE -* On success, returns a pointer to a new Random Forwarding Table object -* of the specified size. -* NULL otherwise. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_rand_fwd_tbl_set -* NAME -* osm_rand_fwd_tbl_set -* -* DESCRIPTION -* Sets the port to route the specified LID. -* -* SYNOPSIS -*/ -static inline void -osm_rand_fwd_tbl_set(IN osm_rand_fwd_tbl_t * const p_tbl, - IN const uint16_t lid_ho, IN const uint8_t port) -{ - /* Random forwarding tables not supported yet. */ - UNUSED_PARAM(p_tbl); - UNUSED_PARAM(lid_ho); - UNUSED_PARAM(port); - CL_ASSERT(FALSE); -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Random Forwarding Table object. -* -* lid_ho -* [in] LID value (host order) for which to set the route. -* -* port -* [in] Port to route the specified LID value. -* -* RETURN VALUE -* None. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_rand_fwd_tbl_set_block -* NAME -* osm_rand_fwd_tbl_set_block -* -* DESCRIPTION -* Copies the specified block into the Random Forwarding Table. 
-* -* SYNOPSIS -*/ -static inline ib_api_status_t -osm_rand_fwd_tbl_set_block(IN osm_rand_fwd_tbl_t * const p_tbl, - IN const uint8_t * const p_block, - IN const uint32_t block_num) -{ - /* Random forwarding tables not supported yet. */ - UNUSED_PARAM(p_tbl); - UNUSED_PARAM(p_block); - UNUSED_PARAM(block_num); - CL_ASSERT(FALSE); - return (IB_ERROR); -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Random Forwarding Table object. -* -* p_block -* [in] Pointer to the Forwarding Table block. -* -* block_num -* [in] Block number of this block. -* -* RETURN VALUE -* None. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_rand_fwd_tbl_get -* NAME -* osm_rand_fwd_tbl_get -* -* DESCRIPTION -* Returns the port that routes the specified LID. -* -* SYNOPSIS -*/ -static inline uint8_t -osm_rand_fwd_tbl_get(IN const osm_rand_fwd_tbl_t * const p_tbl, - IN const uint16_t lid_ho) -{ - CL_ASSERT(FALSE); - UNUSED_PARAM(p_tbl); - UNUSED_PARAM(lid_ho); - - return (OSM_NO_PATH); -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Random Forwarding Table object. -* -* lid_ho -* [in] LID value (host order) for which to get the route. -* -* RETURN VALUE -* Returns the port that routes the specified LID. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_rand_fwd_tbl_get_lids_per_block -* NAME -* osm_rand_fwd_tbl_get_lids_per_block -* -* DESCRIPTION -* Returns the number of LIDs per LID block. -* -* SYNOPSIS -*/ -static inline uint16_t -osm_rand_fwd_tbl_get_lids_per_block(IN const osm_rand_fwd_tbl_t * const p_tbl) -{ - UNUSED_PARAM(p_tbl); - return (16); -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* RETURN VALUE -* Returns the number of LIDs per LID block. 
-* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_rand_fwd_tbl_get_max_block_id_in_use -* NAME -* osm_rand_fwd_tbl_get_max_block_id_in_use -* -* DESCRIPTION -* Returns the maximum block ID in actual use by the forwarding table. -* -* SYNOPSIS -*/ -static inline uint16_t -osm_rand_fwd_tbl_get_max_block_id_in_use(IN const osm_rand_fwd_tbl_t * - const p_tbl, - IN const uint16_t lid_top_ho) -{ - UNUSED_PARAM(p_tbl); - UNUSED_PARAM(lid_top_ho); - CL_ASSERT(FALSE); - return (0); -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* RETURN VALUE -* Returns the maximum block ID in actual use by the forwarding table. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_rand_fwd_tbl_get_size -* NAME -* osm_rand_fwd_tbl_get_size -* -* DESCRIPTION -* Returns the number of entries available in the forwarding table. -* -* SYNOPSIS -*/ -static inline uint16_t -osm_rand_fwd_tbl_get_size(IN const osm_rand_fwd_tbl_t * const p_tbl) -{ - UNUSED_PARAM(p_tbl); - CL_ASSERT(FALSE); - return (0); -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* RETURN VALUE -* Returns the number of entries available in the forwarding table. 
-* -* NOTES -* -* SEE ALSO -*********/ - -END_C_DECLS -#endif /* _OSM_RAND_FWD_TBL_H_ */ diff --git a/opensm/opensm/Makefile.am b/opensm/opensm/Makefile.am index 1d345a5..01573d2 100644 --- a/opensm/opensm/Makefile.am +++ b/opensm/opensm/Makefile.am @@ -27,9 +27,9 @@ libopensm_la_DEPENDENCIES = $(srcdir)/libopensm.map sbin_PROGRAMS = opensm opensm_DEPENDENCIES = libopensm.la opensm_SOURCES = main.c osm_console_io.c osm_console.c osm_db_files.c \ - osm_db_pack.c osm_drop_mgr.c osm_fwd_tbl.c \ + osm_db_pack.c osm_drop_mgr.c \ osm_inform.c osm_lid_mgr.c osm_lin_fwd_rcv.c \ - osm_lin_fwd_tbl.c osm_link_mgr.c osm_mcast_fwd_rcv.c \ + osm_link_mgr.c osm_mcast_fwd_rcv.c \ osm_mcast_mgr.c osm_mcast_tbl.c osm_mcm_info.c \ osm_mcm_port.c osm_mtree.c osm_multicast.c osm_node.c \ osm_node_desc_rcv.c osm_node_info_rcv.c \ @@ -74,11 +74,9 @@ opensminclude_HEADERS = \ $(srcdir)/../include/opensm/osm_db_pack.h \ $(srcdir)/../include/opensm/osm_event_plugin.h \ $(srcdir)/../include/opensm/osm_errors.h \ - $(srcdir)/../include/opensm/osm_fwd_tbl.h \ $(srcdir)/../include/opensm/osm_helper.h \ $(srcdir)/../include/opensm/osm_inform.h \ $(srcdir)/../include/opensm/osm_lid_mgr.h \ - $(srcdir)/../include/opensm/osm_lin_fwd_tbl.h \ $(srcdir)/../include/opensm/osm_log.h \ $(srcdir)/../include/opensm/osm_mad_pool.h \ $(srcdir)/../include/opensm/osm_madw.h \ @@ -100,7 +98,6 @@ opensminclude_HEADERS = \ $(srcdir)/../include/opensm/osm_port_profile.h \ $(srcdir)/../include/opensm/osm_prefix_route.h \ $(srcdir)/../include/opensm/osm_qos_policy.h \ - $(srcdir)/../include/opensm/osm_rand_fwd_tbl.h \ $(srcdir)/../include/opensm/osm_remote_sm.h \ $(srcdir)/../include/opensm/osm_router.h \ $(srcdir)/../include/opensm/osm_sa.h \ diff --git a/opensm/opensm/osm_fwd_tbl.c b/opensm/opensm/osm_fwd_tbl.c deleted file mode 100644 index 2ea74af..0000000 --- a/opensm/opensm/osm_fwd_tbl.c +++ /dev/null @@ -1,100 +0,0 @@ -/* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. 
- * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. - * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - */ - -/* - * Abstract: - * Implementation of osm_fwd_tbl_t. - * This object represents a unicast forwarding table. - * This object is part of the opensm family of objects. 
- */ - -#if HAVE_CONFIG_H -# include -#endif /* HAVE_CONFIG_H */ - -#include -#include -#include - -/********************************************************************** - **********************************************************************/ -ib_api_status_t -osm_fwd_tbl_init(IN osm_fwd_tbl_t * const p_tbl, - IN const ib_switch_info_t * const p_si) -{ - uint16_t tbl_cap; - ib_api_status_t status = IB_SUCCESS; - - /* - Determine the type and size of the forwarding table - used by this switch, then initialize accordingly. - The current implementation only supports switches - with linear forwarding tables. - */ - tbl_cap = cl_ntoh16(p_si->lin_cap); - - if (tbl_cap == 0) { - /* - This switch does not support linear forwarding - tables. Error out for now. - */ - status = IB_UNSUPPORTED; - goto Exit; - } - - p_tbl->p_rnd_tbl = NULL; - - p_tbl->p_lin_tbl = osm_lin_tbl_new(tbl_cap); - - if (p_tbl->p_lin_tbl == NULL) { - status = IB_INSUFFICIENT_MEMORY; - goto Exit; - } - -Exit: - return (status); -} - -/********************************************************************** - **********************************************************************/ -void osm_fwd_tbl_destroy(IN osm_fwd_tbl_t * const p_tbl) -{ - if (p_tbl->p_lin_tbl) { - CL_ASSERT(p_tbl->p_rnd_tbl == NULL); - osm_lin_tbl_delete(&p_tbl->p_lin_tbl); - } else { - osm_rand_tbl_delete(&p_tbl->p_rnd_tbl); - } -} diff --git a/opensm/opensm/osm_lin_fwd_tbl.c b/opensm/opensm/osm_lin_fwd_tbl.c deleted file mode 100644 index 7d1eeff..0000000 --- a/opensm/opensm/osm_lin_fwd_tbl.c +++ /dev/null @@ -1,88 +0,0 @@ -/* - * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. - * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. 
You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - */ - -/* - * Abstract: - * Implementation of osm_lin_fwd_tbl_t. - * This object represents an linear forwarding table. - * This object is part of the opensm family of objects. 
- */ - -#if HAVE_CONFIG_H -# include -#endif /* HAVE_CONFIG_H */ - -#include -#include -#include -#include -#include - -static inline size_t __osm_lin_tbl_compute_obj_size(IN const uint16_t num_lids) -{ - return (sizeof(osm_lin_fwd_tbl_t) + (num_lids - 1)); -} - -/********************************************************************** - **********************************************************************/ -osm_lin_fwd_tbl_t *osm_lin_tbl_new(IN uint16_t const size) -{ - osm_lin_fwd_tbl_t *p_tbl; - - /* - The capacity reported by the switch includes LID 0, - so add 1 to the end of the range here for this assert. - */ - CL_ASSERT(size <= IB_LID_UCAST_END_HO + 1); - p_tbl = - (osm_lin_fwd_tbl_t *) malloc(__osm_lin_tbl_compute_obj_size(size)); - - /* - Initialize the table to OSM_NO_PATH, which means "invalid port" - */ - if (p_tbl != NULL) { - memset(p_tbl, OSM_NO_PATH, __osm_lin_tbl_compute_obj_size(size)); - p_tbl->size = size; - } - return (p_tbl); -} - -/********************************************************************** - **********************************************************************/ -void osm_lin_tbl_delete(IN osm_lin_fwd_tbl_t ** const pp_tbl) -{ - free(*pp_tbl); - *pp_tbl = NULL; -} -- 1.5.1.4 From rdreier at cisco.com Thu Oct 16 15:35:36 2008 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 16 Oct 2008 15:35:36 -0700 Subject: [ofa-general] problems with epoll and ipoib (memory leak and cpu creep) In-Reply-To: <48F777FE.7080601@tradeworx.com> (murray smigel's message of "Thu, 16 Oct 2008 13:21:02 -0400") References: <48F777FE.7080601@tradeworx.com> Message-ID: > When we run the application using poll, things work fine (except for > the occasional dropping of packets due to the large set of fds passed > to poll). To try to remedy the problem, we switched to epoll. Now, > as the program runs, the cpu utilization rises over time towards 100% > and the memory usage grows as well. 
I think the first thing I would check is that your program is using epoll correctly. It is really hard to see a way that ethernet vs. ipoib could affect epoll -- with both ethernet and ipoib, the IP packets go through the whole network/socket stack before getting to any poll or epoll handling (and both poll and epoll use the same file method anyway).

Do you have any way to find out where the growth in memory use is coming from? Do you mean that your application is allocating more and more memory, or is it more like a kernel memory leak?

 - R.

From nab at linux-iscsi.org Thu Oct 16 16:11:13 2008
From: nab at linux-iscsi.org (Nicholas A. Bellinger)
Date: Thu, 16 Oct 2008 16:11:13 -0700
Subject: [ofa-general] [ANNOUNCE]: LIO-Target/ConfigFS for v2.6.27
Message-ID: <1224198674.5556.255.camel@haakon2.linux-iscsi.org>

Greetings all,

I am happy to announce the release of LIO-Target/ConfigFS for v2.6.27.

http://linux-iscsi.org/index.php/LIO-Target/ConfigFS

Using ConfigFS it is now possible to create generic storage objects from Linux/SCSI using struct scsi_device, Linux/BLOCK using struct block_device and Linux/VFS using struct file into a generic target infrastructure under /sys/kernel/config/target/core (target_core_mod). From there, one can create ConfigFS symbolic links to destinations under /sys/kernel/config/target/iscsi/ (iscsi_target_mod). The idea is that any $FABRIC module can access the same storage objects under the generic target_core_mod configfs infrastructure.

I have primarily been working on completing the configfs conversion for iscsi_target_mod, and as that work wraps up, I will next be focusing on target_core_mod and the generic kernel target infrastructure in Linux. This involves quite a few different things, merging existing in-tree STGT code into it so it can use /sys/kernel/config/target/core, merging out of tree SCST code for multiple existing fabric modules and hardware target mode drivers, the Target API, etc.
This work will definitely take lots of time and effort, but the end result will be a complete target infrastructure under Linux that all manner of folks can use and take advantage of and to build upon.

Thanks to the many folks who have made suggestions and comments in order to get packets across the network and blocks down to storage up as quickly as possible. It was only ~7 weeks ago that Ming Zhang made the recommendation to start using configfs, and now 6k new lines of code in target_core_config.[c,h] and iscsi_target_configfs.[c,h].. Wow..

As work continues on both fronts: making iscsi_target_mod upstream ready, and creating a generic target engine using configfs for iscsi_target_mod and *ALL* fabric modules, the Linux-iSCSI.org team invites anyone who is interested to start looking at the code. You can do this online at:

http://git.kernel.org/?p=linux/kernel/git/nab/lio-core-2.6.git;a=tree;f=drivers/lio-core;hb=HEAD

or check out your own git tree and start posting patches to the LIO-Target development list. Instructions for git clone can be found on:

http://linux-iscsi.org/index.php/LIO-Target

Please feel free to ask questions!

--nab

From weiny2 at llnl.gov Thu Oct 16 16:27:34 2008
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 16 Oct 2008 16:27:34 -0700
Subject: [ofa-general] Question about previous post: "IPoIB: keep ib_wc[] on stack"
Message-ID: <20081016162734.286d64f9.weiny2@llnl.gov>

Arthur,

It seems we have hit the bug you describe in this thread:

http://lists.openfabrics.org/pipermail/general/2008-May/050196.html

I was wondering what, if anything, you did to resolve this? The last post indicates you are going to look for skb corruption.

http://lists.openfabrics.org/pipermail/general/2008-May/050220.html

We are running RHEL 5.2 which has OFED 1.3 in the kernel. We are getting all kinds of seemingly related skb/ipoib crashes but we are unable at this point to figure out what might be causing skb corruption.
Thanks in advance,
Ira

From dotanba at gmail.com Fri Oct 17 06:56:06 2008
From: dotanba at gmail.com (Dotan Barak)
Date: Fri, 17 Oct 2008 15:56:06 +0200
Subject: ***SPAM*** Re: [ofa-general] Querying the number of open queue pairs
In-Reply-To:
References:
Message-ID: <48F89976.6020106@gmail.com>

Krishnamoorthy, Sriram wrote:
>
> I am trying to figure out how many more queue pairs can be created on
> a device at some point (say after MPI has been initialized).
> ibv_query_device() returns, among other things, the maximum number of
> queue pairs that can be created. In OFED 1.3, is there a way to query
> the number of queue pairs that have been created so far, on a device?
>
There isn't any way to get the number of QPs (or actually, any other resource) that can further be created in a device.

Dotan
>
> Thanks,
> Sriram.K
>
>
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From vlad at lists.openfabrics.org Fri Oct 17 03:18:09 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri, 17 Oct 2008 03:18:09 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20081017-0200 daily build status
Message-ID: <20081017101809.D9767E60DBE@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:

From svenar at simula.no Fri Oct 17 04:37:49 2008
From: svenar at simula.no (Sven-Arne Reinemo)
Date: Fri, 17 Oct 2008 13:37:49 +0200
Subject: [ofa-general] Questions Concerning a 3D Torus
In-Reply-To: <1224110168.28670.66.camel@localhost.localdomain>
References: <1224110168.28670.66.camel@localhost.localdomain>
Message-ID: <1224243469.22130.58.camel@zaltys.simula.no>

Hi Matthew,

I have tried to give some answers/comments on your points below. I hope it is somewhat useful.
Regards, Sven-Arne On Wed, 2008-10-15 at 16:36 -0600, Matthew Bohnsack wrote: > Hello, > > I have a number of questions related to the construction and operation > of a 3D torus with Infiniband. > > We would like to create an IB network arranged as a 3D mesh with > wrap-around links. That is, a 3D torus. Each vertex of this torus > would be an Infiniscale IV (I4), having single 4x connections to up to > 12 host ConnectX-based HCAs and 3-4x connections to its neighbors in > each of six dimensions. The smallest network that illustrates the most > interesting aspects of this setup is a 3x3x3 torus. I've created a > basic illustration of this here. Pick your favorite file format: > > http://bohnsack.com/3DTorus.svg > http://bohnsack.com/3DTorus.pdf > > In this diagram, each square is a single I4, and each line represents 3 > independent 4x connections. To avoid making the diagram too > complicated, host connections are only shown for a single switch. You > should imagine 12 hosts hanging off of each square. Again, to avoid an > unreadable diagram, not all of the Y dimension wrap-around links are > shown, but you should consider them present for the purposes of the > network I'm describing. > > Questions: > > 1) What do you call the topology I'm describing, strictly speaking? > It's kind of like each switch chip vertex has a sub-graph connected to > all the host HCAs. Perhaps this thing is a "decorated 3D Torus"? I would call it a 3d torus (or a 3-ary 3-cube) ignoring the fact that there is more than one endnode connected to each switch. Adding more nodes impacts performance, but does not change the routing properties of the topology. > > 2) I think this network is a little bit different than the 3D tori that > have been previously deployed in machines like Red Storm where there is > a network switch for one or at maximum a very few compute clients. 
Does > the fact that there are order 10 times more hosts hanging off of each > switch chip vertex in the network I'm describing matter from a routing > perspective? It seems that the routing problem is mostly the same. > I.e., algorithms to determine a set of deadlock-free routes on the same > basic topology, ignoring the "decorations", are similar. Is this right? > Yes. As long as you are able to set up deadlock-free routes between the switches it is trivial to add routes to the endnodes connected to the switches. The number of endnodes does not matter from a routing perspective (apart from balancing or over/under subscription). > 3) Is there good current support for computing deadlock-free routes on > the network I'm describing in OFED 1.3.1, 1.4, or other? With which > routing algorithm? I tried to find an answer to this, by looking through > various OFED documentation, but I still have a bit of confusion on how > to proceed. Here's the data that I was able to gather. Can someone > please help to clarify? As far as I know LASH would be your best option. > > - An OpenSM wiki page says that OpenSM supports "Torus routing": > https://wiki.openfabrics.org/tiki-index.php?page=OpenSM&highlight=opensm > - However, the latest release notes I can find don't make any explicit > mention of tori: > http://www.openfabrics.org/git/?p=~sashak/management.git;a=blob_plain;f=opensm/doc/opensm_release_notes-3.2.txt;hb=HEAD > - The release notes do mention Dimension Order routing (DOR), and this > might work for a 3D torus, but it seems (per the notes) that this > algorithm, as implemented, is only considered deadlock-free for meshes > (no wrap-around) and hypercubes - no tori. I understand that when you > "wrap around" the dimension, the virtual channel used needs to change to > avoid deadlock in DOR, and DOR as implemented today doesn't do that. I believe that is correct. You would need e-cube routing and 2 virtual lanes for deadlock-free shortest path routing on a torus.
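To make the "change the virtual lane when you wrap around" idea concrete, here is a minimal sketch for a single ring (one dimension of a torus). It is only an illustration under stated assumptions: the names `struct hop` and `ring_route` are hypothetical, the code is not taken from OpenSM, and on real InfiniBand hardware the lane change would have to be expressed through SL2VL mapping rather than chosen per hop like this. The route takes the shorter direction around the ring, stays on VL 0 until it uses the wrap-around link (between switch k-1 and switch 0), and uses VL 1 from that link onwards.

```c
#include <assert.h>

/* One hop of a route: the switches it connects and the VL it travels on. */
struct hop {
    int from;
    int to;
    int vl;
};

/*
 * Route src -> dst on a ring of k switches (hypothetical helper).
 * Takes the shorter direction and switches from VL 0 to VL 1 on the
 * wrap-around link, so neither lane's channel dependencies form a cycle.
 * 'hops' must have room for at least k/2 entries.
 * Returns the number of hops written.
 */
static int ring_route(int k, int src, int dst, struct hop *hops)
{
    int fwd, step, cur, vl = 0, n = 0;

    assert(k > 1 && src >= 0 && src < k && dst >= 0 && dst < k);

    fwd = (dst - src + k) % k;          /* hop count in the + direction */
    step = (fwd <= k - fwd) ? 1 : -1;   /* pick the shorter direction */

    for (cur = src; cur != dst; n++) {
        int next = (cur + step + k) % k;

        /* Wrap-around link: k-1 -> 0 going up, 0 -> k-1 going down. */
        if ((step == 1 && next == 0) || (step == -1 && cur == 0))
            vl = 1;

        hops[n].from = cur;
        hops[n].to = next;
        hops[n].vl = vl;
        cur = next;
    }
    return n;
}
```

On a 4-switch ring, routing from switch 3 to switch 1 gives two hops (3 to 0, then 0 to 1), both on VL 1, while routing from switch 0 to switch 1 is a single hop on VL 0.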
e-cube routing is hard to implement in IB because you would have to do SL2VL trickery to be able to change virtual lanes (afaik). Moreover, e-cube routing is not robust against faults so a single fault would make large parts of the network unreachable. > - Commit b204932d5bd2a88af5ce0989d2dff65d753b3d54 from > git://git.openfabrics.org/~sashak/management.git in March of 2007 > mentions some degree of success with LASH on 2D tori, but it's > considered "unoptimized". Would this work deadlock-free for a 3D torus? > What's the implication of "unoptimized" on something like an 8x8x8 torus > with lots of hosts at each vertex? > - I didn't see any other mention of a torus or tori in the OpenSM commit > logs. LASH is deadlock free on a 3d torus or any other topology, but the number of virtual lanes required depends on the topology. If the cabling and port numbering is consistent (0=W, 1=E, 2=N, 3=S) LASH requires 1 VL for 2d meshes of any size (equals dimension order routing) and 3 VLs for 2d tori of any size. For 3d tori it requires 5 VLs. So far, I only have experience with LASH from simulations so there might be other issues on real hardware. Maybe other users on the list have some experience with LASH on real hardware? > 8) Is there an existing utility inside of OFED that can be used to > verify that routes generated by the SM are deadlock free? I.e., can I > dump routes from OpenSM and then run a utility on them that can identify > potentials for deadlock? I think you can use "ibdiagnet" for some routing algorithms, but not for LASH since it uses layers to avoid deadlock and this confuses "ibdiagnet". > > 9) I'm aware of ibsim, and others I'm collaborating with have done the > first level of testing with it on 3D tori. My question is how much more > simulation can I do on a "mock" topology with freely-available tools? > Am I limited to MAD traffic, or is there a way to simulate a real > workload (perhaps MPI traffic)? How about commercial tools? 
There is IB packet-level simulation support in OMNeT++, check http://www.omnetpp.org/ . Never tried it myself though. > > 10) As I previously mentioned, there are 3 4x links running between each > switch chip, along each dimension. Our plan is to run these as 3 > independent links, but it should be possible to logically aggregate them > into a single 12x link. At first glance, a 12x link might be subject to > less congestion but it could also incur store-and-forward penalties > producing unwanted latency as 4x flows are converted to 12x and then > back to 4x again. My questions around this topic: Is any additional > insight into this issue available (theoretical or empirical)? Should we > even worry about testing 12x connections? I do not know of any such work. There might be a benefit to running with 3 4x links from a fault-tolerance perspective, but I don't know for sure... > > 11) An artifact that results from the way we intend to connect the I4 > switches together is that there's a possibility of having 9-4x > connections between every other link in one dimension. E.g., looking in > the Z dimension, switch-to-switch "widths" could alternate between > 9-4x and 3-4x links. Links in other dimensions would all be limited to > 3-4x as before. The end result of this configuration would be that > certain groups of two I4s and their connected hosts along one dimension > would have a little bit better connectivity than other I4 groupings. > This kind of configuration in our setup would be "free" by simply doing > some logical configuration. This change may or may not be beneficial > depending on our applications and workloads. I'm not so concerned about > that issue at present. My main question today is whether this kind of > "heterogeneity" would cause routing issues/complications. Would it? If > not, it might be a no-brainer to enable it. What do you think?
I believe this should be possible without complications, but it is a special case that the routing algorithm would have to be made aware of. > > A diagram of the heterogeneous 3D torus I'm talking about is available > here: > > http://bohnsack.com/3DTorus-heterogeneous-Z.svg > http://bohnsack.com/3DTorus-heterogeneous-Z.pdf > > > Thanks in advance for your help, > > -Matthew From nicolas.morey-chaisemartin at ext.bull.net Fri Oct 17 06:23:30 2008 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Fri, 17 Oct 2008 15:23:30 +0200 Subject: [ofa-general] Bug with SDP on IA64 Message-ID: <48F891D2.4010502@ext.bull.net> Hi, I am stuck with a bug from ofa-kernel 1.3.1 on an IA64 running a Bull 2.6.18 kernel. When doing SDP transfers from an IA64 to any other host (IA64, x86, x86_64) through ttcp, I get this message: [root at h2 ~]# LD_PRELOAD=/usr/lib/libsdp.so.1 ~/ttcp/ttcp -t -s 192.168.0.10 ttcp-t: buflen=8192, nbuf=2048, align=16384/0, port=5001 tcp -> 192.168.0.10 ttcp-t: socket ttcp-t: tcp_maxseg ttcp-t: connect ttcp-t: IO: Connection reset by peer errno=104 [root at h2 ~]# And the same error on the other side. I activated the debug mode for the sdp module and found out that on the receiver side a completion error 1 shows up: Oct 16 12:40:43 s_kernel at yack0 kernel: sdp_sock(5001:36814): Recv completion with error. Status 1 Oct 16 12:40:43 s_kernel at yack0 kernel: sdp_sock(5001:36814): sdp_reset state=1 Oct 16 12:40:44 s_kernel at yack0 kernel: sdp_sock(5001:36814): sdp_cma_handler event 10 id 0000010425120600 Oct 16 12:40:44 s_kernel at yack0 kernel: sdp_sock(5001:36814): RDMA_CM_EVENT_DISCONNECTED The error triggers a socket reset, which terminates the connection.
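The "Status 1" in the log above is the raw numeric work-completion status. When reading such logs it can help to translate the number into its symbolic name. The following is a minimal sketch of such a lookup, assuming the numbers follow the first entries of the kernel's `enum ib_wc_status`; the table and the helpers `wc_status_name` and `wc_status_from_name` are hypothetical illustrations, not part of the sdp module or any OFED API, and only the first six values are listed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative local table (not a kernel API): index == status value. */
static const char *const wc_status_names[] = {
    "IB_WC_SUCCESS",        /* 0 */
    "IB_WC_LOC_LEN_ERR",    /* 1: local length error */
    "IB_WC_LOC_QP_OP_ERR",  /* 2 */
    "IB_WC_LOC_EEC_OP_ERR", /* 3 */
    "IB_WC_LOC_PROT_ERR",   /* 4 */
    "IB_WC_WR_FLUSH_ERR",   /* 5 */
};

#define WC_STATUS_COUNT (sizeof(wc_status_names) / sizeof(wc_status_names[0]))

/* Map a numeric status from a log line to its name, or "unknown". */
static const char *wc_status_name(int status)
{
    if (status < 0 || (size_t)status >= WC_STATUS_COUNT)
        return "unknown";
    return wc_status_names[status];
}

/* Reverse lookup: name -> numeric status, or -1 if not in the table. */
static int wc_status_from_name(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < WC_STATUS_COUNT; i++)
        if (strcmp(wc_status_names[i], name) == 0)
            return (int)i;
    return -1;
}
```

With this table, `wc_status_name(1)` yields `IB_WC_LOC_LEN_ERR`, which matches the "Status 1" reported on the receiver side.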
According to the docs I could find, Status 1 is a local length error, meaning the size written in the packet doesn't match the payload. I've noticed that with few packets (<= 100) or when ttcp is slowed down (started through strace) transfers seem to work. I've tried to update to the latest ofa-kernel (1.4.1 from 10/16/2008) and the bug is still there. Has anyone seen this problem before? What can I do to locate where things go wrong? Regards Nicolas From mottyg at voltaire.com Fri Oct 17 06:45:20 2008 From: mottyg at voltaire.com (Motty Grosman) Date: Fri, 17 Oct 2008 15:45:20 +0200 Subject: [ofa-general] Recall: US Case 00012515: problems with epoll and ipoib (memory leak and cpu creep) [[ ref:00D38IO.50085NTZf:ref ]] Message-ID: <39C75744D164D948A170E9792AF8E7CA01989AEB@exil.voltaire.com> Motty Grosman would like to recall the message, "US Case 00012515: problems with epoll and ipoib (memory leak and cpu creep) [[ ref:00D38IO.50085NTZf:ref ]]". From dledford at redhat.com Fri Oct 17 10:04:39 2008 From: dledford at redhat.com (Doug Ledford) Date: Fri, 17 Oct 2008 13:04:39 -0400 Subject: [ofa-general] [Patch] Trivial: update usage info for ibaddr.c in infiniband-diags-1.4.1 Message-ID: <1224263079.4843.251.camel@firewall.xsintricity.com> Simple patch just to add an option to the usage info that was left out but is valid. 
Signed-off-by: Doug Ledford --- --- infiniband-diags-1.4.1/src/ibaddr.c.sm 2008-10-17 12:59:17.000000000 -0400 +++ infiniband-diags-1.4.1/src/ibaddr.c 2008-10-17 13:02:06.000000000 -0400 @@ -95,7 +95,7 @@ usage(void) else basename++; - fprintf(stderr, "Usage: %s [-d(ebug) -D(irect) -G(uid) -l(id_show) -g(id_show) -C ca_name -P ca_port " + fprintf(stderr, "Usage: %s [-d(ebug) -D(irect) -G(uid) -l(id_show) -g(id_show) -s(m_port) sm_lid -C ca_name -P ca_port " "-t(imeout) timeout_ms -V(ersion) -h(elp)] []\n", basename); fprintf(stderr, "\tExamples:\n"); -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part URL: From dotanba at gmail.com Fri Oct 17 19:31:04 2008 From: dotanba at gmail.com (Dotan Barak) Date: Sat, 18 Oct 2008 04:31:04 +0200 Subject: [ofa-general] ***SPAM*** [PATCH] iSER: Fix coding style issue according to checkpatch Message-ID: <200810180431.04476.dotanba@gmail.com> Fixed several coding style issues in iSER, according to checkpatch.pl: * Missing spaces * Spaces instead of tabs * Long lines * Assignment in "if condition" Signed-off-by: Dotan Barak --- diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c b/drivers/infiniband/ulp/iser/iscsi_iser.c index 5a1cf25..1a5325f 100644 --- a/drivers/infiniband/ulp/iser/iscsi_iser.c +++ b/drivers/infiniband/ulp/iser/iscsi_iser.c @@ -78,7 +78,7 @@ static struct scsi_transport_template *iscsi_iser_scsi_transport; static unsigned int iscsi_max_lun = 512; module_param_named(max_lun, iscsi_max_lun, uint, S_IRUGO); -int iser_debug_level = 0; +int iser_debug_level; MODULE_DESCRIPTION("iSER (iSCSI Extensions for RDMA) Datamover " "v" DRV_VER " (" DRV_DATE ")"); @@ -331,7 +331,7 @@ iscsi_iser_conn_bind(struct iscsi_cls_session 
*cls_session, /* binds the iSER connection retrieved from the previously * connected ep_handle to the iSCSI layer connection. exchanges * connection pointers */ - iser_err("binding iscsi conn %p to iser_conn %p\n",conn,ib_conn); + iser_err("binding iscsi conn %p to iser_conn %p\n", conn, ib_conn); iser_conn = conn->dd_data; ib_conn->iser_conn = iser_conn; iser_conn->ib_conn = ib_conn; @@ -491,7 +491,8 @@ iscsi_iser_set_param(struct iscsi_cls_conn *cls_conn, } static void -iscsi_iser_conn_get_stats(struct iscsi_cls_conn *cls_conn, struct iscsi_stats *stats) +iscsi_iser_conn_get_stats(struct iscsi_cls_conn *cls_conn, + struct iscsi_stats *stats) { struct iscsi_conn *conn = cls_conn->dd_data; @@ -583,24 +584,24 @@ iscsi_iser_ep_disconnect(struct iscsi_endpoint *ep) iscsi_suspend_tx(ib_conn->iser_conn->iscsi_conn); - iser_err("ib conn %p state %d\n",ib_conn, ib_conn->state); + iser_err("ib conn %p state %d\n", ib_conn, ib_conn->state); iser_conn_terminate(ib_conn); } static struct scsi_host_template iscsi_iser_sht = { - .module = THIS_MODULE, - .name = "iSCSI Initiator over iSER, v." DRV_VER, - .queuecommand = iscsi_queuecommand, - .change_queue_depth = iscsi_change_queue_depth, - .sg_tablesize = ISCSI_ISER_SG_TABLESIZE, - .max_sectors = 1024, - .cmd_per_lun = ISCSI_MAX_CMD_PER_LUN, - .eh_abort_handler = iscsi_eh_abort, - .eh_device_reset_handler= iscsi_eh_device_reset, - .eh_host_reset_handler = iscsi_eh_host_reset, - .use_clustering = DISABLE_CLUSTERING, - .proc_name = "iscsi_iser", - .this_id = -1, + .module = THIS_MODULE, + .name = "iSCSI Initiator over iSER, v." 
DRV_VER, + .queuecommand = iscsi_queuecommand, + .change_queue_depth = iscsi_change_queue_depth, + .sg_tablesize = ISCSI_ISER_SG_TABLESIZE, + .max_sectors = 1024, + .cmd_per_lun = ISCSI_MAX_CMD_PER_LUN, + .eh_abort_handler = iscsi_eh_abort, + .eh_device_reset_handler = iscsi_eh_device_reset, + .eh_host_reset_handler = iscsi_eh_host_reset, + .use_clustering = DISABLE_CLUSTERING, + .proc_name = "iscsi_iser", + .this_id = -1 }; static struct iscsi_transport iscsi_iser_transport = { diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h b/drivers/infiniband/ulp/iser/iscsi_iser.h index 81a8262..4c7e8da 100644 --- a/drivers/infiniband/ulp/iser/iscsi_iser.h +++ b/drivers/infiniband/ulp/iser/iscsi_iser.h @@ -163,7 +163,7 @@ struct iser_data_buf { unsigned long data_len; /* total data len */ unsigned int dma_nents; /* returned by dma_map_sg */ char *copy_buf; /* allocated copy buf for SGs unaligned * - * for rdma which are copied */ + * for rdma which are copied */ struct scatterlist sg_single; /* SG-ified clone of a non SG SC or * * unaligned SG */ }; diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c index cdd2831..16a59b7 100644 --- a/drivers/infiniband/ulp/iser/iser_initiator.c +++ b/drivers/infiniband/ulp/iser/iser_initiator.c @@ -91,7 +91,7 @@ static int iser_prepare_read_cmd(struct iscsi_task *task, return -EINVAL; } - err = iser_reg_rdma_mem(iser_task,ISER_DIR_IN); + err = iser_reg_rdma_mem(iser_task, ISER_DIR_IN); if (err) { iser_err("Failed to set up Data-IN RDMA\n"); return err; @@ -141,7 +141,7 @@ iser_prepare_write_cmd(struct iscsi_task *task, return -EINVAL; } - err = iser_reg_rdma_mem(iser_task,ISER_DIR_OUT); + err = iser_reg_rdma_mem(iser_task, ISER_DIR_OUT); if (err != 0) { iser_err("Failed to register write cmd RDMA mem\n"); return err; @@ -304,7 +304,7 @@ iser_check_xmit(struct iscsi_conn *conn, void *task) if (atomic_read(&iser_conn->ib_conn->post_send_buf_count) == ISER_QP_MAX_REQ_DTOS) { - 
iser_dbg("%ld can't xmit task %p\n",jiffies,task); + iser_dbg("%ld can't xmit task %p\n", jiffies, task); return -ENOBUFS; } return 0; @@ -328,7 +328,8 @@ int iser_send_command(struct iscsi_conn *conn, struct scsi_cmnd *sc = task->sc; if (!iser_conn_state_comp(iser_conn->ib_conn, ISER_CONN_UP)) { - iser_err("Failed to send, conn: 0x%p is not up\n", iser_conn->ib_conn); + iser_err("Failed to send, conn: 0x%p is not up\n", + iser_conn->ib_conn); return -EPERM; } if (iser_check_xmit(conn, task)) @@ -362,7 +363,7 @@ int iser_send_command(struct iscsi_conn *conn, if (hdr->flags & ISCSI_FLAG_CMD_WRITE) { err = iser_prepare_write_cmd(task, task->imm_count, - task->imm_count + + task->imm_count + task->unsol_count, edtl); if (err) @@ -386,7 +387,7 @@ int iser_send_command(struct iscsi_conn *conn, send_command_error: iser_dto_buffs_release(send_dto); - iser_err("conn %p failed task->itt %d err %d\n",conn, task->itt, err); + iser_err("conn %p failed task->itt %d err %d\n", conn, task->itt, err); return err; } @@ -407,7 +408,8 @@ int iser_send_data_out(struct iscsi_conn *conn, int err = 0; if (!iser_conn_state_comp(iser_conn->ib_conn, ISER_CONN_UP)) { - iser_err("Failed to send, conn: 0x%p is not up\n", iser_conn->ib_conn); + iser_err("Failed to send, conn: 0x%p is not up\n", + iser_conn->ib_conn); return -EPERM; } @@ -419,7 +421,7 @@ int iser_send_data_out(struct iscsi_conn *conn, buf_offset = ntohl(hdr->offset); iser_dbg("%s itt %d dseg_len %d offset %d\n", - __func__,(int)itt,(int)data_seg_len,(int)buf_offset); + __func__, (int)itt, (int)data_seg_len, (int)buf_offset); tx_desc = kmem_cache_alloc(ig.desc_cache, GFP_NOIO); if (tx_desc == NULL) { @@ -463,7 +465,7 @@ int iser_send_data_out(struct iscsi_conn *conn, send_data_out_error: iser_dto_buffs_release(send_dto); kmem_cache_free(ig.desc_cache, tx_desc); - iser_err("conn %p failed err %d\n",conn, err); + iser_err("conn %p failed err %d\n", conn, err); return err; } @@ -480,7 +482,8 @@ int iser_send_control(struct 
iscsi_conn *conn, struct iser_device *device; if (!iser_conn_state_comp(iser_conn->ib_conn, ISER_CONN_UP)) { - iser_err("Failed to send, conn: 0x%p is not up\n", iser_conn->ib_conn); + iser_err("Failed to send, conn: 0x%p is not up\n", + iser_conn->ib_conn); return -EPERM; } @@ -524,7 +527,7 @@ int iser_send_control(struct iscsi_conn *conn, send_control_error: iser_dto_buffs_release(send_dto); - iser_err("conn %p failed err %d\n",conn, err); + iser_err("conn %p failed err %d\n", conn, err); return err; } @@ -545,7 +548,7 @@ void iser_rcv_completion(struct iser_desc *rx_desc, hdr = &rx_desc->iscsi_header; - iser_dbg("op 0x%x itt 0x%x\n", hdr->opcode,hdr->itt); + iser_dbg("op 0x%x itt 0x%x\n", hdr->opcode, hdr->itt); if (dto_xfer_len > ISER_TOTAL_HEADERS_LEN) { /* we have data */ rx_data_len = dto_xfer_len - ISER_TOTAL_HEADERS_LEN; @@ -568,7 +571,7 @@ void iser_rcv_completion(struct iser_desc *rx_desc, conn->iscsi_conn, opcode, hdr->itt); else { iser_task = task->dd_data; - iser_dbg("itt %d task %p\n",hdr->itt, task); + iser_dbg("itt %d task %p\n", hdr->itt, task); iser_task->status = ISER_TASK_STATUS_COMPLETED; iser_task_rdma_finalize(iser_task); iscsi_put_task(task); @@ -611,7 +614,7 @@ void iser_snd_completion(struct iser_desc *tx_desc) atomic_dec(&ib_conn->post_send_buf_count); if (resume_tx) { - iser_dbg("%ld resuming tx\n",jiffies); + iser_dbg("%ld resuming tx\n", jiffies); scsi_queue_work(conn->session->host, &conn->xmitwork); } diff --git a/drivers/infiniband/ulp/iser/iser_memory.c b/drivers/infiniband/ulp/iser/iser_memory.c index b9453d0..1918312 100644 --- a/drivers/infiniband/ulp/iser/iser_memory.c +++ b/drivers/infiniband/ulp/iser/iser_memory.c @@ -116,7 +116,7 @@ static int iser_start_rdma_unaligned_sg(struct iscsi_iser_task *iser_task, if (mem == NULL) { iser_err("Failed to allocate mem size %d %d for copying sglist\n", - data->size,(int)cmd_data_len); + data->size, (int)cmd_data_len); return -ENOMEM; } @@ -272,7 +272,8 @@ static int 
iser_sg_to_page_vec(struct iser_data_buf *data, } page_vec->data_size = total_sz; - iser_dbg("page_vec->data_size:%d cur_page %d\n", page_vec->data_size,cur_page); + iser_dbg("page_vec->data_size:%d cur_page %d\n", + page_vec->data_size, cur_page); return cur_page; } @@ -350,7 +351,7 @@ static void iser_dump_page_vec(struct iser_page_vec *page_vec) iser_err("page vec length %d data size %d\n", page_vec->length, page_vec->data_size); for (i = 0; i < page_vec->length; i++) - iser_err("%d %lx\n",i,(unsigned long)page_vec->pages[i]); + iser_err("%d %lx\n", i, (unsigned long)page_vec->pages[i]); } static void iser_page_vec_build(struct iser_data_buf *data, @@ -364,7 +365,7 @@ static void iser_page_vec_build(struct iser_data_buf *data, iser_dbg("Translating sg sz: %d\n", data->dma_nents); page_vec_len = iser_sg_to_page_vec(data, page_vec, ibdev); - iser_dbg("sg len %d page_vec_len %d\n", data->dma_nents,page_vec_len); + iser_dbg("sg len %d page_vec_len %d\n", data->dma_nents, page_vec_len); page_vec->length = page_vec_len; @@ -478,7 +479,7 @@ int iser_reg_rdma_mem(struct iscsi_iser_task *iser_task, iser_err("page_vec: data_size = 0x%x, length = %d, offset = 0x%x\n", ib_conn->page_vec->data_size, ib_conn->page_vec->length, ib_conn->page_vec->offset); - for (i=0 ; i<ib_conn->page_vec->length ; i++) + for (i = 0; i < ib_conn->page_vec->length; i++) iser_err("page_vec[%d] = 0x%llx\n", i, (unsigned long long) ib_conn->page_vec->pages[i]); return err; diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c index 26ff621..3b8569d 100644 --- a/drivers/infiniband/ulp/iser/iser_verbs.c +++ b/drivers/infiniband/ulp/iser/iser_verbs.c @@ -51,7 +51,7 @@ static void iser_cq_event_callback(struct ib_event *cause, void *context) static void iser_qp_event_callback(struct ib_event *cause, void *context) { - iser_err("got qp event %d\n",cause->event); + iser_err("got qp event %d\n", cause->event); } /** @@ -137,7 +137,7 @@ static int
iser_create_ib_conn_res(struct iser_conn *ib_conn) device = ib_conn->device; ib_conn->page_vec = kmalloc(sizeof(struct iser_page_vec) + - (sizeof(u64) * (ISCSI_ISER_SG_TABLESIZE +1)), + (sizeof(u64) * (ISCSI_ISER_SG_TABLESIZE + 1)), GFP_KERNEL); if (!ib_conn->page_vec) { ret = -ENOMEM; @@ -269,7 +269,7 @@ static void iser_device_try_release(struct iser_device *device) { mutex_lock(&ig.device_list_mutex); device->refcount--; - iser_err("device %p refcount %d\n",device,device->refcount); + iser_err("device %p refcount %d\n", device, device->refcount); if (!device->refcount) { iser_free_device_ib_res(device); list_del(&device->ig_list); @@ -296,7 +296,8 @@ static int iser_conn_state_comp_exch(struct iser_conn *ib_conn, int ret; spin_lock_bh(&ib_conn->lock); - if ((ret = (ib_conn->state == comp))) + ret = (ib_conn->state == comp); + if (ret) ib_conn->state = exch; spin_unlock_bh(&ib_conn->lock); return ret; @@ -352,7 +353,7 @@ void iser_conn_terminate(struct iser_conn *ib_conn) err = rdma_disconnect(ib_conn->cma_id); if (err) iser_err("Failed to disconnect, conn: 0x%p err %d\n", - ib_conn,err); + ib_conn, err); wait_event_interruptible(ib_conn->wait, ib_conn->state == ISER_CONN_DOWN); @@ -456,11 +457,13 @@ static void iser_disconnected_handler(struct rdma_cm_id *cma_id) } } -static int iser_cma_handler(struct rdma_cm_id *cma_id, struct rdma_cm_event *event) +static int iser_cma_handler(struct rdma_cm_id *cma_id, + struct rdma_cm_event *event) { int ret = 0; - iser_err("event %d conn %p id %p\n",event->event,cma_id->context,cma_id); + iser_err("event %d conn %p id %p\n", event->event, cma_id->context, + cma_id); switch (event->event) { case RDMA_CM_EVENT_ADDR_RESOLVED: @@ -515,7 +518,7 @@ int iser_connect(struct iser_conn *ib_conn, struct sockaddr *src, *dst; int err = 0; - sprintf(ib_conn->name,"%d.%d.%d.%d:%d", + sprintf(ib_conn->name, "%d.%d.%d.%d:%d", NIPQUAD(dst_addr->sin_addr.s_addr), dst_addr->sin_port); /* the device is known only --after-- address resolution */ 
@@ -623,7 +626,7 @@ void iser_unreg_mem(struct iser_mem_reg *reg) { int ret; - iser_dbg("PHYSICAL Mem.Unregister mem_h %p\n",reg->mem_h); + iser_dbg("PHYSICAL Mem.Unregister mem_h %p\n", reg->mem_h); ret = ib_fmr_pool_unmap((struct ib_pool_fmr *)reg->mem_h); if (ret) @@ -635,7 +638,8 @@ void iser_unreg_mem(struct iser_mem_reg *reg) /** * iser_dto_to_iov - builds IOV from a dto descriptor */ -static void iser_dto_to_iov(struct iser_dto *dto, struct ib_sge *iov, int iov_len) +static void iser_dto_to_iov(struct iser_dto *dto, struct ib_sge *iov, + int iov_len) { int i; struct ib_sge *sge; @@ -799,7 +803,8 @@ static void iser_cq_tasklet_fn(unsigned long data) } else /* type == ISCSI_TX_CONTROL/SCSI_CMD/DOUT */ iser_snd_completion(desc); } else { - iser_err("comp w. error op %d status %d\n",desc->type,wc.status); + iser_err("comp w. error op %d status %d\n", + desc->type, wc.status); iser_handle_comp_error(desc); } } From dotanba at gmail.com Fri Oct 17 19:35:00 2008 From: dotanba at gmail.com (Dotan Barak) Date: Sat, 18 Oct 2008 04:35:00 +0200 Subject: [ofa-general] ***SPAM*** [PATCH] libibverbs: Update Dotan's email in all of the files Message-ID: <200810180435.00292.dotanba@gmail.com> Update Dotan's email in all of the files. Signed-off-by: Dotan Barak --- diff --git a/AUTHORS b/AUTHORS index e2dc298..fcea350 100644 --- a/AUTHORS +++ b/AUTHORS @@ -1,4 +1,4 @@ Roland Dreier -Dotan Barak +Dotan Barak Sean Hefty Michael S. Tsirkin diff --git a/debian/copyright b/debian/copyright index 58a8e6c..5009e7a 100644 --- a/debian/copyright +++ b/debian/copyright @@ -8,7 +8,7 @@ It was downloaded from the OpenIB web site at Authors: Roland Dreier - Dotan Barak + Dotan Barak Sean Hefty Michael S. Tsirkin diff --git a/man/ibv_alloc_pd.3 b/man/ibv_alloc_pd.3 index c3afd1a..63e1aea 100644 --- a/man/ibv_alloc_pd.3 +++ b/man/ibv_alloc_pd.3 @@ -37,4 +37,4 @@ freed. 
.BR ibv_create_ah_from_wc (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_attach_mcast.3 b/man/ibv_attach_mcast.3 index 24bcaac..7d83d56 100644 --- a/man/ibv_attach_mcast.3 +++ b/man/ibv_attach_mcast.3 @@ -50,4 +50,4 @@ the local port. .BR ibv_create_qp (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_create_ah.3 b/man/ibv_create_ah.3 index 0260f0f..becc7d1 100644 --- a/man/ibv_create_ah.3 +++ b/man/ibv_create_ah.3 @@ -61,4 +61,4 @@ returns 0 on success, or the value of errno on failure (which indicates the fail .BR ibv_create_ah_from_wc (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_create_ah_from_wc.3 b/man/ibv_create_ah_from_wc.3 index eb20dd3..8b92caf 100644 --- a/man/ibv_create_ah_from_wc.3 +++ b/man/ibv_create_ah_from_wc.3 @@ -60,4 +60,4 @@ can be used to create a new AH using .BR ibv_poll_cq (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_create_comp_channel.3 b/man/ibv_create_comp_channel.3 index 285ee04..15a9618 100644 --- a/man/ibv_create_comp_channel.3 +++ b/man/ibv_create_comp_channel.3 @@ -47,4 +47,4 @@ channel being destroyed. .BR ibv_get_cq_event (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_create_cq.3 b/man/ibv_create_cq.3 index cfa5f3e..5dc333e 100644 --- a/man/ibv_create_cq.3 +++ b/man/ibv_create_cq.3 @@ -55,4 +55,4 @@ fails if any queue pair is still associated with this CQ. .BR ibv_create_qp (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_create_qp.3 b/man/ibv_create_qp.3 index 28b3a09..5301ad8 100644 --- a/man/ibv_create_qp.3 +++ b/man/ibv_create_qp.3 @@ -82,4 +82,4 @@ fails if the QP is attached to a multicast group. .BR ibv_query_qp (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_create_srq.3 b/man/ibv_create_srq.3 index f0963e8..7a826a1 100644 --- a/man/ibv_create_srq.3 +++ b/man/ibv_create_srq.3 @@ -64,4 +64,4 @@ fails if any queue pair is still associated with this SRQ. 
.BR ibv_query_srq (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_devinfo.1 b/man/ibv_devinfo.1 index 41878b2..70f7ed2 100644 --- a/man/ibv_devinfo.1 +++ b/man/ibv_devinfo.1 @@ -33,7 +33,7 @@ print all available information about RDMA devices .SH AUTHORS .TP Dotan Barak -.RI < dotanb at mellanox.co.il > +.RI < dotanba at gmail.com > .TP Roland Dreier .RI < rolandd at cisco.com > diff --git a/man/ibv_fork_init.3 b/man/ibv_fork_init.3 index b34f71f..6f2a287 100644 --- a/man/ibv_fork_init.3 +++ b/man/ibv_fork_init.3 @@ -55,4 +55,4 @@ usually will not be significant. .BR ibv_get_device_list (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_get_async_event.3 b/man/ibv_get_async_event.3 index 7426f4d..acb6257 100644 --- a/man/ibv_get_async_event.3 +++ b/man/ibv_get_async_event.3 @@ -159,4 +159,4 @@ ibv_ack_async_event(&async_event); .BR ibv_open_device (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_get_cq_event.3 b/man/ibv_get_cq_event.3 index 58c744a..70b2572 100644 --- a/man/ibv_get_cq_event.3 +++ b/man/ibv_get_cq_event.3 @@ -182,4 +182,4 @@ ibv_ack_cq_events(ev_cq, 1); .SH "AUTHORS" .TP Dotan Barak -.RI < dotanb at mellanox.co.il > +.RI < dotanba at gmail.com > diff --git a/man/ibv_get_device_guid.3 b/man/ibv_get_device_guid.3 index 98c0499..8cbe0e7 100644 --- a/man/ibv_get_device_guid.3 +++ b/man/ibv_get_device_guid.3 @@ -22,4 +22,4 @@ returns the GUID of the device in network byte order. 
.BR ibv_open_device (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_get_device_list.3 b/man/ibv_get_device_list.3 index c881c28..003fffb 100644 --- a/man/ibv_get_device_list.3 +++ b/man/ibv_get_device_list.3 @@ -43,4 +43,4 @@ it will be able to use only the open devices; pointers to unopened devices will .BR ibv_open_device (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_get_device_name.3 b/man/ibv_get_device_name.3 index 284ea9f..b6cb491 100644 --- a/man/ibv_get_device_name.3 +++ b/man/ibv_get_device_name.3 @@ -22,4 +22,4 @@ returns a pointer to the device name, or NULL if the request fails. .BR ibv_open_device (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_modify_qp.3 b/man/ibv_modify_qp.3 index f045900..a022dac 100644 --- a/man/ibv_modify_qp.3 +++ b/man/ibv_modify_qp.3 @@ -166,4 +166,4 @@ RTS \fB IBV_QP_STATE, IBV_QP_SQ_PSN, IBV_QP_MAX_QP_RD_ATOMIC, \fR .BR ibv_create_ah (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_modify_srq.3 b/man/ibv_modify_srq.3 index 01375c9..fa23c3a 100644 --- a/man/ibv_modify_srq.3 +++ b/man/ibv_modify_srq.3 @@ -60,4 +60,4 @@ Modifying the srq_limit arms the SRQ to produce an .BR ibv_query_srq (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_open_device.3 b/man/ibv_open_device.3 index d5149a5..ed2226c 100644 --- a/man/ibv_open_device.3 +++ b/man/ibv_open_device.3 @@ -40,4 +40,4 @@ resources before closing a context. .BR ibv_query_pkey (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_poll_cq.3 b/man/ibv_poll_cq.3 index 75e4d7c..cf70efc 100644 --- a/man/ibv_poll_cq.3 +++ b/man/ibv_poll_cq.3 @@ -74,4 +74,4 @@ will be triggered, and the CQ cannot be used. 
.BR ibv_post_recv (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_post_recv.3 b/man/ibv_post_recv.3 index 46a630b..4e65c81 100644 --- a/man/ibv_post_recv.3 +++ b/man/ibv_post_recv.3 @@ -73,4 +73,4 @@ offset of 40 bytes into the buffer(s) in the scatter list. .BR ibv_poll_cq (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_post_send.3 b/man/ibv_post_send.3 index 8c7b0eb..51fc1cf 100644 --- a/man/ibv_post_send.3 +++ b/man/ibv_post_send.3 @@ -120,4 +120,4 @@ after the call returns. .BR ibv_poll_cq (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_post_srq_recv.3 b/man/ibv_post_srq_recv.3 index 65877cd..073972b 100644 --- a/man/ibv_post_srq_recv.3 +++ b/man/ibv_post_srq_recv.3 @@ -65,4 +65,4 @@ offset of 40 bytes into the buffer(s) in the scatter list. .BR ibv_poll_cq (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_query_device.3 b/man/ibv_query_device.3 index 3bf7511..afc7573 100644 --- a/man/ibv_query_device.3 +++ b/man/ibv_query_device.3 @@ -81,4 +81,4 @@ in use by other users/processes. .BR ibv_query_gid (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_query_gid.3 b/man/ibv_query_gid.3 index df3ac10..b3a1b0f 100644 --- a/man/ibv_query_gid.3 +++ b/man/ibv_query_gid.3 @@ -30,4 +30,4 @@ returns 0 on success, and \-1 on error. .BR ibv_query_pkey (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_query_pkey.3 b/man/ibv_query_pkey.3 index fcdfe6c..9683049 100644 --- a/man/ibv_query_pkey.3 +++ b/man/ibv_query_pkey.3 @@ -30,4 +30,4 @@ returns 0 on success, and \-1 on error. 
.BR ibv_query_gid (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_query_port.3 b/man/ibv_query_port.3 index c6b3b63..882470d 100644 --- a/man/ibv_query_port.3 +++ b/man/ibv_query_port.3 @@ -58,4 +58,4 @@ returns 0 on success, or the value of errno on failure (which indicates the fail .BR ibv_create_ah (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_query_qp.3 b/man/ibv_query_qp.3 index 8da270e..7101dc2 100644 --- a/man/ibv_query_qp.3 +++ b/man/ibv_query_qp.3 @@ -85,4 +85,4 @@ may yield some differences in the values returned for the following attributes: .BR ibv_create_ah (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_query_srq.3 b/man/ibv_query_srq.3 index 56db4fa..05519aa 100644 --- a/man/ibv_query_srq.3 +++ b/man/ibv_query_srq.3 @@ -41,4 +41,4 @@ asynchronous events will be generated until the event is rearmed. .BR ibv_modify_srq (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_rate_to_mult.3 b/man/ibv_rate_to_mult.3 index 1346d75..fb57ff8 100644 --- a/man/ibv_rate_to_mult.3 +++ b/man/ibv_rate_to_mult.3 @@ -43,4 +43,4 @@ returns the enumeration representing the IB transmission rate. .BR ibv_query_port (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_reg_mr.3 b/man/ibv_reg_mr.3 index cb0d8b8..7cd3de8 100644 --- a/man/ibv_reg_mr.3 +++ b/man/ibv_reg_mr.3 @@ -74,4 +74,4 @@ fails if any memory window is still bound to this MR. .BR ibv_post_srq_recv (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_req_notify_cq.3 b/man/ibv_req_notify_cq.3 index eda5801..47d45fe 100644 --- a/man/ibv_req_notify_cq.3 +++ b/man/ibv_req_notify_cq.3 @@ -40,4 +40,4 @@ will be generated for each call to .BR ibv_get_cq_event (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak diff --git a/man/ibv_resize_cq.3 b/man/ibv_resize_cq.3 index 0563f56..fbb1ebd 100644 --- a/man/ibv_resize_cq.3 +++ b/man/ibv_resize_cq.3 @@ -39,4 +39,4 @@ will be updated to the actual size. 
.BR ibv_destroy_cq (3) .SH "AUTHORS" .TP -Dotan Barak +Dotan Barak From dmlb2000 at gmail.com Fri Oct 17 16:24:30 2008 From: dmlb2000 at gmail.com (David Brown) Date: Fri, 17 Oct 2008 16:24:30 -0700 Subject: [ofa-general] Fwd: pciback module not working In-Reply-To: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com> References: <9c21eeae0809111424v3c8bf001k42b9463a25529e32@mail.gmail.com> Message-ID: <9c21eeae0810171624o208bff4fo9b071a9881d83060@mail.gmail.com> Okay so my question to the openfabrics guys is, why would the OFED drivers fail to read the firmware? Any thoughts? Thanks, - David Brown ---------- Forwarded message ---------- From: David Brown Date: Thu, Sep 11, 2008 at 2:24 PM Subject: pciback module not working To: xen-users at lists.xensource.com, xen-devel at lists.xensource.com This issue was brought up about a year and a half ago. So I'll bring it up again and see if anything happens. I've got an infiniband network and am attempting to pass the infiniband card through the host and give it to the guest. I'm working with standard CentOS 5.2 on both guest and host with their provided xen (3.0.3 ish). I've also attempted to install the newest Xen 3.3 and use their standard host kernel and that did the same thing. The guest dmesg output in the guest is similar on both permissive and normal mode. I'm getting issues with detecting the firmware on the card for some reason... Any help would be appreciated. Thanks, - David Brown === GUEST dmesg output === ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008) ib_mthca: Initializing 0000:00:00.0 PCI: Enabling device 0000:00:00.0 (0000 -> 0002) PCI: Setting latency timer of device 0000:00:00.0 to 64 ib_mthca 0000:00:00.0: QUERY_FW command failed, aborting. 
ib_mthca: probe of 0000:00:00.0 failed with error -11 ======================= === Host modprobe.conf === alias eth0 bnx2 alias eth1 bnx2 alias scsi_hostadapter cciss options pciback hide=(41:00.0) ===================== === Host lspci output === # lspci -vs 41:00.0 41:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20) Subsystem: Hewlett-Packard Company Unknown device 170a Flags: fast devsel, IRQ 16 Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M] Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M] Capabilities: [40] Power Management version 2 Capabilities: [48] Vital Product Data Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable- Capabilities: [84] MSI-X: Enable- Mask- TabSize=32 Capabilities: [60] Express Endpoint IRQ 0 ===================== This makes sure it get loaded first off before anything else. === Host mkinitrd cmd === # mkinitrd -f --with=pciback --preload pciback /boot/initrd-2.6.18-92.1.10.el5xen.img 2.6.18-92.1.10.el5xen ==================== === Host pciback dmesg === pciback 0000:41:00.0: Driver tried to write to a read-only configuration space field at offset 0x44, size 2. This may be harmless, but if you have problems with your device: 1) see permissive attribute in sysfs 2) report problems to the xen-devel mailing list along with details of your device obtained from lspci. PCI: Enabling device 0000:41:00.0 (0000 -> 0002) ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16 PCI: Setting latency timer of device 0000:41:00.0 to 64 ACPI: PCI interrupt for device 0000:41:00.0 disabled ====================== === Host pciback dmesg (after setting it permissive) === pciback 0000:41:00.0: enabling permissive mode configuration space accesses! pciback 0000:41:00.0: permissive mode is potentially unsafe! 
pciback: vpci: 0000:41:00.0: assign to virtual slot 0 device vif1.0 entered promiscuous mode ADDRCONF(NETDEV_UP): vif1.0: link is not ready blkback: ring-ref 9, event-channel 28, protocol 1 (x86_64-abi) PCI: Enabling device 0000:41:00.0 (0000 -> 0002) ACPI: PCI Interrupt 0000:41:00.0[A] -> GSI 16 (level, low) -> IRQ 16 PCI: Setting latency timer of device 0000:41:00.0 to 64 ACPI: PCI interrupt for device 0000:41:00.0 disabled ========================================= === Guest lspci output === # lspci -v 00:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20) Subsystem: Hewlett-Packard Company Unknown device 170a Flags: fast devsel, IRQ 16 Memory at fdc00000 (64-bit, non-prefetchable) [disabled] [size=1M] Memory at fd000000 (64-bit, prefetchable) [disabled] [size=8M] Capabilities: [40] Power Management version 2 Capabilities: [48] Vital Product Data Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable- Capabilities: [84] MSI-X: Enable- Mask- TabSize=32 Capabilities: [60] Express Endpoint IRQ 0 ===================== From sashak at voltaire.com Sat Oct 18 02:22:17 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 18 Oct 2008 11:22:17 +0200 Subject: [ofa-general] Re: [PATCH 0/6 v2] opensm: Unicast Routing Cache In-Reply-To: <48F660B9.5010804@dev.mellanox.co.il> References: <48E96928.8030200@dev.mellanox.co.il> <20081009171103.GF4912@sashak.voltaire.com> <48EEB00E.7000209@dev.mellanox.co.il> <20081010082827.GX4912@sashak.voltaire.com> <48F13DEC.2030109@dev.mellanox.co.il> <48F660B9.5010804@dev.mellanox.co.il> Message-ID: <20081018092216.GC21743@sashak.voltaire.com> Hi Yevgeny, On 23:29 Wed 15 Oct , Yevgeny Kliteynik wrote: > > I'm sending v2 of the patches: All applied. Thanks. > The job that still needs to be done: > - Check how the cache handles port moving during discovery. > Might be a bug there. > - Check how unicast manager handles fast reset of switches. 
> AFAIK SM will now write the LFT there - need to fix it > (unrelated to cache, general ucast mgr issue) Seeing again at this I think that it handles this properly. This is the code in osm_ucast_mgr_set_fwd_table(): if (!p_sw->need_update && !memcmp(block, p_sw->lft_buf + block_id_ho * 64, 64)) continue; > - Optimize LFT usage - simplify current switch LFT, > hold two LFTs (current and cached) only when these LFTs > are not identical. This would be a nice improvement. Sasha From sashak at voltaire.com Sat Oct 18 02:23:08 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 18 Oct 2008 11:23:08 +0200 Subject: [ofa-general] [Patch] Trivial: update usage info for ibaddr.c in infiniband-diags-1.4.1 In-Reply-To: <1224263079.4843.251.camel@firewall.xsintricity.com> References: <1224263079.4843.251.camel@firewall.xsintricity.com> Message-ID: <20081018092308.GD21743@sashak.voltaire.com> On 13:04 Fri 17 Oct , Doug Ledford wrote: > Simple patch just to add an option to the usage info that was left out > but is valid. > > Signed-off-by: Doug Ledford Applied. Thanks. Sasha From sashak at voltaire.com Sat Oct 18 02:34:26 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 18 Oct 2008 11:34:26 +0200 Subject: [ofa-general] Re: [PATCH] OpenSM release notes: Indicate InfiniScale-IV support In-Reply-To: <48F5FCC8.8070601@obsidianresearch.com> References: <48F5FCC8.8070601@obsidianresearch.com> Message-ID: <20081018093426.GF21743@sashak.voltaire.com> On 08:23 Wed 15 Oct , Hal Rosenstock wrote: > Sasha, > > Another update to the release notes. > > -- Hal > > OpenSM release notes: Add IS-IV information in qualified devices/firmware > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From vlad at lists.openfabrics.org Sat Oct 18 03:16:09 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 18 Oct 2008 03:16:09 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20081018-0200 daily build status Message-ID: <20081018101609.52B6EE60B36@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with 
linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From sashak at voltaire.com Sat Oct 18 04:27:22 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 18 Oct 2008 13:27:22 +0200 Subject: [ofa-general] Re: [PATCH] opensm/doc/current-routing.txt: added ucast cache info In-Reply-To: <48F728C3.9010602@dev.mellanox.co.il> References: <48F728C3.9010602@dev.mellanox.co.il> Message-ID: <20081018112722.GP21743@sashak.voltaire.com> On 13:42 Thu 16 Oct , Yevgeny Kliteynik wrote: > Added ucast cache info in current-routing.txt > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. Sasha From kliteyn at dev.mellanox.co.il Sat Oct 18 07:58:19 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sat, 18 Oct 2008 16:58:19 +0200 Subject: [ofa-general] Re: [PATCH 0/6 v2] opensm: Unicast Routing Cache In-Reply-To: <20081018092216.GC21743@sashak.voltaire.com> References: <48E96928.8030200@dev.mellanox.co.il> <20081009171103.GF4912@sashak.voltaire.com> <48EEB00E.7000209@dev.mellanox.co.il> <20081010082827.GX4912@sashak.voltaire.com> <48F13DEC.2030109@dev.mellanox.co.il> <48F660B9.5010804@dev.mellanox.co.il> <20081018092216.GC21743@sashak.voltaire.com> Message-ID: <48F9F98B.7000909@dev.mellanox.co.il> Hi Sasha, Sasha Khapyorsky wrote: > Hi Yevgeny, > > On 23:29 Wed 15 Oct , Yevgeny Kliteynik wrote: >> I'm sending v2 of the patches: > > All applied. Thanks. > >> The job that still needs to be done: >> - Check how the cache handles port moving during discovery. >> Might be a bug there. >> - Check how unicast manager handles fast reset of switches. >> AFAIK SM will now write the LFT there - need to fix it >> (unrelated to cache, general ucast mgr issue) > > Seeing again at this I think that it handles this properly. 
This is > the code in osm_ucast_mgr_set_fwd_table(): Right, I noticed it too. > if (!p_sw->need_update && > !memcmp(block, p_sw->lft_buf + block_id_ho * 64, 64)) > continue; > >> - Optimize LFT usage - simplify current switch LFT, I've already done this - two patches were posted to list couple of days ago. >> hold two LFTs (current and cached) only when these LFTs >> are not identical. I already have a draft for this too - it's very short. Waiting for the LFT simplification patches to be applied first. -- Yevgeny > This would be a nice improvement. > > Sasha > From publications at kressworks.com Sat Oct 18 08:50:22 2008 From: publications at kressworks.com (publications) Date: Sat, 18 Oct 2008 11:50:22 -0400 Subject: [ofa-general] 32 bit mvapich 1.0.1 shared libraries from OFED 1.3.1 In-Reply-To: <6938774FE5D54C2BA6E010D193AEBD78@inspiron9100> References: <6938774FE5D54C2BA6E010D193AEBD78@inspiron9100> Message-ID: Well, it appears that my question is going to go unanswered. BTW, the same problem occurs in OFED versions 1.3.1 and 1.4 RC2 Thanks for the help. > -----Original Message----- > From: general-bounces at lists.openfabrics.org > [mailto:general-bounces at lists.openfabrics.org] On Behalf Of > publications > Sent: Thursday, October 16, 2008 1:49 PM > To: general at lists.openfabrics.org > Subject: [ofa-general] 32 bit mvapich 1.0.1 shared libraries > > When installing OFED 1.3.1, the person doing the install is > presented with a > option: > > --build32 Build 32-bit libraries. Relevant for > x86_64 and ppc64 > platforms > > However, there are no 32 bit shared libraries for > mvapich-1.0.1 created, even though the installation script > finishes with "installation successful" > > How does one get the 32 bit mvapich-1.0.1 libraries built and > where will they be located? 
> > > Thanks, > > Jim Kress > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Sat Oct 18 16:20:32 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 19 Oct 2008 01:20:32 +0200 Subject: [ofa-general] [PATCH] infiniband-diags/saquery: add LFT Record handling Message-ID: <20081018232032.GS5528@sashak.voltaire.com> Add SA LFT Record attribute handling. Signed-off-by: Sasha Khapyorsky --- infiniband-diags/src/saquery.c | 52 ++++++++++++++++++++++++++++++++++++++++ 1 files changed, 52 insertions(+), 0 deletions(-) diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index 00b0f2c..996434d 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -634,6 +634,21 @@ static void dump_one_pkey_tbl_record(void *data) printf("\n"); } +static void dump_one_lft_record(void *data) +{ + ib_lft_record_t *lftr = data; + unsigned block = cl_ntoh16(lftr->block_num); + int i; + printf("LFT Record dump:\n" + "\t\tLID........................%u\n" + "\t\tBlock......................%u\n" + "\t\tLFT:\n", + cl_ntoh16(lftr->lid), block); + for (i = 0; i < 64 ; i++) + printf("\t\t%u\t%u\n", block*64 + i, lftr->lft[i]); + printf("\n"); +} + static void dump_results(osmv_query_res_t *r, void (*dump_func)(void *)) { int i; @@ -1251,6 +1266,41 @@ print_pkey_tbl_records(const struct query_cmd *q, osm_bind_handle_t bind_handle, return status; } +static int +print_lft_records(const struct query_cmd *q, osm_bind_handle_t bind_handle, + int argc, char *argv[]) +{ + ib_lft_record_t lftr; + ib_net64_t comp_mask = 0; + int lid = 0, block = -1; + ib_api_status_t status; + + if (argc > 0) + parse_lid_and_ports(bind_handle, argv[0], + &lid, &block, NULL); + + memset(&lftr, 0, sizeof(lftr)); + + if (lid > 0) { + 
lftr.lid = cl_hton16(lid); + comp_mask |= IB_LFTR_COMPMASK_LID; + } + if (block >= 0) { + lftr.block_num = cl_hton16(block); + comp_mask |= IB_LFTR_COMPMASK_BLOCK; + } + + status = get_any_records(bind_handle, IB_MAD_ATTR_LFT_RECORD, 0, + comp_mask, &lftr, + ib_get_attr_offset(sizeof(lftr)), 0); + if (status != IB_SUCCESS) + return status; + + dump_results(&result, dump_one_lft_record); + return_mad(); + return status; +} + static osm_bind_handle_t get_bind_handle(void) { @@ -1344,6 +1394,8 @@ static const struct query_cmd query_cmds[] = { { "ServiceRecord", "SR", IB_MAD_ATTR_SERVICE_RECORD, }, { "PathRecord", "PR", IB_MAD_ATTR_PATH_RECORD, }, { "MCMemberRecord", "MCMR", IB_MAD_ATTR_MCMEMBER_RECORD, }, + { "LFTRecord", "LFTR", IB_MAD_ATTR_LFT_RECORD, "[[lid]/[block]]", + print_lft_records }, { 0 } }; -- 1.6.0.1.196.g01914 From sashak at voltaire.com Sat Oct 18 16:27:03 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 19 Oct 2008 01:27:03 +0200 Subject: [ofa-general] [PATCH] opensm/osm_sa_lft_record.c: fix block number encoding byte order Message-ID: <20081018232703.GT5528@sashak.voltaire.com> Block number should be encoded in SA LFT Record in network byte order.
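For readers following the byte-order point above, here is a minimal sketch, outside OpenSM, of converting the block number before it is stored in a record. The names `demo_lft_record`, `demo_fill_record` and `demo_first_lid` are hypothetical, and `htons()`/`ntohs()` from `<arpa/inet.h>` stand in for complib's `cl_hton16()`/`cl_ntoh16()`; this is an illustration under those assumptions, not the OpenSM code.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for an SA LFT record: both 16-bit
 * fields are kept in network (big-endian) byte order. */
struct demo_lft_record {
	uint16_t lid;        /* network byte order */
	uint16_t block_num;  /* network byte order */
	uint8_t lft[64];     /* one 64-entry LFT block */
};

/* Fill a record from host-order values, converting on the way in. */
static void demo_fill_record(struct demo_lft_record *rec,
			     uint16_t lid_ho, uint16_t block_ho)
{
	assert(rec != NULL);
	memset(rec, 0, sizeof(*rec));
	rec->lid = htons(lid_ho);
	rec->block_num = htons(block_ho);
}

/* First LID covered by the record's block, in host order
 * (each block holds 64 entries). */
static unsigned demo_first_lid(const struct demo_lft_record *rec)
{
	assert(rec != NULL);
	return (unsigned)ntohs(rec->block_num) * 64u;
}
```

A reader of the record converts back with `ntohs()` before doing arithmetic on the block number, which is the same thing the saquery dump code does with `cl_ntoh16()`.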
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_sa_lft_record.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/opensm/opensm/osm_sa_lft_record.c b/opensm/opensm/osm_sa_lft_record.c index e1fe8d5..013a1bf 100644 --- a/opensm/opensm/osm_sa_lft_record.c +++ b/opensm/opensm/osm_sa_lft_record.c @@ -97,7 +97,7 @@ __osm_lftr_rcv_new_lftr(IN osm_sa_t * sa, memset(p_rec_item, 0, sizeof(*p_rec_item)); p_rec_item->rec.lid = lid; - p_rec_item->rec.block_num = block; + p_rec_item->rec.block_num = cl_hton16(block); /* copy the lft block */ osm_switch_get_fwd_tbl_block(p_sw, block, p_rec_item->rec.lft); -- 1.6.0.1.196.g01914 From sashak at voltaire.com Sat Oct 18 16:48:14 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 19 Oct 2008 01:48:14 +0200 Subject: [ofa-general] Re: [PATCH 1/2] opensm: replace switch's fwd_tbl with simple LFT In-Reply-To: <48F7B3D7.3070004@dev.mellanox.co.il> References: <48F7B3D7.3070004@dev.mellanox.co.il> Message-ID: <20081018234814.GU5528@sashak.voltaire.com> Hi Yevgeny, On 23:36 Thu 16 Oct , Yevgeny Kliteynik wrote: > Replace the unnecessarily complex switch's forwarding table > implementation with a simple LFT that is implemented as plain > uint8_t array. > > Signed-off-by: Yevgeny Kliteynik > --- [snip...] > diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c > index 9bf76e0..bdfc7d0 100644 > --- a/opensm/opensm/osm_switch.c > +++ b/opensm/opensm/osm_switch.c > @@ -97,9 +97,26 @@ osm_switch_init(IN osm_switch_t * const p_sw, > p_sw->num_ports = num_ports; > p_sw->need_update = 2; > > - status = osm_fwd_tbl_init(&p_sw->fwd_tbl, p_si); > - if (status != IB_SUCCESS) > + /* Initiate the linear forwarding table */ > + > + if (!p_si->lin_cap) { > + /* This switch does not support linear forwarding tables */ > + status = IB_UNSUPPORTED; > goto Exit; > + } > + > + /* The capacity reported by the switch includes LID 0, > + so add 1 to the end of the range here for this assert. 
*/ > + CL_ASSERT(cl_ntoh16(p_si->lin_cap) <= IB_LID_UCAST_END_HO + 1); Maybe there should be run-time check (not sure since lin_cap is not really used in other places in the code), but not assertion - any bogus data received from network should not crash OpenSM. I'm removing this. > + > + p_sw->lft = malloc(IB_LID_UCAST_END_HO + 1); > + if (!p_sw->lft) { > + status = IB_INSUFFICIENT_MEMORY; > + goto Exit; > + } > + > + /* Initialize the table to OSM_NO_PATH, which is "invalid port" */ > + memset(p_sw->lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); > > p_sw->lft_buf = malloc(IB_LID_UCAST_END_HO + 1); > if (!p_sw->lft_buf) { > @@ -138,7 +155,8 @@ void osm_switch_delete(IN OUT osm_switch_t ** const pp_sw) > > osm_mcast_tbl_destroy(&p_sw->mcast_tbl); > free(p_sw->p_prof); > - osm_fwd_tbl_destroy(&p_sw->fwd_tbl); > + if (p_sw->lft) > + free(p_sw->lft); > if (p_sw->lft_buf) > free(p_sw->lft_buf); > if (p_sw->hops) { > @@ -176,44 +194,36 @@ osm_switch_t *osm_switch_new(IN osm_node_t * const p_node, > /********************************************************************** > **********************************************************************/ > boolean_t > -osm_switch_get_fwd_tbl_block(IN const osm_switch_t * const p_sw, > - IN const uint32_t block_id, > - OUT uint8_t * const p_block) > +osm_switch_get_lft_block(IN const osm_switch_t * const p_sw, > + IN const uint32_t block_id, > + OUT uint8_t * const p_block) > { > uint16_t base_lid_ho; > - uint16_t max_lid_ho; > - uint16_t lid_ho; > uint16_t block_top_lid_ho; > - uint32_t lids_per_block; > - osm_fwd_tbl_t *p_tbl; > boolean_t return_flag = FALSE; > > CL_ASSERT(p_sw); > CL_ASSERT(p_block); > > - p_tbl = osm_switch_get_fwd_tbl_ptr(p_sw); > - max_lid_ho = p_sw->max_lid_ho; > - lids_per_block = osm_fwd_tbl_get_lids_per_block(&p_sw->fwd_tbl); > - base_lid_ho = (uint16_t) (block_id * lids_per_block); > + base_lid_ho = (uint16_t) (block_id * IB_SMP_DATA_SIZE); > > - if (base_lid_ho <= max_lid_ho) { > + if (base_lid_ho <= 
p_sw->max_lid_ho) { > /* Initialize LIDs in block to invalid port number. */ > memset(p_block, OSM_NO_PATH, IB_SMP_DATA_SIZE); > /* > Determine the range of LIDs we can return with this block. > */ > block_top_lid_ho = > - (uint16_t) (base_lid_ho + lids_per_block - 1); > - if (block_top_lid_ho > max_lid_ho) > - block_top_lid_ho = max_lid_ho; > + (uint16_t) (base_lid_ho + IB_SMP_DATA_SIZE - 1); > + if (block_top_lid_ho > p_sw->max_lid_ho) > + block_top_lid_ho = p_sw->max_lid_ho; > > /* > Configure the forwarding table with the routing > information for the specified block of LIDs. > */ > - for (lid_ho = base_lid_ho; lid_ho <= block_top_lid_ho; lid_ho++) > - p_block[lid_ho - base_lid_ho] = > - osm_fwd_tbl_get(p_tbl, lid_ho); > + memcpy(p_block, &(p_sw->lft[base_lid_ho]), > + block_top_lid_ho - base_lid_ho + 1); Hmm, why not just memcpy(p_block, &p_sw->lft[base_lid_ho], 64); ? And then no need initial memset()? Sasha From vlad at lists.openfabrics.org Sun Oct 19 03:15:41 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 19 Oct 2008 03:15:41 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20081019-0200 daily build status Message-ID: <20081019101541.0A755E60C5A@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed 
on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From yosefe at Voltaire.COM Sun Oct 19 06:14:55 2008 From: yosefe at Voltaire.COM (Yossi Etigin) Date: Sun, 19 Oct 2008 15:14:55 +0200 Subject: [ofa-general] [PATCH] ipoib: don't enable napi when it's already enabled Message-ID: <48FB32CF.6060202@Voltaire.COM> ipoib_open() may be called from ipoib_pkey_poll(), without calling ipoib_stop() first. This will call napi_enable() without calling napi_disable() first. Signed-off-by: Yosef Etigin -- Fixes bug https://bugs.openfabrics.org/show_bug.cgi?id=1288.
Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c 2008-10-19 14:12:55.000000000 +0200 +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c 2008-10-19 14:16:16.000000000 +0200 @@ -106,8 +106,8 @@ int ipoib_open(struct net_device *dev) ipoib_dbg(priv, "bringing up interface\n"); - napi_enable(&priv->napi); - set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); + if (!test_and_set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) + napi_enable(&priv->napi); if (ipoib_pkey_dev_delay_open(dev)) return 0; From yosefe at Voltaire.COM Sun Oct 19 06:17:58 2008 From: yosefe at Voltaire.COM (Yossi Etigin) Date: Sun, 19 Oct 2008 15:17:58 +0200 Subject: [ofa-general] [PATCH] ipoib: take rtnl_lock when calling ipoib_open Message-ID: <48FB3386.2020903@Voltaire.COM> ipoib_open() has to be called with rtnl_lock. First, it calls dev_change_flags() which needs rtnl_lock, and second, it needs to sync with ipoib_stop(). Signed-off-by: Yossi Etigin -- It will not deadlock because we don't flush ipoib_workqueue with rtnl_lock taken. 
Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2008-10-17 17:29:51.000000000 +0200 +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2008-10-19 14:17:12.000000000 +0200 @@ -1045,9 +1045,11 @@ void ipoib_pkey_poll(struct work_struct ipoib_pkey_dev_check_presence(dev); - if (test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) + if (test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) { + rtnl_lock(); ipoib_open(dev); - else { + rtnl_unlock(); + } else { mutex_lock(&pkey_mutex); if (!test_bit(IPOIB_PKEY_STOP, &priv->flags)) From halr at obsidianresearch.com Sun Oct 19 07:54:30 2008 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Sun, 19 Oct 2008 08:54:30 -0600 Subject: [ofa-general] [PATCH] infiniband-diags/saquery: Support MFTRecord Message-ID: <48FB4A26.3050504@obsidianresearch.com> Sasha, This patch adds support for MFTRecord in saquery. -- Hal -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: patch-saq-mftr1 URL: From halr at obsidianresearch.com Sun Oct 19 07:54:32 2008 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Sun, 19 Oct 2008 08:54:32 -0600 Subject: [ofa-general] [PATCH] infiniband-diags/saquery.8: Update saquery man page Message-ID: <48FB4A28.3000400@obsidianresearch.com> Sasha, This patch updates the saquery man page. -- Hal -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: patch-saq-man1 URL: From sashak at voltaire.com Sun Oct 19 08:14:50 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 19 Oct 2008 17:14:50 +0200 Subject: [ofa-general] [PATCH] opensm/osm_sa: fix memory leak in SA responder Message-ID: <20081019151450.GU25831@sashak.voltaire.com> On each SA response (osm_sa_respond()) OpenSM eats memory. free() at end of the function does not work because all items were removed from the list already. Fixing this. 
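The leak described above can be sketched in isolation: once an item has been unlinked from the list, nothing can reach it later, so it has to be released inside the drain loop. In this sketch a plain singly linked list stands in for complib's `cl_qlist`, and `demo_item`, `demo_push` and `demo_drain` are hypothetical names; it only illustrates the pattern, not the OpenSM function.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_DATA_SIZE 8

struct demo_item {
	struct demo_item *next;
	unsigned char data[DEMO_DATA_SIZE];
};

/* Push a new item whose payload is `byte` repeated; returns 0 on success. */
static int demo_push(struct demo_item **head, unsigned char byte)
{
	struct demo_item *item;

	assert(head != NULL);
	item = malloc(sizeof(*item));
	if (!item)
		return -1;
	memset(item->data, byte, sizeof(item->data));
	item->next = *head;
	*head = item;
	return 0;
}

/* Copy each item's payload into `out` and free the item right after it
 * is unlinked. Returns the number of items drained. */
static size_t demo_drain(struct demo_item **head, unsigned char *out,
			 size_t attr_size)
{
	size_t n = 0;

	assert(head != NULL && out != NULL);
	assert(attr_size <= DEMO_DATA_SIZE);
	while (*head) {
		struct demo_item *item = *head;

		*head = item->next;                 /* unlink */
		memcpy(out, item->data, attr_size);
		out += attr_size;
		free(item);   /* the list no longer references it */
		n++;
	}
	return n;
}
```

A cleanup pass that walks the list after this loop would find it empty, which is why a free placed at the end of the function cannot release these items.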
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_sa.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/opensm/opensm/osm_sa.c b/opensm/opensm/osm_sa.c index 670deae..fb2f962 100644 --- a/opensm/opensm/osm_sa.c +++ b/opensm/opensm/osm_sa.c @@ -494,6 +494,7 @@ void osm_sa_respond(osm_sa_t *sa, osm_madw_t *madw, size_t attr_size, item = cl_qlist_remove_head(list); memcpy(p, ((struct item_data *)item)->data, attr_size); p += attr_size; + free(item); } osm_sa_send(sa, resp_madw, FALSE); -- 1.6.0.2.287.g3791f From sashak at voltaire.com Sun Oct 19 09:51:12 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 19 Oct 2008 18:51:12 +0200 Subject: [ofa-general] Re: [PATCH] infiniband-diags/saquery: Support MFTRecord In-Reply-To: <48FB4A26.3050504@obsidianresearch.com> References: <48FB4A26.3050504@obsidianresearch.com> Message-ID: <20081019165112.GV25831@sashak.voltaire.com> On 08:54 Sun 19 Oct , Hal Rosenstock wrote: > Sasha, > > This patch adds support for MFTRecord in saquery. > > -- Hal > infiniband-diags/saquery: Add support for MFTRecord > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Sun Oct 19 09:51:30 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 19 Oct 2008 18:51:30 +0200 Subject: [ofa-general] Re: [PATCH] infiniband-diags/saquery.8: Update saquery man page In-Reply-To: <48FB4A28.3000400@obsidianresearch.com> References: <48FB4A28.3000400@obsidianresearch.com> Message-ID: <20081019165130.GW25831@sashak.voltaire.com> On 08:54 Sun 19 Oct , Hal Rosenstock wrote: > Sasha, > > This patch updates the saquery man page. > > -- Hal > infiniband-diags/saquery.8: Update saquery man page for additional queries > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From sashak at voltaire.com Sun Oct 19 10:58:27 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 19 Oct 2008 19:58:27 +0200 Subject: [ofa-general] [PATCH] opensm/osm_mcast_mgr: fix memory leak Message-ID: <20081019175827.GX25831@sashak.voltaire.com> In case when switch is member of MC group mcast working object must be freed after routing was set. Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_mcast_mgr.c | 8 +++++++- 1 files changed, 7 insertions(+), 1 deletions(-) diff --git a/opensm/opensm/osm_mcast_mgr.c b/opensm/opensm/osm_mcast_mgr.c index c4cd632..c9fed40 100644 --- a/opensm/opensm/osm_mcast_mgr.c +++ b/opensm/opensm/osm_mcast_mgr.c @@ -671,11 +671,17 @@ static osm_mtree_node_t *__osm_mcast_mgr_branch(osm_sm_t * sm, table for this switch. */ osm_mcast_tbl_set(p_tbl, mlid_ho, i); - if (i == 0) + if (i == 0) { /* This means we are adding the switch to the MC group. We do not need to continue looking at the remote port, just needed to add the port to the table */ + CL_ASSERT(count == 1); + + p_wobj = (osm_mcast_work_obj_t *) + cl_qlist_remove_head(p_port_list); + __osm_mcast_work_obj_delete(p_wobj); continue; + } p_node = p_sw->p_node; p_remote_node = osm_node_get_remote_node(p_node, i, NULL); -- 1.6.0.2.287.g3791f From sashak at voltaire.com Sun Oct 19 11:51:03 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 19 Oct 2008 20:51:03 +0200 Subject: [ofa-general] [PATCH] opensm/release notes: add recent changes Message-ID: <20081019185103.GY25831@sashak.voltaire.com> Add recent changes and bug fixes to OpenSM 3.2 Release Notes. 
Signed-off-by: Sasha Khapyorsky --- opensm/doc/opensm_release_notes-3.2.txt | 21 +++++++++++++++++++++ 1 files changed, 21 insertions(+), 0 deletions(-) diff --git a/opensm/doc/opensm_release_notes-3.2.txt b/opensm/doc/opensm_release_notes-3.2.txt index 7316765..ce7ad90 100644 --- a/opensm/doc/opensm_release_notes-3.2.txt +++ b/opensm/doc/opensm_release_notes-3.2.txt @@ -23,6 +23,15 @@ This document includes the following sections: 1.1 Major New Features +* Cached Routing + OpenSM provides an optional unicast routing cache (enabled by '-A' or + '--ucast_cache' options). When enabled, unicast routing cache prevents + routing recalculation (which is a heavy task in a large cluster) when + there was no topology change detected during the heavy sweep, or when + the topology change does not require new routing calculation, e.g. when + one or more CAs/RTRs/leaf switches going down, or one or more of these + nodes coming back after being down. + * Routing Chaining Routing chaining is the ability to configure the order in which routing algorithms are applied in opensm, i.e. '-R ftree,updn,minhop' - try @@ -173,6 +182,12 @@ This document includes the following sections: * Add generated osm_config.h file with OpenSM specific defines +* Display port number in decimal in log messages + +* Replace osm_vendor_select.h by generated osm_config.h + +* Unify options listing in OpenSM usage message + 1.3 Library API Changes None @@ -334,6 +349,12 @@ information regarding each compliance statement. * opensm: fix broken IPv6 SNM consolidation code +* opensm/osm_sa_lft_record.c: fix block number encoding byte order + +* opensm/osm_sa: fix memory leak in SA responder + +* opensm/osm_mcast_mgr: fix memory leak + * Other less critical or visible bugs were also fixed. 
5 Main Verification Flows -- 1.6.0.2.287.g3791f From sashak at voltaire.com Sun Oct 19 12:19:38 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 19 Oct 2008 21:19:38 +0200 Subject: [ofa-general] ***SPAM*** [ANNOUNCE] management tarballs release Message-ID: <20081019191938.GZ25831@sashak.voltaire.com> Hi, There is a new release of the management (OpenSM and infiniband diagnostics) tarballs available in: http://www.openfabrics.org/downloads/management/ md5sum: 5ce057a956435f818621802849fe4d51 libibcommon-1.1.2.tar.gz be74f2916d1901a789b3b558d77b2a8a libibumad-1.2.2.tar.gz f78425228d34ff80440a7669027732ad libibmad-1.2.2.tar.gz 7fbb303e8cbf3f0b8418a0f565ddd7ce opensm-3.2.3.tar.gz 12a32e384dfa92f6a8df8fbaba8bf3b3 infiniband-diags-1.4.2.tar.gz All component versions are from recent master branch. Full change log is below. Sasha Al Chu (21): opensm: reroute console option opensm: rename ucast_file and ucast_dump_file to lfts_file opensm: fix instruction typo in config file opensm: fix segfault corner case when osm_console_init fails opensm/console: close console socket on cleanup path opensm: fix comment typo opensm: remove old comment opensm: fix routing algorithm description add clarification in perfquery manpage use 0xff vs. 
255 consistently in perfquery add more detail to error message on perfquery workaround add --loop_ports option to perfquery infiniband-diags/perfquery: support ehanced port 0 with --loop-ports infiniband-diags/perfquery: fix comments infiniband-diags/ibclearerrors: specify -a in call to perfquery tweak notes about port 255 in perfquery manpage infiniband-diags/perfquery: error out if AllPortSelect is not supported infinibanf-diags/perfquery: remove single port CA AllPortSelect workaround infiniband-diags/perfquery: if --loop_ports is specified always loop through all ports if desired perfquery code cleanup infiniband-diags/perfquery: if -a is specified loop through ports if required and aggregate output Albert Chu (2): opensm/opensm.init: do not specify run levels on SuSE opensm: routing chaining Doug Ledford (3): opensm/*/Makefile.am: remove install-exec-hook opensm: remove -rpath from LDFLAGS Trivial: update usage info for ibaddr.c in infiniband-diags-1.4.1 Hal Rosenstock (23): opensm/osm_lin_fwd_tbl.h: Cosmetic changes opensm/osm_rand_fwd_tbl.h: Cosmetic changes opensm/osm_console_io.c: Eliminate compile warnings opensm/osm_qos_parser.y: Eliminate bison warning libibmad/ChangeLog: Fix typo libibumad/umad.c: Cosmetic change opensm/osm_inform.c: Fix compile warning complib/cl_event_wheel.c: Fix some printf typos opensm/osm_lid_mgr.c: Convert lid range prints to decimal opensm/osm_lid_mgr.c: Cosmetic format change opensm/include/opensm/osm_subnet.h: Update some comments opensm/osm_helper.c: Change lids_per_port to decimal OpenSM: More conversion to (unsiged) decimal lid format osmtest/osmtest.8: Fix log_file option description in man page osm_(helper trap_rcv).c: Change output format of notice type to unsigned decimal OpenSM: More man and doc changes for opensm.conf OpenSM: Display port number in decimal in log messages OpenSM: Display port number in decimal in log messages infiniband-diags/ibsysstat.c: Fix a couple of latent bugs OpenSM release notes: Clarify 
OpenSM compatibility due to change in default SM/SA keys OpenSM release notes: Indicate InfiniScale-IV support infiniband-diags/saquery: Support MFTRecord infiniband-diags/saquery.8: Update saquery man page Ira Weiny (9): infiniband-diags/src/saquery.c: convert GID prints to use inet_ntop infiniband-diags/src/ibaddr.c: convert GID prints to use inet_ntop OpenSM: convert GID prints to use inet_ntop opensm: Add ib_trap_str function opensm: Add a Node Description check on light sweep. Fix some missing node name map substitutions opensm: move vendor specific compilation flags to config.h ibnetdiscover.c: continue processing other ports even if smpquery fails on one port opensm: Add osm_config.h file Jack Morgenstein (2): libibmad: eliminate compiler warnings on x86_64 infiniband-diags: eliminate compiler warnings Keshetti Mahesh (3): complib: trivial change in description of cl_list_insert_tail function opensm/osm_ucast_lash.c: remove unused variables opensm/osm_ucast_lash: find_port_from_lid() function is not used Sasha Khapyorsky (106): libibmad: Bump a library version opensm/osm_sa_lft_record: validate LFT block number opensm/sa: remove local *get_port_by_guid() wrappers opensm/osm_sa_lft_record: pass block parameter in host byte order opensm: speedup and improve ipv6 snm handling opensm: improve port_prof_ignore handling infiniband-diags/saquery: remove unused variable infiniband-diags/src/ibaddr.c: remove unused variables opensm: minor: move GID print buffer definitions under nearest condition opensm/osm_log: reverse log level check flow opensm/osm_log: osl_log() speedup opensm/osm_helper.c: trivial simplifications opensm/configure.in: don't touch CFLAGS directly opensm/osm_sa_mcmember_record: fix uninitilized variable use opensm/osm_ucast_lash: remove an invalid error log opensm/osm_ucast_updn: remove some debug logging opensm/osm_ucast_updn: move and rename __osm_updn_find_root_nodes_by_min_hop() opensm/*/Makefile.am: remove explicit -lpthread and -ldl flags 
from Makefile.am opensm/OSM_LOG(): wrap osm_log call with log level check opensm: remove osm_log_is_active() check opensm/osmtest: convert osm_log() to OSM_LOG() macro opensm/libvendor: convert osm_log() to OSM_LOG() macro opensm: simplify flow in __osm_state_mgr_light_sweep_start() opensm: add osm_version field to osm_opensm_t object opensm/osm_helper: remove some empty lines opensm/*/Makefile.am: sort header file list by name opensm/osm_state_mgr: fix ERR code in __osm_state_mgr_light_sweep_start() opensm/osm_state_mgr: trivial changes opensm/osm_sminfo_rcv.c: cosmetic opensm/include/Makefile.am: don't duplicate header files in EXTRA_DIST opensm: install all OpenSM header files opensm/event_plugin: plugin API version 2 opensm/osm_sminfo_rcv.c: improve locking opensm/osm_mtree.c: mask unused static function __osm_mtree_dump() opensm/osm_sa_class_port_info.c: fix over bound array access osmtest/osmt_service.c: fix over bound array access opensm/osm_ucast_lash.c: A couple more conversions to (unsigned) decimal lid format opensm: cleanup osm_sweep_fail_ctrl opensm: remove osm_sweep_fail_ctrl.[ch] files opensm/osm_sm.h: fix comment opensm: add OSM_EVENT_ID_SUBNET_UP event opensm: redo lex and yacc files generation opensm: query remote SMs during light sweep osmtest: fix qpn encoding in osmtest_informinfo_request() osm_vendor_ibumad_sa: in a log print mad_status in host byte order opensm/osm_sa_informinfo.c: consolidate flows opensm/osm_version.h.in: remove 'extern "C"' braces opensm/config/osmvsel.m4: remove unused osmv_save_ldflags variable opensm: remove USEGPPLINK hack opensm/osm_qos_policy.c: cosmetic simplification opensm: opensm_release_notes-3.2.txt template opensm/config: remove unused variables opensm/Makefile.am: remove -fno-strict-aliasing flag from *CFLAGS opensm/*/Makefile.am: remove -Wno-deprecated-declarations C flags opensm: fix strict-aliasing rules warnings opensm: fix in usage message man/ibnetdiscover: cleanup non-existing options 
infiniband-diags/ibclearcounters: remove unrelated -N option man/ibtracert: remove non-existing -e option infiniband-diags/ibtracert: fix port by direct path resolving opensm: do not start opensm on boot automatically opensm/redhat-opensm.init.in: make config file optional opensm/opensm.spec.in: don't install old format conf file opensm/osm_mcm_port.c: remove osm_mcm_port_init() opensm/osm_sm.c: cosmetic opensm/osm_multicast.[ch]: simplify flows, remove unused functions opensm: simplify osm_get_mgrp_by_mgid() search function opensm/osm_sa_mcmember_record.c: cleanup code, simplify flows management/gen_chlog.sh: use git command instead of 'git-*' wrappers management/make.dist: use '_' in version number rather than '-'. opensm: multicast group create/delete notification fix opensm: consolidate mgrp_send_notice calls opensm/osm_opensm.c: cosmetic formatting opensm/osm_log.c: provide useful error message when file opening fails opensm/man: remove any opensm.opts mentions from opensm man page opensm/osm_sa_mcmember_record.c: cosmetic - remove empty line opensm/osm_ucast_updn: cosmetic opensm/osm_ucast_mgr: remove any_change tracking opensm/main.c: trivial usage message formatting fix opensm/osm_ucast_lash: fix buffer overflow opensm/osm_ucast_lash.c: minor code optimization opensm: fix broken IPv6 SNM consolidation code infiniband-diags/perfquery: fix constant size warning opensm/config/osmvsel.m4: fix indentations opensm/vendor: remove osm_pkt_randomizer from ibumad vendor build opensm/opensm/Makefile.am: merge CFLAGS opensm/osm_config.h: generated OSM_DEBUG macro opensm: update Release Notes for OpenSM version 3.2 management/*/COPYING: update Voltaire copyright years management: add Voltaire copyright opensm/vendor: replace osm_vendor_select.h by osm_config.h opensm: Unify options listing in usage() message opensm/osm_event_plugin.c: include config.h opensm: 3.2 release notes update opensm/osm_ucast_cache: trivial comment fix opensm/osm_ucast_lash.c: indentation fixes 
infiniband-diags/saquery: add LFT Record handling opensm/osm_sa_lft_record.c: fix block number encoding byte order opensm/osm_sa: fix memory leak in SA responder opensm/osm_mcast_mgr: fix memory leak opensm/osm_mcast_mgr.c: cosmetic management: update Voltaire copyright opensm/libopensm.ver: advance libopensm library version libibumad: check that node type has IB type opensm/release notes: add recent changes management/*/configure.in: bump all versions Tim Meier (1): opensm: osm_opensm.c - changed load_plugins() arg to const Yevgeny Kliteynik (17): opensm/osm_qos_parser.l: add 'noinput' lexer option to remove compiler warning opensm/Makefile.am: fixing compilation error with osm_version.h opensm/Makefile.am: Fix dependency for 'make -j2' opensm/Makefile.am: add dependency rule opensm/osm_ucast_mgr.c: cleaning port_order_list opensm/libvendor/osm_vendor_mlx_sa.c: handling attribute offset of 0 opensm/osm_state_mgr.c: fixing some typos opensm/osm_qos_policy.c: removing some log messages opensm/osm_qos_policy.c: log matched qos criteria opensm/Makefile.am: adding yacc-generated .h file as dependency opensm/Unicast Routing Cache: move lft_buf from ucast_mgr to osm_switch opensm/Unicast Routing Cache: add -A / --ucast_cache option opensm/Unicast Routing Cache: add osm_ucast_cache.{c,h} files opensm/Unicast Routing Cache: compile cache files opensm/Unicast Routing Cache: integrate cache into opensm opensm/Unicast Routing Cache: manpage entry opensm/doc/current-routing.txt: added ucast cache info From vinnie at sgi.com Sun Oct 19 21:16:22 2008 From: vinnie at sgi.com (Vincent Rizza) Date: Mon, 20 Oct 2008 15:16:22 +1100 Subject: [ofa-general] [PATCH] mlx4: Set sq_sig_type and qp_type in qp_init_attr struct Message-ID: <48FC0616.8060501@sgi.com> Set correct sq_sig_type and qp_type in qp_init_attr. 
Signed-off-by: Vincent Rizza Signed-off-by: Brett Grandbois Signed-off-by: Greg Banks Signed-off-by: Max Matveev Signed-off-by: Ken Sandars --- drivers/infiniband/hw/mlx4/qp.c | 7 +++++++ 1 files changed, 7 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 9559248..37fc05d 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -1910,6 +1910,13 @@ done: qp_init_attr->cap = qp_attr->cap; + if (qp->sq_signal_bits & MLX4_WQE_CTRL_CQ_UPDATE) + qp_init_attr->sq_sig_type = IB_SIGNAL_ALL_WR; + else + qp_init_attr->sq_sig_type = IB_SIGNAL_REQ_WR; + + qp_init_attr->qp_type = qp->ibqp.qp_type; + qp_init_attr->create_flags = 0; if (qp->flags & MLX4_IB_QP_BLOCK_MULTICAST_LOOPBACK) qp_init_attr->create_flags |= IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK; From vinnie at sgi.com Sun Oct 19 21:16:29 2008 From: vinnie at sgi.com (Vincent Rizza) Date: Mon, 20 Oct 2008 15:16:29 +1100 Subject: [ofa-general] [PATCH] mthca: Create sysfs entries for MSI-X interrupts if enabled Message-ID: <48FC061D.9090508@sgi.com> Creates a sysfs entry for each MSI-X vector containing the IRQ value. This patch applies to the mthca driver. 
Signed-off-by: Vincent Rizza Signed-off-by: Brett Grandbois Signed-off-by: Greg Banks Signed-off-by: Max Matveev Signed-off-by: Ken Sandars --- drivers/infiniband/hw/mthca/mthca_provider.c | 62 +++++++++++++++++++++++--- 1 files changed, 55 insertions(+), 7 deletions(-) diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c index 87ad889..70fc686 100644 --- a/drivers/infiniband/hw/mthca/mthca_provider.c +++ b/drivers/infiniband/hw/mthca/mthca_provider.c @@ -1232,10 +1232,40 @@ static ssize_t show_board(struct device *device, struct device_attribute *attr, return sprintf(buf, "%.*s\n", MTHCA_BOARD_ID_LEN, dev->board_id); } -static DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); -static DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); -static DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); -static DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL); +static ssize_t show_msix_async_irq(struct device *device, + struct device_attribute *attr, char *buf) +{ + struct mthca_dev *dev = + container_of(device, struct mthca_dev, ib_dev.dev); + return sprintf(buf, "%u\n", + dev->eq_table.eq[MTHCA_EQ_ASYNC].msi_x_vector); +} + +static ssize_t show_msix_cmd_irq(struct device *device, + struct device_attribute *attr, char *buf) +{ + struct mthca_dev *dev = + container_of(device, struct mthca_dev, ib_dev.dev); + return sprintf(buf, "%u\n", + dev->eq_table.eq[MTHCA_EQ_CMD].msi_x_vector); +} + +static ssize_t show_msix_comp_irq(struct device *device, + struct device_attribute *attr, char *buf) +{ + struct mthca_dev *dev = + container_of(device, struct mthca_dev, ib_dev.dev); + return sprintf(buf, "%u\n", + dev->eq_table.eq[MTHCA_EQ_COMP].msi_x_vector); +} + +static DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); +static DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL); +static DEVICE_ATTR(msix_async_irq, S_IRUGO, 
show_msix_async_irq, NULL); +static DEVICE_ATTR(msix_cmd_irq, S_IRUGO, show_msix_cmd_irq, NULL); +static DEVICE_ATTR(msix_comp_irq, S_IRUGO, show_msix_comp_irq, NULL); static struct device_attribute *mthca_dev_attributes[] = { &dev_attr_hw_rev, @@ -1244,6 +1274,12 @@ static struct device_attribute *mthca_dev_attributes[] = { &dev_attr_board_id }; +static struct device_attribute *mthca_msix_attributes[] = { + &dev_attr_msix_async_irq, + &dev_attr_msix_cmd_irq, + &dev_attr_msix_comp_irq +}; + static int mthca_init_node_data(struct mthca_dev *dev) { struct ib_smp *in_mad = NULL; @@ -1406,15 +1442,27 @@ int mthca_register_device(struct mthca_dev *dev) for (i = 0; i < ARRAY_SIZE(mthca_dev_attributes); ++i) { ret = device_create_file(&dev->ib_dev.dev, mthca_dev_attributes[i]); - if (ret) { - ib_unregister_device(&dev->ib_dev); - return ret; + if (ret) + goto err_reg; + } + + /* Only create msix entries if msix is enabled */ + if (dev->pdev->msix_enabled) { + for (i = 0; i < ARRAY_SIZE(mthca_msix_attributes); ++i) { + ret = device_create_file(&dev->ib_dev.dev, + mthca_msix_attributes[i]); + if (ret) + goto err_reg; } } mthca_start_catas_poll(dev); return 0; + +err_reg: + ib_unregister_device(&dev->ib_dev); + return ret; } void mthca_unregister_device(struct mthca_dev *dev) From vinnie at sgi.com Sun Oct 19 21:16:33 2008 From: vinnie at sgi.com (Vincent Rizza) Date: Mon, 20 Oct 2008 15:16:33 +1100 Subject: [ofa-general] [PATCH] mlx4: Create sysfs entries for MSI-X interrupts if enabled Message-ID: <48FC0621.8040502@sgi.com> Creates a sysfs entry for each MSI-X vector containing the IRQ value. This patch applies to the mlx4 infiniband driver. 
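As a rough illustration of how such an entry might be consumed from user space, the sketch below reads one of the proposed attributes. It assumes the files would appear next to the existing hw_rev and board_id attributes under /sys/class/infiniband/<device>/; only the attribute names come from the patch, and the helpers read_irq_attr() and irq_attr_path() are made up for this example.

```c
#include <stdio.h>

/*
 * Read one unsigned decimal value from a sysfs-style attribute file.
 * Returns 0 on success and stores the value in *irq, -1 on any error.
 */
static int read_irq_attr(const char *path, unsigned int *irq)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%u", irq) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Build the assumed attribute path for an IB device and attribute name.
 * Returns 0 on success, -1 if the result does not fit in buf.
 */
static int irq_attr_path(char *buf, size_t len,
			 const char *ibdev, const char *attr)
{
	int n = snprintf(buf, len, "/sys/class/infiniband/%s/%s", ibdev, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A caller would build the path for, say, "mlx4_0" and "msix_comp_irq", read the value, and match it against the first column of /proc/interrupts to tell the cards apart.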
Signed-off-by: Vincent Rizza Signed-off-by: Brett Grandbois Signed-off-by: Greg Banks Signed-off-by: Max Matveev Signed-off-by: Ken Sandars --- drivers/infiniband/hw/mlx4/main.c | 39 +++++++++++++++++++++++++++++++++--- drivers/net/mlx4/eq.c | 1 + drivers/net/mlx4/mlx4.h | 6 ----- include/linux/mlx4/device.h | 7 ++++++ 4 files changed, 43 insertions(+), 10 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index a3c2851..323ab89 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -527,10 +527,28 @@ static ssize_t show_board(struct device *device, struct device_attribute *attr, dev->dev->board_id); } -static DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); -static DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); -static DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); -static DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL); +static ssize_t show_msix_async_irq(struct device *device, + struct device_attribute *attr, char *buf) +{ + struct mlx4_ib_dev *dev = + container_of(device, struct mlx4_ib_dev, ib_dev.dev); + return sprintf(buf, "%u\n", dev->dev->msix_irqs[MLX4_EQ_ASYNC]); +} + +static ssize_t show_msix_comp_irq(struct device *device, + struct device_attribute *attr, char *buf) +{ + struct mlx4_ib_dev *dev = + container_of(device, struct mlx4_ib_dev, ib_dev.dev); + return sprintf(buf, "%u\n", dev->dev->msix_irqs[MLX4_EQ_COMP]); +} + +static DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL); +static DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL); +static DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL); +static DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL); +static DEVICE_ATTR(msix_async_irq, S_IRUGO, show_msix_async_irq, NULL); +static DEVICE_ATTR(msix_comp_irq, S_IRUGO, show_msix_comp_irq, NULL); static struct device_attribute *mlx4_class_attributes[] = { &dev_attr_hw_rev, @@ -539,6 +557,11 @@ static struct device_attribute *mlx4_class_attributes[] = { &dev_attr_board_id }; +static 
struct device_attribute *mlx4_msix_attributes[] = { + &dev_attr_msix_async_irq, + &dev_attr_msix_comp_irq +}; + static void *mlx4_ib_add(struct mlx4_dev *dev) { static int mlx4_ib_version_printed; @@ -666,6 +689,14 @@ static void *mlx4_ib_add(struct mlx4_dev *dev) goto err_reg; } + if (dev->pdev->msix_enabled) { + for (i = 0; i < ARRAY_SIZE(mlx4_msix_attributes); i++) { + if (device_create_file(&ibdev->ib_dev.dev, + mlx4_msix_attributes[i])) + goto err_reg; + } + } + return ibdev; err_reg: diff --git a/drivers/net/mlx4/eq.c b/drivers/net/mlx4/eq.c index 8a8b561..00d0357 100644 --- a/drivers/net/mlx4/eq.c +++ b/drivers/net/mlx4/eq.c @@ -600,6 +600,7 @@ int mlx4_init_eq_table(struct mlx4_dev *dev) goto err_out_async; priv->eq_table.eq[i].have_irq = 1; + dev->msix_irqs[i] = priv->eq_table.eq[i].irq; } } else { diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h index 5337e3a..712a294 100644 --- a/drivers/net/mlx4/mlx4.h +++ b/drivers/net/mlx4/mlx4.h @@ -63,12 +63,6 @@ enum { }; enum { - MLX4_EQ_ASYNC, - MLX4_EQ_COMP, - MLX4_NUM_EQ -}; - -enum { MLX4_NUM_PDS = 1 << 15 }; diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index b2f9444..9b0c633 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -53,6 +53,12 @@ enum { }; enum { + MLX4_EQ_ASYNC, + MLX4_EQ_COMP, + MLX4_NUM_EQ +}; + +enum { MLX4_DEV_CAP_FLAG_RC = 1 << 0, MLX4_DEV_CAP_FLAG_UC = 1 << 1, MLX4_DEV_CAP_FLAG_UD = 1 << 2, @@ -339,6 +345,7 @@ struct mlx4_dev { struct radix_tree_root qp_table_tree; u32 rev_id; char board_id[MLX4_BOARD_ID_LEN]; + u16 msix_irqs[MLX4_NUM_EQ]; }; struct mlx4_init_port_param { From vinnie at sgi.com Sun Oct 19 21:16:40 2008 From: vinnie at sgi.com (Vincent Rizza) Date: Mon, 20 Oct 2008 15:16:40 +1100 Subject: [ofa-general] [PATCH] libmlx4: Re-calculate number of inline segments Message-ID: <48FC0628.3010801@sgi.com> From: Brett Grandbois Supplying an ibv_qp_cap.max_inline_data value of 460 for mlx4_create_qp was getting back 
ENOMEM when the max should have been 928. Tracked the bug to the inline segment calculation. Here's the fix. Signed-off-by: Vincent Rizza Signed-off-by: Brett Grandbois Signed-off-by: Greg Banks Signed-off-by: Max Matveev Signed-off-by: Ken Sandars --- src/qp.c | 51 ++++++++++++++++++++++++++++++++++++++++----------- 1 files changed, 40 insertions(+), 11 deletions(-) diff --git a/src/qp.c b/src/qp.c index bb98c09..759ef51 100644 --- a/src/qp.c +++ b/src/qp.c @@ -497,6 +497,13 @@ out: static int num_inline_segs(int data, enum ibv_qp_type type) { + int initial_seg; + int num_segs = 0; + + /* ask for nothing, get nothing */ + if (!data) + goto out; + /* * Inline data segments are not allowed to cross 64 byte * boundaries. For UD QPs, the data segments always start @@ -505,18 +512,40 @@ static int num_inline_segs(int data, enum ibv_qp_type type) * control segment and possibly a 16 byte remote address * segment, so in the worst case there will be only 32 bytes * available for the first data segment. + * So we need to do a little overhead processing only in the + * case of a non-UD QP. 
*/ - if (type == IBV_QPT_UD) - data += (sizeof (struct mlx4_wqe_ctrl_seg) + - sizeof (struct mlx4_wqe_datagram_seg)) % - MLX4_INLINE_ALIGN; - else - data += (sizeof (struct mlx4_wqe_ctrl_seg) + - sizeof (struct mlx4_wqe_raddr_seg)) % - MLX4_INLINE_ALIGN; + if (type != IBV_QPT_UD) { + + initial_seg = MLX4_INLINE_ALIGN - + (sizeof (struct mlx4_wqe_ctrl_seg) + + sizeof (struct mlx4_wqe_raddr_seg) + + sizeof (struct mlx4_wqe_inline_seg)); + + num_segs = 1; + /* + * no point continuing if everything fits in the + * initial segment + */ + if (data <= initial_seg) + goto out; - return (data + MLX4_INLINE_ALIGN - sizeof (struct mlx4_wqe_inline_seg) - 1) / - (MLX4_INLINE_ALIGN - sizeof (struct mlx4_wqe_inline_seg)); + /* + * If there's room in the initial segment, make sure we + * account for it before doing the full segment calcs + */ + data -= initial_seg; + } + + /* at this point we are just interested in how many segments are needed */ + num_segs += data / (MLX4_INLINE_ALIGN - sizeof (struct mlx4_wqe_inline_seg)); + + /* account for any leftovers */ + if (data % (MLX4_INLINE_ALIGN - sizeof (struct mlx4_wqe_inline_seg))) + num_segs++; + +out: + return num_segs; } void mlx4_calc_sq_wqe_size(struct ibv_qp_cap *cap, enum ibv_qp_type type, From rdreier at cisco.com Sun Oct 19 22:10:51 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 19 Oct 2008 22:10:51 -0700 Subject: [ofa-general] [PATCH] libmlx4: Re-calculate number of inline segments In-Reply-To: <48FC0628.3010801@sgi.com> (Vincent Rizza's message of "Mon, 20 Oct 2008 15:16:40 +1100") References: <48FC0628.3010801@sgi.com> Message-ID: > Supplying an ibv_qp_cap.max_inline_data value of 460 for mlx4_create_qp > was getting back ENOMEM when the max should have been 928. Tracked the bug > to the inline segment calculation. Here's the fix. I don't see what's wrong with the current code, or why your change is anything but a more obfuscated way of calculating the same thing. 
And indeed, I just adapted a quick test program (below) which tries to create QPs with max_inline_data of 460, and I get the results:

 RC: inline 460 ok (got 928)
 UC: inline 460 ok (got 928)
 UD: inline 460 ok (got 900)

so it seems to work on my system.

 - R.

Here's the test code:

#include <stdio.h>
#include <string.h>

#include <infiniband/verbs.h>

int main(int argc, char *argv[])
{
	struct ibv_device **dev_list;
	struct ibv_device_attr dev_attr;
	struct ibv_context *context;
	struct ibv_pd *pd;
	struct ibv_cq *cq;
	struct ibv_qp_init_attr qp_attr;
	int t;
	static const struct {
		enum ibv_qp_type type;
		char *name;
	} type_tab[] = {
		{ IBV_QPT_RC, "RC" },
		{ IBV_QPT_UC, "UC" },
		{ IBV_QPT_UD, "UD" },
	};

	dev_list = ibv_get_device_list(NULL);
	if (!dev_list) {
		printf("No IB devices found\n");
		return 1;
	}

	for (; *dev_list; ++dev_list) {
		printf("%s:\n", ibv_get_device_name(*dev_list));

		context = ibv_open_device(*dev_list);
		if (!context) {
			printf(" ibv_open_device failed\n");
			continue;
		}

		if (ibv_query_device(context, &dev_attr)) {
			printf(" ibv_query_device failed\n");
			continue;
		}

		cq = ibv_create_cq(context, 1, NULL, NULL, 0);
		if (!cq) {
			printf(" ibv_create_cq failed\n");
			continue;
		}

		pd = ibv_alloc_pd(context);
		if (!pd) {
			printf(" ibv_alloc_pd failed\n");
			continue;
		}

		for (t = 0; t < sizeof type_tab / sizeof type_tab[0]; ++t) {
			memset(&qp_attr, 0, sizeof qp_attr);

			qp_attr.send_cq = cq;
			qp_attr.recv_cq = cq;
			qp_attr.cap.max_send_wr = 1;
			qp_attr.cap.max_recv_wr = 1;
			qp_attr.cap.max_send_sge = 1;
			qp_attr.cap.max_recv_sge = 1;
			qp_attr.cap.max_inline_data = 460;
			qp_attr.qp_type = type_tab[t].type;

			printf(" %s: inline %d ", type_tab[t].name,
			       qp_attr.cap.max_inline_data);

			if (ibv_create_qp(pd, &qp_attr))
				printf("ok (got %d)\n",
				       qp_attr.cap.max_inline_data);
			else
				printf("FAILED\n");
		}
	}

	return 0;
}

From rdreier at cisco.com Sun Oct 19 22:13:12 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 19 Oct 2008 22:13:12 -0700 Subject: [ofa-general] [PATCH] mlx4: Set sq_sig_type and qp_type in qp_init_attr struct In-Reply-To:
<48FC0616.8060501@sgi.com> (Vincent Rizza's message of "Mon, 20 Oct 2008 15:16:22 +1100") References: <48FC0616.8060501@sgi.com> Message-ID: > Set correct sq_sig_type and qp_type in qp_init_attr. Why? What cares about this? The qp_type seems especially silly, since it's already in struct ib_qp. And mthca doesn't set this info either. - R. From rdreier at cisco.com Sun Oct 19 22:13:58 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 19 Oct 2008 22:13:58 -0700 Subject: [ofa-general] [PATCH] mthca: Create sysfs entries for MSI-X interrupts if enabled In-Reply-To: <48FC061D.9090508@sgi.com> (Vincent Rizza's message of "Mon, 20 Oct 2008 15:16:29 +1100") References: <48FC061D.9090508@sgi.com> Message-ID: > Creates a sysfs entry for each MSI-X vector containing the IRQ value. This > patch applies to the mthca driver. Again, why? This needs some more explanation. - R. From rdreier at cisco.com Sun Oct 19 22:34:55 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 19 Oct 2008 22:34:55 -0700 Subject: [ofa-general] Re: [PATCH] ipoib: don't enable napi when it's already enabled In-Reply-To: <48FB32CF.6060202@Voltaire.COM> (Yossi Etigin's message of "Sun, 19 Oct 2008 15:14:55 +0200") References: <48FB32CF.6060202@Voltaire.COM> Message-ID: > - napi_enable(&priv->napi); > - set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); > + if (!test_and_set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) > + napi_enable(&priv->napi); > > if (ipoib_pkey_dev_delay_open(dev)) > return 0; Does it work just to move the napi_enable() to after the ipoib_pkey_dev_delay_open() test? - R. From vinnie at sgi.com Sun Oct 19 22:55:14 2008 From: vinnie at sgi.com (Vincent Rizza) Date: Mon, 20 Oct 2008 16:55:14 +1100 Subject: [ofa-general] [PATCH] libmlx4: Re-calculate number of inline segments In-Reply-To: References: <48FC0628.3010801@sgi.com> Message-ID: <48FC1D42.5060800@sgi.com> Hi, When we set qp_attr.cap.max_inline_data between 461 and 928 we're seeing ENOMEM. 
Shouldn't the maximum handled be 928? Roland Dreier wrote: > > Supplying an ibv_qp_cap.max_inline_data value of 460 for mlx4_create_qp > > was getting back ENOMEM when the max should have been 928. Tracked the bug > > to the inline segment calculation. Here's the fix. > > I don't see what's wrong with the current code, or why your change is > anything but a more obfuscated way of calculating the same thing. And > indeed, I just adapted a quick test program (below) which tries to > create QPs with max_inline_data of 460, and I get the results: > > RC: inline 460 ok (got 928) > UC: inline 460 ok (got 928) > UD: inline 460 ok (got 900) > > so it seems to work on my system. > > - R. > > Here's the test code: > > #include <stdio.h> > #include <string.h> > > #include <infiniband/verbs.h> > > int main(int argc, char *argv[]) > { > struct ibv_device **dev_list; > struct ibv_device_attr dev_attr; > struct ibv_context *context; > struct ibv_pd *pd; > struct ibv_cq *cq; > struct ibv_qp_init_attr qp_attr; > int t; > static const struct { > enum ibv_qp_type type; > char *name; > } type_tab[] = { > { IBV_QPT_RC, "RC" }, > { IBV_QPT_UC, "UC" }, > { IBV_QPT_UD, "UD" }, > }; > > dev_list = ibv_get_device_list(NULL); > if (!dev_list) { > printf("No IB devices found\n"); > return 1; > } > > for (; *dev_list; ++dev_list) { > printf("%s:\n", ibv_get_device_name(*dev_list)); > > context = ibv_open_device(*dev_list); > if (!context) { > printf(" ibv_open_device failed\n"); > continue; > } > > if (ibv_query_device(context, &dev_attr)) { > printf(" ibv_query_device failed\n"); > continue; > } > > cq = ibv_create_cq(context, 1, NULL, NULL, 0); > if (!cq) { > printf(" ibv_create_cq failed\n"); > continue; > } > > pd = ibv_alloc_pd(context); > if (!pd) { > printf(" ibv_alloc_pd failed\n"); > continue; > } > > for (t = 0; t < sizeof type_tab / sizeof type_tab[0]; ++t) { > memset(&qp_attr, 0, sizeof qp_attr); > > qp_attr.send_cq = cq; > qp_attr.recv_cq = cq; > qp_attr.cap.max_send_wr = 1; > qp_attr.cap.max_recv_wr = 1; >
qp_attr.cap.max_send_sge = 1; > qp_attr.cap.max_recv_sge = 1; > qp_attr.cap.max_inline_data = 460; > qp_attr.qp_type = type_tab[t].type; > > printf(" %s: inline %d ", type_tab[t].name, qp_attr.cap.max_inline_data); > > if (ibv_create_qp(pd, &qp_attr)) > printf("ok (got %d)\n", > qp_attr.cap.max_inline_data); > else > printf("FAILED\n"); > } > } > > return 0; > } From vinnie at sgi.com Sun Oct 19 22:55:34 2008 From: vinnie at sgi.com (Vincent Rizza) Date: Mon, 20 Oct 2008 16:55:34 +1100 Subject: [ofa-general] [PATCH] mthca: Create sysfs entries for MSI-X interrupts if enabled In-Reply-To: References: <48FC061D.9090508@sgi.com> Message-ID: <48FC1D56.2070505@sgi.com> We needed a way to link IRQs to IB cards. In /proc/interrupts it lists the IRQ and driver module name but when you put in more than one card you can't tell which interrupts belong to which card. Same thing for Connect-X. Roland Dreier wrote: > > Creates a sysfs entry for each MSI-X vector containing the IRQ value. This > > patch applies to the mthca driver. > > Again, why? This needs some more explanation. > > - R. From rdreier at cisco.com Sun Oct 19 23:01:58 2008 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 19 Oct 2008 23:01:58 -0700 Subject: [ofa-general] [PATCH] libmlx4: Re-calculate number of inline segments In-Reply-To: <48FC1D42.5060800@sgi.com> (Vincent Rizza's message of "Mon, 20 Oct 2008 16:55:14 +1100") References: <48FC0628.3010801@sgi.com> <48FC1D42.5060800@sgi.com> Message-ID: > When we set qp_attr.cap.max_inline_data between 461 and 928 we're seeing ENOMEM. > Shouldn't the maximum handled be 928? That's not what your original email said: > > Supplying an ibv_qp_cap.max_inline_data value of 460... But anyway, I just tried 461, 480, 511, 512, 513, 900, and 928 with the same test program and all work as expected (UD QPs have a limit of 900 bytes of inline data, so asking for 928 failed, but that is correct I believe). 

Can you give a specific QP type and data size where your version of
num_inline_segs() returns a different value from the existing one?
That might be the best way to explain the bug you've found.

 - R.

From vinnie at sgi.com Sun Oct 19 23:03:42 2008
From: vinnie at sgi.com (Vincent Rizza)
Date: Mon, 20 Oct 2008 17:03:42 +1100
Subject: [ofa-general] [PATCH] mlx4: Set sq_sig_type and qp_type in qp_init_attr struct
In-Reply-To:
References: <48FC0616.8060501@sgi.com>
Message-ID: <48FC1F3E.6000702@sgi.com>

The application sets the sq_sig_all field in the ibv_qp_init_attr structure
passed in to ibv_create_qp(). It then uses ibv_query_qp() to determine which
values have actually been applied by the OFED infrastructure. In this case it
wants to know that CQEs will only be generated for the requested WRs.
Otherwise the application may flag a warning to the user to indicate that
this performance option is not in effect for this QP.

Roland Dreier wrote:
>  > Set correct sq_sig_type and qp_type in qp_init_attr.
>
> Why? What cares about this? The qp_type seems especially silly, since
> it's already in struct ib_qp. And mthca doesn't set this info either.
>
>  - R.

From vinnie at sgi.com Sun Oct 19 23:10:15 2008
From: vinnie at sgi.com (Vincent Rizza)
Date: Mon, 20 Oct 2008 17:10:15 +1100
Subject: [ofa-general] [PATCH] libmlx4: Re-calculate number of inline segments
In-Reply-To:
References: <48FC0628.3010801@sgi.com> <48FC1D42.5060800@sgi.com>
Message-ID: <48FC20C7.3070808@sgi.com>

Yep, sorry, that original summary should have said 'greater than 460'.
We'll work up a test case to show the problem.

Roland Dreier wrote:
>  > When we set qp_attr.cap.max_inline_data between 461 and 928 we're seeing ENOMEM.
>  > Shouldn't the maximum handled be 928?
>
> That's not what your original email said:
>
>>> Supplying an ibv_qp_cap.max_inline_data value of 460...
>
> But anyway, I just tried 461, 480, 511, 512, 513, 900, and 928 with the
> same test program and all work as expected (UD QPs have a limit of 900
> bytes of inline data, so asking for 928 failed, but that is correct I
> believe).
>
> Can you give a specific QP type and data size where your version of
> num_inline_segs() returns a different value from the existing one?
> That might be the best way to explain the bug you've found.
>
>  - R.

From vlad at dev.mellanox.co.il Mon Oct 20 02:21:12 2008
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Mon, 20 Oct 2008 11:21:12 +0200
Subject: [ofa-general] OFED-1.4-rc3 is available
Message-ID: <48FC4D88.3040702@dev.mellanox.co.il>

Hi,

OFED-1.4-rc3 release is available on
http://www.openfabrics.org/downloads/OFED/ofed-1.4/OFED-1.4-rc3.tgz

To get BUILD_ID run ofed_info

Please report any issues in bugzilla https://bugs.openfabrics.org/ for OFED 1.4

Tziporet & Vladimir

========================================================================
Release information:
------------------------------
Linux Operating Systems:
- RedHat EL4 up4: 2.6.9-42.ELsmp *
- RedHat EL4 up5: 2.6.9-55.ELsmp
- RedHat EL4 up6: 2.6.9-67.ELsmp
- RedHat EL4 up7: 2.6.9-78.ELsmp
- RedHat EL5: 2.6.18-8.el5
- RedHat EL5 up1: 2.6.18-53.el5
- RedHat EL5 up2: 2.6.18-92.el5
- CentOS 5.2: 2.6.18-92.el5
- Fedora C9: 2.6.25-14.fc9 *
- SLES10: 2.6.16.21-0.8-smp
- SLES10 SP1: 2.6.16.46-0.12-smp
- SLES10 SP1 up1: 2.6.16.53-0.16-smp
- SLES10 SP2: 2.6.16.60-0.21-smp
- OpenSuSE 10.3: 2.6.22.5-31 *
- kernel.org: 2.6.26 and 2.6.27

* Minimal QA for these versions

Systems:
* x86_64
* x86
* ia64
* ppc64

Main Changes from OFED-1.4-rc2
=========================
- Kernel base updated to 2.6.27
- NFS-RDMA is NOT enabled by default. To enable it, one must choose it
  using custom installation or add it to the ofed.conf file.
- Updated MPI packages: mvapich-1.1.0-3064, mvapich2-trunk-3073, openmpi-1.2.8-1
- Updated bonding package: ib-bonding-0.9.0-31
- Updated uDAPL: compat-dapl-1.2.11-1, dapl-2.0.14-1
- NFS-RDMA to work on RHEL 5.1
- OSM: Cached routing
- Cleanup compilation warning
- 27 bugs fixed (see attached for details)
- Kernel git tree changes:

Tasks that should be completed for the rc4:
================================
1. High priority bug fixes

From vlad at lists.openfabrics.org Mon Oct 20 03:22:28 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon, 20 Oct 2008 03:22:28 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20081020-0200 daily build status
Message-ID: <20081020102228.728E4E609FC@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with
linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:

From eli at mellanox.co.il Mon Oct 20 07:12:55 2008
From: eli at mellanox.co.il (Eli Cohen)
Date: Mon, 20 Oct 2008 16:12:55 +0200
Subject: [ofa-general] [PATCH] ib_core: Use weak ordering for data registered memory
Message-ID: <20081020141255.GA23619@mtls03>

Some architectures support weak ordering, in which case better
performance is possible. IB registered memory used for data can be
weakly ordered because the completion queues' buffers are
registered as strongly ordered. This will result in flushing all data
related outstanding DMA requests by the HCA when a completion is DMAed
to a completion queue buffer.
Signed-off-by: Eli Cohen Signed-off-by: Arnd Bergmann --- drivers/infiniband/core/umem.c | 8 ++++++-- include/rdma/ib_umem.h | 2 ++ 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 6f7c096..6a1ff26 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -51,8 +51,8 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d int i; list_for_each_entry_safe(chunk, tmp, &umem->chunk_list, list) { - ib_dma_unmap_sg(dev, chunk->page_list, - chunk->nents, DMA_BIDIRECTIONAL); + ib_dma_unmap_sg_attrs(dev, chunk->page_list, + chunk->nents, DMA_BIDIRECTIONAL, &chunk->attrs); for (i = 0; i < chunk->nents; ++i) { struct page *page = sg_page(&chunk->page_list[i]); @@ -91,6 +91,9 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, if (dmasync) dma_set_attr(DMA_ATTR_WRITE_BARRIER, &attrs); + else + dma_set_attr(DMA_ATTR_WEAK_ORDERING, &attrs); + if (!can_do_mlock()) return ERR_PTR(-EPERM); @@ -155,6 +158,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, if (ret < 0) goto out; + chunk->attrs = attrs; cur_base += ret * PAGE_SIZE; npages -= ret; diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h index 9ee0d2e..90f3712 100644 --- a/include/rdma/ib_umem.h +++ b/include/rdma/ib_umem.h @@ -36,6 +36,7 @@ #include #include #include +#include struct ib_ucontext; @@ -56,6 +57,7 @@ struct ib_umem_chunk { struct list_head list; int nents; int nmap; + struct dma_attrs attrs; struct scatterlist page_list[0]; }; -- 1.6.0.2 From rdreier at cisco.com Mon Oct 20 07:43:04 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 20 Oct 2008 07:43:04 -0700 Subject: [ofa-general] Re: [PATCH] ib_core: Use weak ordering for data registered memory In-Reply-To: <20081020141255.GA23619@mtls03> (Eli Cohen's message of "Mon, 20 Oct 2008 16:12:55 +0200") References: <20081020141255.GA23619@mtls03> 
Message-ID:

 > Some architectures support weak ordering, in which case better
 > performance is possible. IB registered memory used for data can be
 > weakly ordered because the completion queues' buffers are
 > registered as strongly ordered. This will result in flushing all data
 > related outstanding DMA requests by the HCA when a completion is DMAed
 > to a completion queue buffer.

This would break the Mellanox HW's guarantee of writing the last byte of
an RDMA last, right? So on platforms where this has an effect (only
Cell at the moment) some applications could be subtly broken?

 - R.

From rdreier at cisco.com Mon Oct 20 08:50:30 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 20 Oct 2008 08:50:30 -0700
Subject: [ofa-general] [PATCH] mthca: Create sysfs entries for MSI-X interrupts if enabled
In-Reply-To: <48FC1D56.2070505@sgi.com> (Vincent Rizza's message of "Mon, 20 Oct 2008 16:55:34 +1100")
References: <48FC061D.9090508@sgi.com> <48FC1D56.2070505@sgi.com>
Message-ID:

 > We needed a way to link IRQs to IB cards. In /proc/interrupts it lists
 > the IRQ and driver module name but when you put in more than one card
 > you can't tell which interrupts belong to which card. Same thing for
 > Connect-X.

Rather than going driver-by-driver I think it would make much more sense
to add MSI/MSI-X information to the /sys/devices/pci/ directory in
generic code, to go along with the irq attribute that is already there.
Because if you care about mthca or mlx4, then you're also going to care
about your HBA or NIC, and it doesn't make sense to duplicate this code
over and over.

 - R.
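
[Editorial illustration, not part of the thread or of any posted patch.]
The existing per-device irq attribute mentioned above can already be read
from userspace. Below is a minimal sketch of doing so, assuming the usual
<sysfs root>/bus/pci/devices/<domain:bus:dev.fn>/irq layout. The helper
names read_irq_attr() and pci_device_irq() are hypothetical, and the
sketch only covers the single legacy irq value, not per-vector MSI-X
information, which is what the proposed patch was trying to expose.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read one decimal integer from a sysfs-style attribute file.
 * Returns the parsed value on success, or -1 if the file cannot be
 * opened or does not start with a number.
 */
int read_irq_attr(const char *path)
{
	FILE *f;
	int irq;

	assert(path != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;

	if (fscanf(f, "%d", &irq) != 1)
		irq = -1;

	fclose(f);
	return irq;
}

/*
 * Build "<sysfs_root>/bus/pci/devices/<bdf>/irq" and read it.
 * Both arguments are supplied by the caller; the path layout is an
 * assumption about the running system, not something this code checks.
 * Returns -1 if the path does not fit in the buffer or cannot be read.
 */
int pci_device_irq(const char *sysfs_root, const char *bdf)
{
	char path[256];
	int n;

	n = snprintf(path, sizeof path, "%s/bus/pci/devices/%s/irq",
		     sysfs_root, bdf);
	if (n < 0 || (size_t)n >= sizeof path)
		return -1;

	return read_irq_attr(path);
}
```

A caller would use something like pci_device_irq("/sys", "0000:03:00.0"),
where the bus address is only an example, and treat a return value of -1
as "attribute not available".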
From halr at obsidianresearch.com Mon Oct 20 13:30:16 2008 From: halr at obsidianresearch.com (Hal Rosenstock) Date: Mon, 20 Oct 2008 14:30:16 -0600 Subject: [ofa-general] [PATCH] ibdm/Fabric.cpp: Determine node type properly Message-ID: <48FCEA58.9050906@obsidianresearch.com> Oren, Attached is a patch to determine the node type properly rather than basing it on the number of ports (switches can have 2 ports too). -- Hal -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: patch-ibdiag-fabric2 URL: From arnd at arndb.de Mon Oct 20 14:41:06 2008 From: arnd at arndb.de (Arnd Bergmann) Date: Mon, 20 Oct 2008 23:41:06 +0200 Subject: [ofa-general] Re: [PATCH] ib_core: Use weak ordering for data registered memory In-Reply-To: References: <20081020141255.GA23619@mtls03> Message-ID: <200810202341.06856.arnd@arndb.de> On Monday 20 October 2008, Roland Dreier wrote: >  > Some architectures support weak ordering in which case better >  > performance is possible. IB registered memory used for data can be >  > weakly ordered becuase the the completion queues' buffers are >  > registered as strongly ordered. This will result in flushing all data >  > related outstanding DMA requests by the HCA when a completion is DMAed >  > to a completion queue buffer. > > This would break the Mellanox HW's guarantee of writing the last byte of > an RDMA last, right?  So on platforms where this has an effect (only > Cell at the moment) some applications could be subtly broken? Yes, that is true. In our testing with openmpi, we had to disable eager RDMA. However, without this patch RDMA infiniband performance on some of our machines sucks so bad that we would not want to advertise support for it and I would really love to see this patch make it into OFED-1.4 and Linux-2.6.28. 
We (IBM and Mellanox) have discussed adding a module parameter for whether or not this should be enabled at runtime, and possibly extending the ibverbs interface to allow the application to choose. The question there remains what the default should be. AFAIU, the IB specification does not give any such guarantees about the ordering within RDMA transfers, right? If that is so, applications relying on the ordering would be broken to start with. Also, the existing code evidently does not use DMA_ATTR_WRITE_BARRIER for the allocation where we would use DMA_ATTR_WEAK_ORDERING. Because of what I can only explain as lack of coordination, DMA_ATTR_WRITE_BARRIER seems to be an almost exact opposite of DMA_ATTR_WEAK_ORDERING, which means that on platforms that use the former (SGI Altix so far), not passing any DMA attribute would break these applications in exactly the same way that weak ordering on cell is breaking them. Arnd <>< From pradeeps at linux.vnet.ibm.com Mon Oct 20 16:31:38 2008 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Mon, 20 Oct 2008 16:31:38 -0700 Subject: [ofa-general] Bonding fail over not working Message-ID: <48FD14DA.6000007@linux.vnet.ibm.com> I downloaded a recent version of Roland's git tree and tried IPoIB bonding. Fail over does not seem to be working at all. I have tried OFED 1.3.2 on a Rhel5 derivative and that (fail over) worked as expected. Is this a known issue? Given that OFED 1.4 will be in sync with main line kernel, is this an issue to be addressed in OFED 1.4 too? Has any one else tried this out recently? My impression is that all bonding patches were already upstream. 
Pradeep From vlad at lists.openfabrics.org Tue Oct 21 03:19:43 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 21 Oct 2008 03:19:43 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20081021-0200 daily build status Message-ID: <20081021101944.0FD2DE60CF0@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with 
linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From mike.marty at gmail.com Tue Oct 21 10:03:59 2008 From: mike.marty at gmail.com (Mike Marty) Date: Tue, 21 Oct 2008 12:03:59 -0500 Subject: [ofa-general] ***SPAM*** libibverbs configuration directory and userspace device-specific driver Message-ID: <229af89c0810211003y508e4bb6h818f5f019ddefd05@mail.gmail.com> I am attempting to use libibverbs with a build from src (./configure && make install). However "ibv_devices" complains: libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0 I manually created the /usr/local/etc/libibverbs.d directory, but nothing exists in it, and I can't seem to find any documentation on creating files manually. My /sys/class/infiniband_verbs/uverbs0/ directory does indeed exist: root at leopard:/usr/src/rpm/SOURCES/libibverbs-1.1.2/src# ls -l /sys/class/infiniband_verbs/uverbs0/ total 0 -r--r--r-- 1 root root 4096 2008-10-20 16:09 abi_version -r--r--r-- 1 root root 4096 2008-10-20 16:19 dev lrwxrwxrwx 1 root root 0 2008-10-20 16:09 device -> ../../../devices/pci0000:00/0000:00:1c.0/0000:03:00.0 -r--r--r-- 1 root root 4096 2008-10-20 16:09 ibdev lrwxrwxrwx 1 root root 0 2008-10-20 16:19 subsystem -> ../../../class/infiniband_verbs --w------- 1 root root 4096 2008-10-20 16:19 uevent root at leopard:/usr/src/rpm/SOURCES/libibverbs-1.1.2/src# Any ideas? Thanks, Mike -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mike.marty at gmail.com Tue Oct 21 12:10:35 2008 From: mike.marty at gmail.com (Mike Marty) Date: Tue, 21 Oct 2008 14:10:35 -0500 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** libibverbs configuration directory and userspace device-specific driver In-Reply-To: <229af89c0810211003y508e4bb6h818f5f019ddefd05@mail.gmail.com> References: <229af89c0810211003y508e4bb6h818f5f019ddefd05@mail.gmail.com> Message-ID: <229af89c0810211210w2ad70310j22cf15df0ff30cc@mail.gmail.com> Disregard...found that ibv_register_driver wasn't called because my lib_mlx4 library was not installed right --Mike On Tue, Oct 21, 2008 at 12:03 PM, Mike Marty wrote: > > I am attempting to use libibverbs with a build from src (./configure && > make install). However "ibv_devices" complains: > > libibverbs: Warning: no userspace device-specific driver found for > /sys/class/infiniband_verbs/uverbs0 > > > I manually created the /usr/local/etc/libibverbs.d directory, but nothing > exists in it, and I can't seem to find any documentation on creating files > manually. > > My /sys/class/infiniband_verbs/uverbs0/ directory does indeed exist: > > root at leopard:/usr/src/rpm/SOURCES/libibverbs-1.1.2/src# ls -l > /sys/class/infiniband_verbs/uverbs0/ > total 0 > -r--r--r-- 1 root root 4096 2008-10-20 16:09 abi_version > -r--r--r-- 1 root root 4096 2008-10-20 16:19 dev > lrwxrwxrwx 1 root root 0 2008-10-20 16:09 device -> > ../../../devices/pci0000:00/0000:00:1c.0/0000:03:00.0 > -r--r--r-- 1 root root 4096 2008-10-20 16:09 ibdev > lrwxrwxrwx 1 root root 0 2008-10-20 16:19 subsystem -> > ../../../class/infiniband_verbs > --w------- 1 root root 4096 2008-10-20 16:19 uevent > root at leopard:/usr/src/rpm/SOURCES/libibverbs-1.1.2/src# > > > Any ideas? 
> > Thanks, > Mike > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From or.gerlitz at gmail.com Tue Oct 21 13:49:44 2008 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Tue, 21 Oct 2008 22:49:44 +0200 Subject: ***SPAM*** Re: [ofa-general] Bonding fail over not working In-Reply-To: <48FD14DA.6000007@linux.vnet.ibm.com> References: <48FD14DA.6000007@linux.vnet.ibm.com> Message-ID: <15ddcffd0810211349l4caa1329s32b0dbe0a1e38f60@mail.gmail.com> Pradeep Satyanarayana wrote: > I downloaded a recent version of Roland's git tree and tried IPoIB bonding. > Fail over does not seem to be working at all. I have tried OFED 1.3.2 on a > Rhel5 derivative and that (fail over) worked as expected. > > Is this a known issue? Given that OFED 1.4 will be in sync with main line > kernel, is this an issue to be addressed in OFED 1.4 too? Has any one else > tried this out recently? My impression is that all bonding patches were > already upstream. I just tried ipoib/bonding with mainline kernel 2.6.27 and it works fine, as expected, see below. Can you repeat the exact sequence and see if it works for you, or send the settings that break bonding/ipoib on your system? I didn't use network scripts but this should be the issue if you use the directives that come with the ib-bonding package. $ modprobe bonding mode=active-backup miimon=100 $ modprobe ib_ipoib $ echo +ib0 > /sys/class/net/bond0/bonding/slaves $ echo +ib1 > /sys/class/net/bond0/bonding/slaves $ ifconfig bond0 10.10.5.62/16 up $ ping 10.10.0.90 & $ ifconfig ib0 down $ dmesg | grep bonding bonding: MII link monitoring set to 100 ms bonding: bond0: doing slave updates when interface is down. bonding: bond0: Adding slave ib0. 
bonding bond0: master_dev is not up in bond_enslave bonding: bond0: Warning: enslaved VLAN challenged slave ib0. Adding VLANs will be blocked as long as ib0 is part of bond bond0 bonding: bond0: Warning: The first slave device specified does not support setting the MAC address. Setting fail_over_mac to active.<6>bonding: bond0: enslaving ib0 as a backup interface with a down link. bonding: bond0: doing slave updates when interface is down. bonding: bond0: Adding slave ib1. bonding bond0: master_dev is not up in bond_enslave bonding: bond0: Warning: enslaved VLAN challenged slave ib1. Adding VLANs will be blocked as long as ib1 is part of bond bond0 bonding: bond0: enslaving ib1 as a backup interface with a down link. bonding: bond0: link status definitely up for interface ib0. bonding: bond0: making interface ib0 the new active one. bonding: bond0: first active interface up! bonding: bond0: link status definitely up for interface ib1. bonding: bond0: link status definitely down for interface ib0, disabling it bonding: bond0: making interface ib1 the new active one. -------------- next part -------------- An HTML attachment was scrubbed... URL: From pradeeps at linux.vnet.ibm.com Tue Oct 21 15:13:08 2008 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Tue, 21 Oct 2008 15:13:08 -0700 Subject: [ofa-general] Bonding fail over not working In-Reply-To: <15ddcffd0810211349l4caa1329s32b0dbe0a1e38f60@mail.gmail.com> References: <48FD14DA.6000007@linux.vnet.ibm.com> <15ddcffd0810211349l4caa1329s32b0dbe0a1e38f60@mail.gmail.com> Message-ID: <48FE53F4.2010006@linux.vnet.ibm.com> Or Gerlitz wrote: > Pradeep Satyanarayana > wrote: > > I downloaded a recent version of Roland's git tree and tried IPoIB > bonding. Fail over does not seem to be working at all. I have tried > OFED 1.3.2 on a Rhel5 derivative and that (fail over) worked as > expected. > > Is this a known issue? 
Given that OFED 1.4 will be in sync with main > line kernel, is this an issue to be addressed in OFED 1.4 too? Has > any one else tried this out recently? My impression is that all > bonding patches were already upstream. > > > I just tried ipoib/bonding with mainline kernel 2.6.27 and it works > fine, as expected, see below. Can you repeat the exact sequence and see > if it works for you, or send the settings that break bonding/ipoib on > your system? I didn't use network scripts but this should be the issue > if you use the directives that come with the ib-bonding package. I just retested and it is indeed working as expected. My earlier conclusions were erroneous. Between my experiments some one must have stepped on the cable and when I checked the cabling in the lab the port was disconnected. No wonder, the fail over did not occur as expected. Sorry about the false report! Pradeep From or.gerlitz at gmail.com Tue Oct 21 15:17:13 2008 From: or.gerlitz at gmail.com (Or Gerlitz) Date: Wed, 22 Oct 2008 00:17:13 +0200 Subject: ***SPAM*** Re: [ofa-general] [PATCH v2] IB/ipoib: fix netdev offload features support for child (VLAN) devices In-Reply-To: References: Message-ID: <15ddcffd0810211517r1251640x94cfb81b55ad67b7@mail.gmail.com> On Thu, Oct 16, 2008 at 3:20 PM, Or Gerlitz wrote: > Child devices were created without any offload features set, fix this by > moving the code that computes the features into generic function which is > now called through non-child and child device creation. > > Signed-off-by: Or Gerlitz > > -- v1 has a bug where the 'result' flag in ipoib_vlan_add may be used > uninitialized Hi Roland, I'd like to have this patch gets into .28 and maybe even to -stable, as without it, child devices deliver much less TCP BW they can. What are the chances? Or. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From cameron at harr.org Tue Oct 21 16:57:56 2008 From: cameron at harr.org (Cameron Harr) Date: Tue, 21 Oct 2008 17:57:56 -0600 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48F79CF8.3010905@vlnb.net> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> Message-ID: <48FE6C84.7030300@harr.org> Vladislav Bolkhovitin wrote: > > I guess, you use a regular caching IO? The lowest packet size it can > produce is a PAGE_SIZE (4K). Target can't change it. You can have > lower packets only with O_DIRECT or sg interface. But I'm not sure it > will be performance effective. I do everything with Direct IO, which is automatic when using the BLOCKIO method in SCST. > > I'd recommend you to use 4K packets and deadline IO scheduler. I've tried this before, but can probably set it as my default config. -Cameron From sashak at voltaire.com Tue Oct 21 18:46:43 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 22 Oct 2008 03:46:43 +0200 Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash: fix extra memory allocations Message-ID: <20081022014643.GD20450@sashak.voltaire.com> Save memory allocations - dij_channels array cannot exceed number of switch's ports. Also save some cycles (and flow) in repeated and obsolete initializations. 
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_ucast_lash.c | 28 ++++++++++------------------ 1 files changed, 10 insertions(+), 18 deletions(-) diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index 2d4acf0..76ec9d1 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -666,7 +666,7 @@ static switch_t *switch_create(lash_t * p_lash, unsigned id, osm_switch_t * p_sw memset(sw, 0, sizeof(*sw)); sw->id = id; - sw->dij_channels = malloc((num_switches) * sizeof(int)); + sw->dij_channels = malloc(num_ports * sizeof(int)); if (!sw->dij_channels) { free(sw); return NULL; @@ -876,18 +876,10 @@ static int lash_core(lash_t * p_lash) int output_link2, i_next_switch2; int cycle_found2 = 0; int status = 0; - int *switch_bitmap = NULL; /* Bitmap to check if we have processed this pair */ + int *switch_bitmap; /* Bitmap to check if we have processed this pair */ OSM_LOG_ENTER(p_log); - switch_bitmap = - (int *)malloc(num_switches * num_switches * sizeof(int)); - if (!switch_bitmap) { - OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4D04: " - "Failed allocating switch_bitmap - out of memory\n"); - goto Exit; - } - for (i = 0; i < num_switches; i++) { shortest_path(p_lash, i); @@ -901,14 +893,19 @@ static int lash_core(lash_t * p_lash) } for (j = 0; j < num_switches; j++) { - for (k = 0; k < num_switches; k++) { - switch_bitmap[j * num_switches + k] = 0; - } switches[j]->used_channels = 0; switches[j]->q_state = UNQUEUED; } } + switch_bitmap = malloc(num_switches * num_switches * sizeof(int)); + if (!switch_bitmap) { + OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 4D04: " + "Failed allocating switch_bitmap - out of memory\n"); + goto Exit; + } + memset(switch_bitmap, 0, num_switches * num_switches * sizeof(int)); + for (i = 0; i < num_switches; i++) { for (dest_switch = 0; dest_switch < num_switches; dest_switch++) if (dest_switch != i && switch_bitmap[i * num_switches + dest_switch] == 0) { @@ -999,11 +996,6 @@ static int lash_core(lash_t 
* p_lash) switch_bitmap[i * num_switches + dest_switch] = 1; switch_bitmap[dest_switch * num_switches + i] = 1; } - - for (j = 0; j < num_switches; j++) { - switches[j]->used_channels = 0; - switches[j]->q_state = UNQUEUED; - } } OSM_LOG(p_log, OSM_LOG_INFO, -- 1.6.0.1.196.g01914 From sashak at voltaire.com Tue Oct 21 18:52:20 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 22 Oct 2008 03:52:20 +0200 Subject: [ofa-general] [PATCH] opensm/osm_ucast_lash: simplify get_phys_connection() prototype Message-ID: <20081022015220.GE20450@sashak.voltaire.com> Symplify get_phys_connection() prototype and usage. Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_ucast_lash.c | 27 ++++++++++++--------------- 1 files changed, 12 insertions(+), 15 deletions(-) diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index 76ec9d1..1036c9f 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -282,13 +282,12 @@ inline static void dequeue(cl_list_t * bfsq, switch_t ** sw) (*sw)->q_state = MST_MEMBER; } -static int get_phys_connection(switch_t ** switches, int switch_from, - int switch_to) +static int get_phys_connection(switch_t *sw, int switch_to) { unsigned int i = 0; - for (i = 0; i < switches[switch_from]->num_connections; i++) - if (switches[switch_from]->phys_connections[i] == switch_to) + for (i = 0; i < sw->num_connections; i++) + if (sw->phys_connections[i] == switch_to) return i; return i; } @@ -318,28 +317,26 @@ static void shortest_path(lash_t * p_lash, int ir) cl_list_destroy(&bfsq); } -static void generate_routing_func_for_mst(lash_t * p_lash, int sw, +static void generate_routing_func_for_mst(lash_t * p_lash, int sw_id, reachable_dest_t ** destinations) { int i, next_switch; - switch_t **switches = p_lash->switches; - int num_channels = switches[sw]->used_channels; + switch_t *sw = p_lash->switches[sw_id]; + int num_channels = sw->used_channels; reachable_dest_t *dest, *i_dest, *concat_dest = NULL, 
*prev; for (i = 0; i < num_channels; i++) { - next_switch = switches[sw]->dij_channels[i]; + next_switch = sw->dij_channels[i]; generate_routing_func_for_mst(p_lash, next_switch, &dest); i_dest = dest; prev = i_dest; while (i_dest != NULL) { - if (switches[sw]->routing_table[i_dest->switch_id]. - out_link == NONE) { - switches[sw]->routing_table[i_dest->switch_id]. - out_link = - get_phys_connection(switches, sw, - next_switch); + if (sw->routing_table[i_dest->switch_id].out_link == + NONE) { + sw->routing_table[i_dest->switch_id].out_link = + get_phys_connection(sw, next_switch); } prev = i_dest; @@ -352,7 +349,7 @@ static void generate_routing_func_for_mst(lash_t * p_lash, int sw, } i_dest = (reachable_dest_t *) malloc(sizeof(reachable_dest_t)); - i_dest->switch_id = sw; + i_dest->switch_id = sw->id; i_dest->next = concat_dest; *destinations = i_dest; } -- 1.6.0.1.196.g01914 From wangwhao at cn.ibm.com Tue Oct 21 20:13:20 2008 From: wangwhao at cn.ibm.com (Wen Hao Wang) Date: Wed, 22 Oct 2008 11:13:20 +0800 Subject: [ofa-general] ibsysstat cpu output is incomplete In-Reply-To: <20080927043307.GW16914@sashak.voltaire.com> Message-ID: > On 08:55 Sat 27 Sep , Wen Hao Wang wrote: >> >> I opened bug 1237 for this issue. > > Yes, I saw already. > >> If the incomplete output is related to >> the packet size limitation, I think we may solve the problem by two >> solutions: >> 1. Increase the limitation to one large value (not sure whether this is >> feasible) > > We cannot do it following class 0x33 playload size limitation. > >> 2. Transfer multiple packets until all the information is got and printed > > Using RMPP (as allowed for classes in that range). > > Sasha Hi Sasha: Can ibsysstat use RMPP by default to give complete message? Wen Hao Wang Email: wangwhao at cn.ibm.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From nicolas.morey-chaisemartin at ext.bull.net  Tue Oct 21 23:10:28 2008
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Wed, 22 Oct 2008 08:10:28 +0200
Subject: [ofa-general] [PATCH] perftest : Corrected ibv_poll_cq calls
Message-ID: <48FEC3D4.6070604@ext.bull.net>

Hi,

While reading the code to understand the perftests, I noticed that in
two of them the loop on ibv_poll_cq only exits when the call reads a
CQE, but not in the case of an error (though the code checks for an
error return value right after).

(My apologies if the patch isn't in the right format. First time using git.)

Nicolas Morey-Chaisemartin
______

diff --git a/read_lat.c b/read_lat.c
index 8119f57..bf3f5fd 100755
--- a/read_lat.c
+++ b/read_lat.c
@@ -706,7 +706,7 @@ int run_iter(struct pingpong_context *ctx, struct user_parameters *user_param,
 		}
 		do {
 			ne = ibv_poll_cq(ctx->cq, 1, &wc);
-		} while (!user_param->use_event && ne < 1);
+		} while (!user_param->use_event && ne ==0);
 
 		if (ne < 0) {
 			fprintf(stderr, "poll CQ failed %d\n", ne);
diff --git a/send_lat.c b/send_lat.c
index b2796d6..f69e9a8 100755
--- a/send_lat.c
+++ b/send_lat.c
@@ -887,7 +887,7 @@ int run_iter(struct pingpong_context *ctx, struct user_parameters *user_param,
 			}
 			do {
 				ne = ibv_poll_cq(ctx->rcq, 1, &wc);
-			} while (!user_param->use_event && ne < 1);
+			} while (!user_param->use_event && ne == 0);
 
 			if (ne < 0) {
 				fprintf(stderr, "Poll Recieve CQ failed %d\n", ne);
@@ -950,7 +950,7 @@ int run_iter(struct pingpong_context *ctx, struct user_parameters *user_param,
 		/* poll on scq */
 		do {
 			ne = ibv_poll_cq(ctx->scq, 1, &wc);
-		} while (!user_param->use_event && ne < 1);
+		} while (!user_param->use_event && ne == 0);
 
 		if (ne < 0) {
 			fprintf(stderr, "poll SCQ failed %d\n", ne);

From ogerlitz at voltaire.com  Wed Oct 22 01:43:43 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 22 Oct 2008 10:43:43 +0200
Subject: [ofa-general] Re: [ewg] Update from September OpenFabrics Interoperability Event at UNH-IOL
In-Reply-To: <48FE05F0.8070608@iol.unh.edu>
References: <48FB2C81.3080301@mellanox.co.il> <4D511C95BE7F4D8E92B2BAE9E0D67AA0@annapurna> <48FE05F0.8070608@iol.unh.edu>
Message-ID: <48FEE7BF.2020604@voltaire.com>

Bob Noseworthy wrote:
> A bug for the observed IPoIB issue was logged last Friday, and
> updated yesterday confirming that RC3 still demonstrates the issue.
> This is logged as https://bugs.openfabrics.org/show_bug.cgi?id=1287
>
> Further issues/observations from the recent OFA Interoperability Logo
> Group's September Interoperability Event are at the end of this
> email. Summary of reported IPoIB issue: If IPoIB datagram mode is
> enabled, and IP frames of 8K or larger are sent, and no ARP entry
> exists for the destination, then the first IP frame is always lost
> (ping used), no matter what the timeout is set to (as high as 15s)

Looking at the code, the issue you report seems to be related to the
length of the internal queue that ipoib uses to hold skbs whose
neighbour does not yet have an IB Address-Handle (the L2 info needed
for xmit) associated with it:

> drivers/infiniband/ulp/ipoib/ipoib.h: IPOIB_MAX_PATH_REC_QUEUE = 3,
> drivers/infiniband/ulp/ipoib/ipoib_main.c: if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE)
> drivers/infiniband/ulp/ipoib/ipoib_main.c: skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
> drivers/infiniband/ulp/ipoib/ipoib_main.c: if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE)

The current code keeps up to three skbs and then drops all the ones
that follow until a reply to the driver's path query is received from
the SA. Unless I am missing something, this code has been there from
day one (Q4/2005). Are you saying that this issue was not observed
with older code drops? I am cc-ing Roland, the maintainer of the
driver, so you can check things with him.

> The following is a short summary of various updates from the September
> OpenFabrics Interoperability Event.
Due to confidentiality reasons, > many details are occluded. Per the request of the IWG on Oct 14, this > information is being shared with the EWG. > > Testing is ongoing with RC3 and future 1.4RCs on a best effort basis > until the GA, at which time the Logo Event will be held for those > participating. If you have additional questions about these > comments, the Interoperability Events, Logo Events, or the OFA > Interoperability Test Plan, please feel free to contact us here at UNH-IOL May I ask whose decision was it to test the Linux kernel RDMA stack in its "ofed" flavor and what was the reasonings behind it? the main-line kernel IB/iWARP code is well maintained and has an associated small supporting developer community. The ofed kernel bits contain code which was not accepted yet to the upstream kernel so you are actually testing not the product delivered by the ofa maintainers but rather a different creature, are you aware to that? Or. From vst at vlnb.net Wed Oct 22 00:45:42 2008 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Wed, 22 Oct 2008 11:45:42 +0400 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48FE6C84.7030300@harr.org> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> Message-ID: <48FEDA26.4080304@vlnb.net> Cameron Harr wrote: > Vladislav Bolkhovitin wrote: >> I guess, you use a regular caching IO? The lowest packet size it can >> produce is a PAGE_SIZE (4K). Target can't change it. 
You can have >> lower packets only with O_DIRECT or sg interface. But I'm not sure it >> will be performance effective. > I do everything with Direct IO, which is automatic when using the > BLOCKIO method in SCST. I meant on initiator(s), not on the target. >> I'd recommend you to use 4K packets and deadline IO scheduler. > I've tried this before, but can probably set it as my default config. > > -Cameron > From vlad at lists.openfabrics.org Wed Oct 22 03:19:40 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 22 Oct 2008 03:19:40 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20081022-0200 daily build status Message-ID: <20081022101940.EE0BCE601C7@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 
with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From sashak at voltaire.com Wed Oct 22 03:48:51 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 22 Oct 2008 12:48:51 +0200 Subject: [ofa-general] ibsysstat cpu output is incomplete In-Reply-To: References: <20080927043307.GW16914@sashak.voltaire.com> Message-ID: <20081022104851.GJ20450@sashak.voltaire.com> On 11:13 Wed 22 Oct , Wen Hao Wang wrote: > > Can ibsysstat use RMPP by default to give complete message? Yes, I thought about this - by default or when message size is bigger than regular payload size. Sasha From kliteyn at dev.mellanox.co.il Wed Oct 22 05:56:28 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 22 Oct 2008 14:56:28 +0200 Subject: [ofa-general] opensm as service - cfg files Message-ID: <48FF22FC.6000606@dev.mellanox.co.il> Hi Sasha, I was just trying to put some order in my head regarding the use of opensm as service, and I have couple of questions. Some of them might be dumb, so please bear with me... :) 1. OpenSM config file. Do we still need opensm/scripts/opensm.conf? I think it's not used any more. 2. From opensm/scripts/opensm.init.in: @sbindir@/opensm -B $OPTIONS > /dev/null Is someone setting the $OPTIONS variable? I think it was set in the config file in the past, but not now. 3. 
   From opensm/scripts/redhat-opensm.init.in:
     CONFIG=@sysconfdir@/sysconfig/opensm.conf
     if [ -f $CONFIG ]; then
         . $CONFIG
     fi

   From opensm/scripts/opensm.init.in:
     if [[ -s /etc/sysconfig/opensm ]]; then
         . /etc/sysconfig/opensm
     fi

   If it's not some naming convention, perhaps we should use
   opensm.conf in both cases?

4. Logrotate:
   opensm/scripts/opensm.spec.in installs logrotate file as follows:
     install -m 644 scripts/opensm.logrotate $etc/logrotate.d/opensm
   I may be off here, but should the installed file name be opensmd
   to match the service name?

-- Yevgeny

From sashak at voltaire.com  Wed Oct 22 06:10:18 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 22 Oct 2008 15:10:18 +0200
Subject: [ofa-general] Re: opensm as service - cfg files
In-Reply-To: <48FF22FC.6000606@dev.mellanox.co.il>
References: <48FF22FC.6000606@dev.mellanox.co.il>
Message-ID: <20081022131018.GN20450@sashak.voltaire.com>

Hi Yevgeny,

On 14:56 Wed 22 Oct , Yevgeny Kliteynik wrote:
>
> 1. OpenSM config file.
> Do we still need opensm/scripts/opensm.conf?
> I think it's not used any more.

Correct, not used and not installed. It can actually be removed.

> 2. From opensm/scripts/opensm.init.in:
> @sbindir@/opensm -B $OPTIONS > /dev/null
> Is someone setting the $OPTIONS variable?

I have no idea. I hope nobody is.

> I think it was
> set in the config file in the past, but not now.

In this script there is still '. /etc/sysconfig/opensm', so
hypothetically somebody can define $OPTIONS there even now.

> 3. From opensm/scripts/redhat-opensm.init.in:
> CONFIG=@sysconfdir@/sysconfig/opensm.conf
> if [ -f $CONFIG ]; then
> . $CONFIG
> fi
>
> From opensm/scripts/opensm.init.in:
> if [[ -s /etc/sysconfig/opensm ]]; then
> . /etc/sysconfig/opensm
> fi
>
> If it's not some naming convention, perhaps we should use
> opensm.conf in both cases?

I guess it is OFED legacy. I don't have a clear idea why it was done
this way. Basically, I have nothing against unifying or even removing
this stuff completely.

> 4.
Logrotate: > opensm/scripts/opensm.spec.in installs logrotate file as follows: > install -m 644 scripts/opensm.logrotate $etc/logrotate.d/opensm > I may be off here, but should the installed file name be opensmd > to match the service name? I think rather the service name should be renamed to 'opensm' and not 'opensmd'. "d" at end is pure OFED convention, most distros are not using this. Sasha From eli at dev.mellanox.co.il Wed Oct 22 06:38:39 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 22 Oct 2008 15:38:39 +0200 Subject: [ofa-general] Re: [PATCH] ib_core: Use weak ordering for data registered memory In-Reply-To: References: <20081020141255.GA23619@mtls03> Message-ID: <20081022133838.GA1294@mtls03> On Mon, Oct 20, 2008 at 07:43:04AM -0700, Roland Dreier wrote: > > Some architectures support weak ordering in which case better > > performance is possible. IB registered memory used for data can be > > weakly ordered becuase the the completion queues' buffers are > > registered as strongly ordered. This will result in flushing all data > > related outstanding DMA requests by the HCA when a completion is DMAed > > to a completion queue buffer. > > This would break the Mellanox HW's guarantee of writing the last byte of > an RDMA last, right? So on platforms where this has an effect (only > Cell at the moment) some applications could be subtly broken? > In theory it would break Mellanox's guarantee for strict ordering on data, but in practice it will not since the only architecture that supports weak ordering is CELL. As Arnd suggested in his response email, here is the patch with a module parameter which by default will not configure weak ordering for data. Anyone wishing to benefit from weak ordering will have to set the module parameter accordingly. 
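
To make the intended behaviour concrete, here is a minimal user-space sketch of the selection logic described above. It is not the kernel code: the enum values and the function name are hypothetical stand-ins invented for illustration, and the assumption that "dmasync" marks the strongly ordered case (such as completion queue buffers) comes only from the description in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for DMA attribute flags (illustration only). */
enum sketch_dma_attr {
	SKETCH_ATTR_NONE = 0,
	SKETCH_ATTR_WRITE_BARRIER = 1 << 0,
	SKETCH_ATTR_WEAK_ORDERING = 1 << 1
};

/*
 * Pick the attributes for one memory registration.
 *
 * dmasync:             the caller asked for strongly ordered memory
 *                      (assumed here to cover completion queue buffers)
 * allow_weak_ordering: value of the module parameter, off by default
 *
 * A strongly ordered request always wins; weak ordering is applied to
 * data buffers only when the parameter is set.
 */
unsigned int sketch_pick_dma_attrs(bool dmasync, bool allow_weak_ordering)
{
	unsigned int attrs = SKETCH_ATTR_NONE;

	if (dmasync)
		attrs |= SKETCH_ATTR_WRITE_BARRIER;
	else if (allow_weak_ordering)
		attrs |= SKETCH_ATTR_WEAK_ORDERING;

	return attrs;
}
```

With the parameter left at its default, data registrations get no extra attribute, so existing behaviour should be unchanged; only a non-dmasync registration with the parameter set ends up weakly ordered.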
>From 2c1e0f4d8138c1fbd675e7ada4384f59269acb1f Mon Sep 17 00:00:00 2001
From: Eli Cohen
Date: Mon, 20 Oct 2008 15:52:22 +0200
Subject: [PATCH] ib_core: Use weak ordering for data registered memory

Some architectures support weak ordering in which case better
performance is possible. IB registered memory used for data can be
weakly ordered because the completion queues' buffers are
registered as strongly ordered. This will result in flushing all data
related outstanding DMA requests by the HCA when a completion is DMAed
to a completion queue buffer.

This patch will allow weak ordering for data if ib_core is loaded with
the module parameter, allow_weak_ordering, set to a non-zero value.

Signed-off-by: Eli Cohen
Signed-off-by: Arnd Bergmann
---
 drivers/infiniband/core/umem.c |   12 ++++++++++--
 include/rdma/ib_umem.h         |    2 ++
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 6f7c096..d21853d 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -40,6 +40,10 @@
 
 #include "uverbs.h"
 
+static int allow_weak_ordering;
+module_param(allow_weak_ordering, bool, 0444);
+MODULE_PARM_DESC(allow_weak_ordering, "Allow weak ordering for data registered memory");
+
 #define IB_UMEM_MAX_PAGE_CHUNK						\
 	((PAGE_SIZE - offsetof(struct ib_umem_chunk, page_list)) /	\
 	 ((void *) &((struct ib_umem_chunk *) 0)->page_list[1] -	\
@@ -51,8 +55,8 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
 	int i;
 
 	list_for_each_entry_safe(chunk, tmp, &umem->chunk_list, list) {
-		ib_dma_unmap_sg(dev, chunk->page_list,
-				chunk->nents, DMA_BIDIRECTIONAL);
+		ib_dma_unmap_sg_attrs(dev, chunk->page_list,
+				chunk->nents, DMA_BIDIRECTIONAL, &chunk->attrs);
 		for (i = 0; i < chunk->nents; ++i) {
 			struct page *page = sg_page(&chunk->page_list[i]);
 
@@ -91,6 +95,9 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	if (dmasync)
dma_set_attr(DMA_ATTR_WRITE_BARRIER, &attrs); + else if (allow_weak_ordering) + dma_set_attr(DMA_ATTR_WEAK_ORDERING, &attrs); + if (!can_do_mlock()) return ERR_PTR(-EPERM); @@ -155,6 +162,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, if (ret < 0) goto out; + chunk->attrs = attrs; cur_base += ret * PAGE_SIZE; npages -= ret; diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h index 9ee0d2e..90f3712 100644 --- a/include/rdma/ib_umem.h +++ b/include/rdma/ib_umem.h @@ -36,6 +36,7 @@ #include #include #include +#include struct ib_ucontext; @@ -56,6 +57,7 @@ struct ib_umem_chunk { struct list_head list; int nents; int nmap; + struct dma_attrs attrs; struct scatterlist page_list[0]; }; -- 1.6.0.2 From cameron at harr.org Wed Oct 22 06:39:38 2008 From: cameron at harr.org (Cameron Harr) Date: Wed, 22 Oct 2008 07:39:38 -0600 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48FEDA26.4080304@vlnb.net> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> Message-ID: <48FF2D1A.8000101@harr.org> Vladislav Bolkhovitin wrote: > Cameron Harr wrote: >> Vladislav Bolkhovitin wrote: >>> I guess, you use a regular caching IO? The lowest packet size it can >>> produce is a PAGE_SIZE (4K). Target can't change it. You can have >>> lower packets only with O_DIRECT or sg interface. But I'm not sure >>> it will be performance effective. 
>> I do everything with Direct IO, which is automatic when using the >> BLOCKIO method in SCST. > > I meant on initiator(s), not on the target. > Sorry - but yes, I always run the benchmark apps with direct IO From kliteyn at dev.mellanox.co.il Wed Oct 22 07:41:12 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 22 Oct 2008 16:41:12 +0200 Subject: [ofa-general] [PATCH] opensm/scripts/opensm.conf: remove obsolete config file Message-ID: <48FF3B88.7000201@dev.mellanox.co.il> This conf file is not used any more and should be removed Signed-off-by: Yevgeny Kliteynik --- opensm/scripts/opensm.conf | 137 -------------------------------------------- 1 files changed, 0 insertions(+), 137 deletions(-) delete mode 100644 opensm/scripts/opensm.conf diff --git a/opensm/scripts/opensm.conf b/opensm/scripts/opensm.conf deleted file mode 100644 index 502eea2..0000000 --- a/opensm/scripts/opensm.conf +++ /dev/null @@ -1,137 +0,0 @@ -# DEBUG mode -# This option specifies a debug option. -# These options are not normally needed. -# The number following -d selects the debug -# option to enable as follows: -# OPT Description -# --- ----------------- -# 0 - Ignore other SM nodes. -# 1 - Force single threaded dispatching. -# 2 - Force log flushing after each log message. -# 3 - Disable multicast support. -# 4 - Put OpenSM in memory tracking mode. -# 10.. Put OpenSM in testability mode. -# none, no debug options are enabled. -DEBUG=none - -# LMC -# This option specifies the subnet's LMC value. -# The number of LIDs assigned to each port is 2^LMC. -# The LMC value must be in the range 0-7. -# LMC values > 0 allow multiple paths between ports. -# LMC values > 0 should only be used if the subnet -# topology actually provides multiple paths between -# ports, i.e. multiple interconnects between switches. -# OpenSM defaults to LMC = 0, which allows -# one path between any two ports. 
-LMC=0 - -# MAXSMPS -# This option specifies the number of VL15 SMP MADs -# allowed on the wire at any one time. -# Specifying -maxsmps 0 allows unlimited outstanding SMPs. -# Without -maxsmps, OpenSM defaults to a maximum of -# four outstanding SMP. -MAXSMPS=4 - -# REASSIGN_LIDS -# This option causes OpenSM to reassign LIDs to all -# end nodes. Specifying "REASSIGN_LIDS=yes" on a running subnet -# may disrupt subnet traffic. -# With "REASSIGN_LIDS=no", OpenSM attempts to preserve existing -# LID assignments resolving multiple use of same LID. -REASSIGN_LIDS="no" - -# SWEEP -# This option specifies the number of seconds between -# subnet sweeps. Specifying SWEEP=0 disables sweeping. -# OpenSM defaults to a sweep interval of 10 seconds. -SWEEP=10 - -# TIMEOUT -# This option specifies the time in milliseconds -# used for transaction timeouts. -# Specifying -t 0 disables timeouts. -# Without -t, OpenSM defaults to a timeout value of -# 200 milliseconds. -TIMEOUT=200 - -# OSM_LOG -# This option defines the log to be the given file. -# By default the log goes to /var/log/opensm.log -# For the log to go to standard output use OSM_LOG=stdout. -OSM_LOG=/var/log/opensm.log - -# VERBOSE -# This option increases the log verbosity level. -# The "-v" option may be specified multiple times -# to further increase the verbosity level. -# "-V" option sets the maximum verbosity level and -# forces log flushing. -# The "-V" is equivalent to "-vf 0xFF -d 2". -VERBOSE="none" - -# UPDN -# This option activate UPDN algorithm instead of Min Hop -# algorithm (default). -# To switch on UPDN algorithm set UPDN="on" -UPDN="off" - -# GUID_FILE -# This option only allowed when UPDN algorithm is activated -# It specifies the guid list file from which to fetch the guid list -# The file contain in each line only one valid guid -GUID_FILE="none" - -# This option specifies the local port GUID value -# with which OpenSM should bind. OpenSM may be -# bound to 1 port at a time. 
-# If GUID given is 0, opensmd use PORT_NUM parameter. -# Without -g (GUID="none"), OpenSM trys to use the default port. -GUID=0 - -# OSM_HOSTS -# The list of all SM's IP addresses in InfiniBand subnet -# Used to handover mechanism -OSM_HOSTS="" - -# OSM_CACHE_DIR -OSM_CACHE_DIR=/var/cache/opensm - -# CACHE_OPTIONS -# Cache the given command line options into the file -# OSM_CACHE_DIR/opensm.opts for use next invocation -# Set to '--cache-options' or '-c' in order to enable -CACHE_OPTIONS="none" - -# HONORE_GUID2LID -# This option forces OpenSM to honor the guid2lid file, -# when it comes out of Standby state, if such file exists -# under OSM_CACHE_DIR, and is valid. -# Set to '--honor_guid2lid' or '-x' to enable. -# By default this is FALSE. Will be set automatically to '--honor_guid2lid' -# if OSM_HOSTS includes list of more then one IP addresses. -HONORE_GUID2LID="none" - -# RCP -# This option osed by SLDD daemon for handover mechanism -# to copy local cache file to remote computer -RCP=/usr/bin/scp - -# RSH -# This option osed by SLDD daemon for handover mechanism -# to execute commands on remote computer -RSH=/usr/bin/ssh - -# RESCAN_TIME -# This option osed by SLDD daemon for handover mechanism -# Time between sweep of sldd daemon in seconds -RESCAN_TIME=60 - -# PORT_NUM -# This option defines HCA's port number which OpenSM should bind -PORT_NUM=1 - -# ONBOOT -# To start OpenSM automatically set ONBOOT=yes -ONBOOT=no -- 1.5.1.4 From arnd at arndb.de Wed Oct 22 07:41:16 2008 From: arnd at arndb.de (Arnd Bergmann) Date: Wed, 22 Oct 2008 16:41:16 +0200 Subject: [ofa-general] Re: [PATCH] ib_core: Use weak ordering for data registered memory In-Reply-To: <20081022133838.GA1294@mtls03> References: <20081020141255.GA23619@mtls03> <20081022133838.GA1294@mtls03> Message-ID: <200810221641.17172.arnd@arndb.de> On Wednesday 22 October 2008, Eli Cohen wrote: > In theory it would break Mellanox's guarantee for strict ordering on > data, but in practice it will not since 
the only architecture that > supports weak ordering is CELL. As Arnd suggested in his response > email, here is the patch with a module parameter which by default will > not configure weak ordering for data. Anyone wishing to benefit from > weak ordering will have to set the module parameter accordingly. > Thank you for following up with this patch, it looks good. As a minor detail, I think we should make this module parameter writable in order to allow switching the behaviour without reloading the infiniband drivers. so instead of module_param(allow_weak_ordering, bool, 0444); I think it would be better to use module_param(allow_weak_ordering, bool, 0644); As mentioned before, I would personally also prefer to make this attribute '1' by default instead of zero, but I trust your judgement if you think the default should be '0'. Arnd <>< From kliteyn at dev.mellanox.co.il Wed Oct 22 07:45:19 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 22 Oct 2008 16:45:19 +0200 Subject: [ofa-general] [PATCH] opensm/opensm/Makefile.am: allow 'make dist' from non-source directory Message-ID: <48FF3C7F.7000702@dev.mellanox.co.il> Hi Sasha, gen_chlog.sh expects to be in the source location where it sees .git directory, so when compiling from some other directory, 'make dist' fails. Signed-off-by: Yevgeny Kliteynik --- opensm/Makefile.am | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/opensm/Makefile.am b/opensm/Makefile.am index 0fb1363..f8b66b3 100644 --- a/opensm/Makefile.am +++ b/opensm/Makefile.am @@ -27,5 +27,5 @@ EXTRA_DIST = autogen.sh opensm.spec $(various_scripts) $(man_MANS) $(docs) dist-hook: $(EXTRA_DIST) if [ -x $(top_srcdir)/../gen_chlog.sh ] ; then \ - $(top_srcdir)/../gen_chlog.sh $(PACKAGE) > $(distdir)/ChangeLog ; \ + cd $(top_srcdir)/.. 
; ./gen_chlog.sh $(PACKAGE) > $(distdir)/ChangeLog ; cd - ; \
 	fi
-- 
1.5.1.4

From kliteyn at dev.mellanox.co.il  Wed Oct 22 07:57:52 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 22 Oct 2008 16:57:52 +0200
Subject: [ofa-general] [PATCH] opensm/scripts: handling opensm config file
Message-ID: <48FF3F70.8000905@dev.mellanox.co.il>

Hi Sasha,

Following my previous questions, how about the following patch:

Use similar opensm config file name for redhat and suse distros,
and pass this config file to opensm with the "--config" option.

Signed-off-by: Yevgeny Kliteynik
---
 opensm/scripts/opensm.init.in        |    6 +++---
 opensm/scripts/redhat-opensm.init.in |    4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/opensm/scripts/opensm.init.in b/opensm/scripts/opensm.init.in
index c31f017..af0fd19 100644
--- a/opensm/scripts/opensm.init.in
+++ b/opensm/scripts/opensm.init.in
@@ -53,13 +53,13 @@ if [[ -s /etc/rc.status ]]; then
     failure() { rc_status -v; }
     success() { rc_status -v; }
 fi
-if [[ -s /etc/sysconfig/opensm ]]; then
-    . /etc/sysconfig/opensm
+if [ -s /etc/sysconfig/opensm.conf ]; then
+    OPTIONS="--config /etc/sysconfig/opensm.conf"
 fi
 
 start () {
     echo -n "Starting opensm: "
-    @sbindir@/opensm -B $OPTIONS > /dev/null
+    @sbindir@/opensm --daemon $OPTIONS > /dev/null
     if [[ $RETVAL -eq 0 ]]; then
         touch /var/lock/subsys/opensm
         success
diff --git a/opensm/scripts/redhat-opensm.init.in b/opensm/scripts/redhat-opensm.init.in
index aad783b..a5755ef 100755
--- a/opensm/scripts/redhat-opensm.init.in
+++ b/opensm/scripts/redhat-opensm.init.in
@@ -49,7 +49,7 @@ exec_prefix=@exec_prefix@
 
 CONFIG=@sysconfdir@/sysconfig/opensm.conf
 if [ -f $CONFIG ]; then
-    .
$CONFIG + OPTIONS="--config ${CONFIG}" fi prog=@sbindir@/opensm @@ -147,7 +147,7 @@ start() # Start opensm echo -n "Starting IB Subnet Manager" - $prog --daemon ${HONORE_GUID2LID} > /dev/null + $prog --daemon ${HONORE_GUID2LID} ${OPTIONS} > /dev/null cnt=0; alive=0 while [ $cnt -lt 6 -a $alive -ne 1 ]; do echo -n "."; -- 1.5.1.4 From sashak at voltaire.com Wed Oct 22 08:22:46 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 22 Oct 2008 17:22:46 +0200 Subject: [ofa-general] Re: [PATCH] opensm/scripts: handling opensm config file In-Reply-To: <48FF3F70.8000905@dev.mellanox.co.il> References: <48FF3F70.8000905@dev.mellanox.co.il> Message-ID: <20081022152246.GU20450@sashak.voltaire.com> On 16:57 Wed 22 Oct , Yevgeny Kliteynik wrote: > Hi Sasha, > > Following my previous questions, how about the following patch: > > Use similar opensm config file name for redhat and suse distros, > and pass this config file to opensm wit the "--config" option. > > Signed-off-by: Yevgeny Kliteynik > --- > opensm/scripts/opensm.init.in | 6 +++--- > opensm/scripts/redhat-opensm.init.in | 4 ++-- > 2 files changed, 5 insertions(+), 5 deletions(-) > > diff --git a/opensm/scripts/opensm.init.in b/opensm/scripts/opensm.init.in > index c31f017..af0fd19 100644 > --- a/opensm/scripts/opensm.init.in > +++ b/opensm/scripts/opensm.init.in > @@ -53,13 +53,13 @@ if [[ -s /etc/rc.status ]]; then > failure() { rc_status -v; } > success() { rc_status -v; } > fi > -if [[ -s /etc/sysconfig/opensm ]]; then > - . /etc/sysconfig/opensm > +if [ -s /etc/sysconfig/opensm.conf ]; then > + OPTIONS="--config /etc/sysconfig/opensm.conf" Why we should specify OpenSM config file explicitly? It has the default location (/etc/opensm/opensm.conf or something). It is not the same as /etc/sysconfig/blahblah script where $OPTIONS environment variable could be defined. Again, I'm fine with removing all this /etc/sysconfig/* stuff completely if nobody uses this. 
Sasha > fi > > start () { > echo -n "Starting opensm: " > - @sbindir@/opensm -B $OPTIONS > /dev/null > + @sbindir@/opensm --daemon $OPTIONS > /dev/null > if [[ $RETVAL -eq 0 ]]; then > touch /var/lock/subsys/opensm > success > diff --git a/opensm/scripts/redhat-opensm.init.in b/opensm/scripts/redhat-opensm.init.in > index aad783b..a5755ef 100755 > --- a/opensm/scripts/redhat-opensm.init.in > +++ b/opensm/scripts/redhat-opensm.init.in > @@ -49,7 +49,7 @@ exec_prefix=@exec_prefix@ > > CONFIG=@sysconfdir@/sysconfig/opensm.conf > if [ -f $CONFIG ]; then > - . $CONFIG > + OPTIONS="--config ${CONFIG}" > fi > > prog=@sbindir@/opensm > @@ -147,7 +147,7 @@ start() > > # Start opensm > echo -n "Starting IB Subnet Manager" > - $prog --daemon ${HONORE_GUID2LID} > /dev/null > + $prog --daemon ${HONORE_GUID2LID} ${OPTIONS} > /dev/null > cnt=0; alive=0 > while [ $cnt -lt 6 -a $alive -ne 1 ]; do > echo -n "."; > -- > 1.5.1.4 > From sashak at voltaire.com Wed Oct 22 08:27:54 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 22 Oct 2008 17:27:54 +0200 Subject: [ofa-general] Re: [PATCH] opensm/scripts/opensm.conf: remove obsolete config file In-Reply-To: <48FF3B88.7000201@dev.mellanox.co.il> References: <48FF3B88.7000201@dev.mellanox.co.il> Message-ID: <20081022152754.GV20450@sashak.voltaire.com> On 16:41 Wed 22 Oct , Yevgeny Kliteynik wrote: > This conf file is not used any more and should be removed > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. 
Sasha From sashak at voltaire.com Wed Oct 22 08:28:22 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 22 Oct 2008 17:28:22 +0200 Subject: [ofa-general] Re: [PATCH] opensm/opensm/Makefile.am: allow 'make dist' from non-source directory In-Reply-To: <48FF3C7F.7000702@dev.mellanox.co.il> References: <48FF3C7F.7000702@dev.mellanox.co.il> Message-ID: <20081022152822.GW20450@sashak.voltaire.com> On 16:45 Wed 22 Oct , Yevgeny Kliteynik wrote: > Hi Sasha, > > gen_chlog.sh expects to be in the source location where > it sees .git directory, so when compiling from some other > directory, 'make dist' fails. > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. Sasha From kliteyn at dev.mellanox.co.il Wed Oct 22 08:27:22 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 22 Oct 2008 17:27:22 +0200 Subject: [ofa-general] [PATCH 1/2] ofed/docs/README.txt: fixing white space mess Message-ID: <48FF465A.4080708@dev.mellanox.co.il> Hi Tziporet, Fixing some white space mess in README.txt: removed trailing blanks, fixed mixed usage of tabs and spaces, removed empty lines at the end of the file. Also, the date is updated to 'October 2008'. Please apply OFED 1.4 docs. Signed-off-by: Yevgeny Kliteynik --- README.txt | 49 ++++++++++++++++++++++++------------------------- 1 files changed, 24 insertions(+), 25 deletions(-) diff --git a/README.txt b/README.txt index c6bafd0..779732a 100644 --- a/README.txt +++ b/README.txt @@ -1,8 +1,8 @@ - Open Fabrics Enterprise Distribution (OFED) + Open Fabrics Enterprise Distribution (OFED) Version 1.4 - README - - July 2008 + README + + October 2008 This is the OpenFabrics Enterprise Distribution (OFED) version 1.4 software package supporting InfiniBand and iWARP fabrics. It is composed @@ -10,7 +10,7 @@ of several software modules intended for use on a computer cluster constructed as an InfiniBand subnet or an iWARP network. 
*** Note: If you plan to upgrade OFED on your cluster, please upgrade all - its nodes to this new version. + its nodes to this new version. This document includes the following sections: @@ -18,7 +18,7 @@ This document includes the following sections: 2. OFED Package Contents 3. Installing OFED Software 4. Starting and Verifying the IB Fabric -5. MPI (Message Passing Interface) +5. MPI (Message Passing Interface) 6. Related Documentation OpenFabrics Home Page: http://www.openfabrics.org @@ -85,10 +85,10 @@ Note: The installer will warn you if you attempt to compile any of the The OFED Distribution package generates RPMs for installing the following: o OpenFabrics core and ULPs - - HCA drivers (mthca, mlx4, ipath, ehca) + - HCA drivers (mthca, mlx4, ipath, ehca) - iWARP driver (cxgb3, nes) - core - - Upper Layer Protocols: IPoIB, SDP, SRP Initiator and target, iSER + - Upper Layer Protocols: IPoIB, SDP, SRP Initiator and target, iSER Initiator and target, RDS, uDAPL, qlgc_vnic and NFS-RDMA. o OpenFabrics utilities - OpenSM: InfiniBand Subnet Manager @@ -120,11 +120,11 @@ Install Quick Guide: OFED_Installation_Guide.txt under OFED-1.4/docs. -Notes: -1. The install script removes previously installed IB packages and +Notes: +1. The install script removes previously installed IB packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages. However, configuration files (.conf) will be - preserved and saved with a ".rpmsave" extension. + preserved and saved with a ".rpmsave" extension. 2. After the installer completes, information about the OFED installation such as the prefix, the kernel version, and @@ -144,13 +144,13 @@ Notes: 2) Check that the IB driver is running on all nodes: ibv_devinfo should print "hca_id: " on the first line. - + 3) Make sure that a Subnet Manager is running by invoking the sminfo utility. 
- If an SM is not running, sminfo prints: + If an SM is not running, sminfo prints: sminfo: iberror: query failed If an SM is running, sminfo prints the LID and other SM node information. Example: - sminfo: sm lid 0x1 sm guid 0x2c9010b7c2ae1, activity count 20 priority 1 + sminfo: sm lid 0x1 sm guid 0x2c9010b7c2ae1, activity count 20 priority 1 To check if OpenSM is running on the management node, enter: /etc/init.d/opensm status To start OpenSM, enter: /etc/init.d/opensm start @@ -161,21 +161,21 @@ Notes: report a "PORT_ACTIVE" state. 5) Check the network connectivity status: run ibchecknet to see if the subnet - is "clean" and ready for ULP/application use. The following tools display - more information in addition to IB info: ibnetdiscover, ibhosts, and - ibswitches. + is "clean" and ready for ULP/application use. The following tools display + more information in addition to IB info: ibnetdiscover, ibhosts, and + ibswitches. 6) Alternatively, instead of running steps 3 to 5 you can use the ibdiagnet utility to perform a set of tests on your network. Upon finding an error, ibdiagnet will print a message starting with a "-E-". For a more complete report of the network features you should run ibdiagnet -r. If you have a topology file describing your network you can feed this file to ibdiagnet - (using the option: -t ) and all reports will use the names they + (using the option: -t ) and all reports will use the names they appear in the file (instead of LIDs, GUIDs and directed routes). 7) To run an application over SDP set the following variables: - env LD_PRELOAD='stack_prefix'/lib/libsdp.so - LIBSDP_CONFIG_FILE='stack_prefix'/etc/libsdp.conf + env LD_PRELOAD='stack_prefix'/lib/libsdp.so + LIBSDP_CONFIG_FILE='stack_prefix'/etc/libsdp.conf (or LD_PRELOAD='stack_prefix'/lib64/libsdp.so on 64 bit machines) The default 'stack_prefix' is /usr @@ -185,7 +185,7 @@ Notes: In Step 2 of the main menu of install.pl, options 2, 3 and 4 can install one or more MPI stacks. 
Multiple MPI stacks can be installed -simultaneously -- they will not conflict with each other. +simultaneously -- they will not conflict with each other. Three MPI stacks are included in this release of OFED: - MVAPICH 1.1.0 @@ -202,13 +202,12 @@ the tests. 6. Related Documentation ======================== -1) Release Notes for OFED Distribution components are to be found under - OFED-1.4/docs and, after the package installation, under +1) Release Notes for OFED Distribution components are to be found under + OFED-1.4/docs and, after the package installation, under /usr/share/doc/ofed-docs-1.4 for RedHat /usr/share/doc/packages/ofed-docs-1.4 for SuSE. 2) For a detailed installation guide, see OFED_Installation_Guide.txt. 3) For more information, please visit the OFED web-page http://www.openfabrics.org -For more information contact your vendor. - +For more information contact your vendor. -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Wed Oct 22 08:30:29 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 22 Oct 2008 17:30:29 +0200 Subject: [ofa-general] [PATCH 2/2] ofed/docs/README.txt: fixing opensm daemon and cfg file names Message-ID: <48FF4715.7020804@dev.mellanox.co.il> Tziporet, Fixing opensm daemon name from 'opensm' to 'opensmd' and configuration file from 'opensm' to 'opensm.conf' Please apply OFED 1.4 docs. 
Signed-off-by: Yevgeny Kliteynik --- README.txt | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/README.txt b/README.txt index 779732a..ca53918 100644 --- a/README.txt +++ b/README.txt @@ -152,10 +152,10 @@ Notes: Example: sminfo: sm lid 0x1 sm guid 0x2c9010b7c2ae1, activity count 20 priority 1 - To check if OpenSM is running on the management node, enter: /etc/init.d/opensm status - To start OpenSM, enter: /etc/init.d/opensm start + To check if OpenSM is running on the management node, enter: /etc/init.d/opensmd status + To start OpenSM, enter: /etc/init.d/opensmd start - Note: OpenSM parameters can be set via the file /etc/sysconfig/opensm. + Note: OpenSM parameters can be set via the file /etc/sysconfig/opensm.conf 4) Verify the status of ports by using ibv_devinfo: all connected ports should report a "PORT_ACTIVE" state. -- 1.5.1.4 From sashak at voltaire.com Wed Oct 22 08:36:06 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 22 Oct 2008 17:36:06 +0200 Subject: [ofa-general] [PATCH 2/2] ofed/docs/README.txt: fixing opensm daemon and cfg file names In-Reply-To: <48FF4715.7020804@dev.mellanox.co.il> References: <48FF4715.7020804@dev.mellanox.co.il> Message-ID: <20081022153606.GX20450@sashak.voltaire.com> On 17:30 Wed 22 Oct , Yevgeny Kliteynik wrote: > Tziporet, > > Fixing opensm daemon name from 'opensm' to 'opensmd' > and configuration file from 'opensm' to 'opensm.conf' > > Please apply OFED 1.4 docs. 
> > Signed-off-by: Yevgeny Kliteynik > --- > README.txt | 6 +++--- > 1 files changed, 3 insertions(+), 3 deletions(-) > > diff --git a/README.txt b/README.txt > index 779732a..ca53918 100644 > --- a/README.txt > +++ b/README.txt > @@ -152,10 +152,10 @@ Notes: > Example: > sminfo: sm lid 0x1 sm guid 0x2c9010b7c2ae1, activity count 20 priority 1 > > - To check if OpenSM is running on the management node, enter: /etc/init.d/opensm status > - To start OpenSM, enter: /etc/init.d/opensm start > + To check if OpenSM is running on the management node, enter: /etc/init.d/opensmd status > + To start OpenSM, enter: /etc/init.d/opensmd start > > - Note: OpenSM parameters can be set via the file /etc/sysconfig/opensm. > + Note: OpenSM parameters can be set via the file /etc/sysconfig/opensm.conf I think it is better to mention the real OpenSM config file - '/etc/opensm/opensm.conf' as configured in OFED instead of an obsolete wrapper script's "config". Sasha From tziporet at mellanox.co.il Wed Oct 22 08:36:41 2008 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Wed, 22 Oct 2008 17:36:41 +0200 Subject: [ofa-general] OFED meeting agenda for today (Oct 22) Message-ID: <5D49E7A8952DC44FB38C38FA0D758EADC30A86@mtlexch01.mtl.com> Agenda for OFED meeting today on OFED 1.4 status: 1. OFED 1.4 status: - RC3 was done on Monday Oct 20 - main changes: - Kernel base updated to 2.6.27 - NFS-RDMA is NOT enabled by default. To enable it one must chose it using custom installation, or add it to ofed.conf file. - Updated MPI packages: mvapich-1.1.0-3064, mvapich2-trunk-3073, openmpi-1.2.8-1 - Updated bonding package: ib-bonding-0.9.0-31 - Updated uDAPL: compat-dapl-1.2.11-1, dapl-2.0.14-1 - NFS-RDMA to work on RHEL 5.1 - OSM: Cashed routing - Decide on RC4 date 2. Bugs review: 1283 blo jeremy.brown at qlogic.com Intel MPI fails on Qlogc HCA 1242 cri yannick.cote at qlogic.com kernel panic while running mpi2007 against ofed1.4 -- ib_... 
1257 cri eli at mellanox.co.il Severe performance penalty for PCIe strict ordering 1262 cri andy.grover at oracle.com congestion hang with RDS 1282 cri amirv at mellanox.co.il Kernel panic during Netperf run 1164 maj yosefe at voltaire.com iperf over IPoIB fails for 100 tcp connections 1221 maj Jeffrey.C.Becker at nasa.gov SLES10 sp2: remote logins via ssh fail due to rpcbind and... 1284 maj monis at voltaire.com Bonding - when eth bonding and IB bonding are configuerd,... 3. Review OFED BOF slides - Woody to lead Tziporet From kliteyn at dev.mellanox.co.il Wed Oct 22 08:57:04 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 22 Oct 2008 17:57:04 +0200 Subject: [ofa-general] [PATCH 2/2] ofed/docs/README.txt: fixing opensm daemon and cfg file names In-Reply-To: <20081022153606.GX20450@sashak.voltaire.com> References: <48FF4715.7020804@dev.mellanox.co.il> <20081022153606.GX20450@sashak.voltaire.com> Message-ID: <48FF4D50.8010205@dev.mellanox.co.il> Sasha Khapyorsky wrote: > On 17:30 Wed 22 Oct , Yevgeny Kliteynik wrote: >> Tziporet, >> >> Fixing opensm daemon name from 'opensm' to 'opensmd' >> and configuration file from 'opensm' to 'opensm.conf' >> >> Please apply OFED 1.4 docs. >> >> Signed-off-by: Yevgeny Kliteynik >> --- >> README.txt | 6 +++--- >> 1 files changed, 3 insertions(+), 3 deletions(-) >> >> diff --git a/README.txt b/README.txt >> index 779732a..ca53918 100644 >> --- a/README.txt >> +++ b/README.txt >> @@ -152,10 +152,10 @@ Notes: >> Example: >> sminfo: sm lid 0x1 sm guid 0x2c9010b7c2ae1, activity count 20 priority 1 >> >> - To check if OpenSM is running on the management node, enter: /etc/init.d/opensm status >> - To start OpenSM, enter: /etc/init.d/opensm start >> + To check if OpenSM is running on the management node, enter: /etc/init.d/opensmd status >> + To start OpenSM, enter: /etc/init.d/opensmd start >> >> - Note: OpenSM parameters can be set via the file /etc/sysconfig/opensm. 
>> + Note: OpenSM parameters can be set via the file /etc/sysconfig/opensm.conf > > I think it is better to mention the real OpenSM config file - > '/etc/opensm/opensm.conf' as configured in OFED instead of an obsolete > wrapper script's "config". I think you're right. Patch v2 shortly. -- Yevgeny > Sasha > From kliteyn at dev.mellanox.co.il Wed Oct 22 09:04:01 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 22 Oct 2008 18:04:01 +0200 Subject: [ofa-general] [PATCH 2/2 V2] ofed/docs/README.txt: fixing opensm daemon and cfg file names Message-ID: <48FF4EF1.1020502@dev.mellanox.co.il> Tziporet, Fixing opensm daemon name from 'opensm' to 'opensmd' and configuration file from old '/etc/sysconfig/opensm' to new '/etc/opensm/opensm.conf'. Please apply OFED 1.4 docs. Signed-off-by: Yevgeny Kliteynik --- README.txt | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/README.txt b/README.txt index 779732a..a7849b5 100644 --- a/README.txt +++ b/README.txt @@ -152,10 +152,10 @@ Notes: Example: sminfo: sm lid 0x1 sm guid 0x2c9010b7c2ae1, activity count 20 priority 1 - To check if OpenSM is running on the management node, enter: /etc/init.d/opensm status - To start OpenSM, enter: /etc/init.d/opensm start + To check if OpenSM is running on the management node, enter: /etc/init.d/opensmd status + To start OpenSM, enter: /etc/init.d/opensmd start - Note: OpenSM parameters can be set via the file /etc/sysconfig/opensm. + Note: OpenSM parameters can be set via the file /etc/opensm/opensm.conf 4) Verify the status of ports by using ibv_devinfo: all connected ports should report a "PORT_ACTIVE" state. 
-- 1.5.1.4 From kliteyn at dev.mellanox.co.il Wed Oct 22 09:16:46 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 22 Oct 2008 18:16:46 +0200 Subject: [ofa-general] Re: [PATCH] opensm/scripts: handling opensm config file In-Reply-To: <20081022152246.GU20450@sashak.voltaire.com> References: <48FF3F70.8000905@dev.mellanox.co.il> <20081022152246.GU20450@sashak.voltaire.com> Message-ID: <48FF51EE.4020804@dev.mellanox.co.il> Sasha, Sasha Khapyorsky wrote: >> -if [[ -s /etc/sysconfig/opensm ]]; then >> - . /etc/sysconfig/opensm >> +if [ -s /etc/sysconfig/opensm.conf ]; then >> + OPTIONS="--config /etc/sysconfig/opensm.conf" > > Why we should specify OpenSM config file explicitly? It has the default > location (/etc/opensm/opensm.conf or something). OK, agree > It is not the same as /etc/sysconfig/blahblah script where $OPTIONS > environment variable could be defined. > > Again, I'm fine with removing all this /etc/sysconfig/* stuff completely > if nobody uses this. I'd like to remove it too. I know that people do use it, but this usage should be replaced by /etc/opensm/opensm.conf anyway. I really don't think that it's a good idea to have opensm configured by several different configuration files. So how do we know whether or not can this stuff be removed? -- Yevgeny > Sasha > From sashak at voltaire.com Wed Oct 22 09:24:16 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 22 Oct 2008 18:24:16 +0200 Subject: [ofa-general] Re: [PATCH] opensm/scripts: handling opensm config file In-Reply-To: <48FF51EE.4020804@dev.mellanox.co.il> References: <48FF3F70.8000905@dev.mellanox.co.il> <20081022152246.GU20450@sashak.voltaire.com> <48FF51EE.4020804@dev.mellanox.co.il> Message-ID: <20081022162416.GY20450@sashak.voltaire.com> On 18:16 Wed 22 Oct , Yevgeny Kliteynik wrote: > > I'd like to remove it too. I know that people do use it, but > this usage should be replaced by /etc/opensm/opensm.conf anyway. 
> I really don't think that it's a good idea to have opensm > configured by several different configuration files. > > So how do we know whether or not can this stuff be removed? As usual - let's publish the patch on the list. If there will no motivated complains I will apply it. Sasha From chu11 at llnl.gov Wed Oct 22 09:54:29 2008 From: chu11 at llnl.gov (Al Chu) Date: Wed, 22 Oct 2008 09:54:29 -0700 Subject: [ofa-general] Re: [PATCH] opensm/scripts: handling opensm config file In-Reply-To: <20081022162416.GY20450@sashak.voltaire.com> References: <48FF3F70.8000905@dev.mellanox.co.il> <20081022152246.GU20450@sashak.voltaire.com> <48FF51EE.4020804@dev.mellanox.co.il> <20081022162416.GY20450@sashak.voltaire.com> Message-ID: <1224694469.1197.367.camel@cardanus.llnl.gov> On Wed, 2008-10-22 at 18:24 +0200, Sasha Khapyorsky wrote: > On 18:16 Wed 22 Oct , Yevgeny Kliteynik wrote: > > > > I'd like to remove it too. I know that people do use it, but > > this usage should be replaced by /etc/opensm/opensm.conf anyway. You cannot specify an alternate config file (-F/--config) in /etc/opensm/opensm.conf??? :-) I think this is the most often used reason for the syconfig stuff. > > I really don't think that it's a good idea to have opensm > > configured by several different configuration files. > > > > So how do we know whether or not can this stuff be removed? > > As usual - let's publish the patch on the list. If there will no > motivated complains I will apply it. You may wish to ping Redhat/Suse for what they think (b/c if it doesn't meet their requirements, they will just add it back), but I don't think it should be removed. 
Al

> Sasha
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http:// openib.org/mailman/listinfo/openib-general
>
--
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

From vst at vlnb.net Wed Oct 22 10:13:38 2008
From: vst at vlnb.net (Vladislav Bolkhovitin)
Date: Wed, 22 Oct 2008 21:13:38 +0400
Subject: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <48FF2D1A.8000101@harr.org>
References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org>
 <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org>
 <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org>
 <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org>
 <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net>
 <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org>
 <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com>
 <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org>
 <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org>
 <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org>
Message-ID: <48FF5F42.2050902@vlnb.net>

Cameron Harr wrote:
> Vladislav Bolkhovitin wrote:
>> Cameron Harr wrote:
>>> Vladislav Bolkhovitin wrote:
>>>> I guess, you use a regular caching IO? The lowest packet size it can
>>>> produce is a PAGE_SIZE (4K). Target can't change it. You can have
>>>> lower packets only with O_DIRECT or sg interface. But I'm not sure
>>>> it will be performance effective.
>>> I do everything with Direct IO, which is automatic when using the
>>> BLOCKIO method in SCST.
>> I meant on initiator(s), not on the target.
>>
> Sorry - but yes, I always run the benchmark apps with direct IO

Then, there's one more reason why we should find out the cause of such
a big variation between runs.
Can you repeat all the tests with the latest SCST SVN trunk/ including
SRPT driver with each run for at least few minutes?

Which backstorage do you use for BLOCKIO?

From sashak at voltaire.com Wed Oct 22 10:18:07 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 22 Oct 2008 19:18:07 +0200
Subject: [ofa-general] Re: [PATCH] opensm/scripts: handling opensm config file
In-Reply-To: <1224694469.1197.367.camel@cardanus.llnl.gov>
References: <48FF3F70.8000905@dev.mellanox.co.il>
 <20081022152246.GU20450@sashak.voltaire.com>
 <48FF51EE.4020804@dev.mellanox.co.il>
 <20081022162416.GY20450@sashak.voltaire.com>
 <1224694469.1197.367.camel@cardanus.llnl.gov>
Message-ID: <20081022171807.GB20450@sashak.voltaire.com>

On 09:54 Wed 22 Oct , Al Chu wrote:
> On Wed, 2008-10-22 at 18:24 +0200, Sasha Khapyorsky wrote:
> > On 18:16 Wed 22 Oct , Yevgeny Kliteynik wrote:
> > >
> > > I'd like to remove it too. I know that people do use it, but
> > > this usage should be replaced by /etc/opensm/opensm.conf anyway.
>
> You cannot specify an alternate config file (-F/--config)
> in /etc/opensm/opensm.conf??? :-)

Sure, but we clearly shouldn't replace the default unconditionally :)

> You may wish to ping Redhat/Suse for what they think (b/c if it doesn't
> meet their requirements, they will just add it back),

That would be fine - we will learn better then.

> but I don't think
> it should be removed.

Do you use it (/etc/sysconfig/opensm*)?
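For reference, the sysconfig pattern debated in this thread can be sketched as follows. This is a minimal illustration under stated assumptions, not the shipped init script: it assumes the optional file only sets OPTIONS (for example to pass -F/--config with an alternate path), and the SYSCONFIG and OPENSM variables and the build_opensm_cmd helper are hypothetical names introduced here. When the file is absent, OPTIONS stays empty and no --config is passed, so the default config location is not overridden.

```shell
#!/bin/sh
# Sketch only: SYSCONFIG, OPENSM and build_opensm_cmd are illustrative
# names, not taken from the real opensm init scripts.
SYSCONFIG=${SYSCONFIG:-/etc/sysconfig/opensm}
OPENSM=${OPENSM:-/usr/sbin/opensm}

# Print the command line an init script following this pattern would run.
build_opensm_cmd() {
    OPTIONS=""
    # Source the optional file only if it exists and is non-empty; it is
    # assumed to set something like OPTIONS="--config /some/alternate.conf".
    if [ -s "$SYSCONFIG" ]; then
        . "$SYSCONFIG"
    fi
    # With OPTIONS empty nothing is appended, so opensm is left to use
    # its own default configuration file.
    echo "$OPENSM --daemon${OPTIONS:+ $OPTIONS}"
}
```

Calling build_opensm_cmd with a sysconfig file that contains OPTIONS="--config /etc/opensm/alt.conf" (a hypothetical path) prints the daemon command line with that option appended; without the file it prints only the binary and --daemon.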
Sasha

From cameron at harr.org Wed Oct 22 10:20:19 2008
From: cameron at harr.org (Cameron Harr)
Date: Wed, 22 Oct 2008 11:20:19 -0600
Subject: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <48FF5F42.2050902@vlnb.net>
References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org>
 <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org>
 <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org>
 <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org>
 <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net>
 <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org>
 <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com>
 <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org>
 <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org>
 <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org>
 <48FF5F42.2050902@vlnb.net>
Message-ID: <48FF60D3.9020809@harr.org>

Vladislav Bolkhovitin wrote:
> Cameron Harr wrote:
>> Vladislav Bolkhovitin wrote:
>>> Cameron Harr wrote:
>>>> Vladislav Bolkhovitin wrote:
>>>>> I guess, you use a regular caching IO? The lowest packet size it
>>>>> can produce is a PAGE_SIZE (4K). Target can't change it. You can
>>>>> have lower packets only with O_DIRECT or sg interface. But I'm not
>>>>> sure it will be performance effective.
>>>> I do everything with Direct IO, which is automatic when using the
>>>> BLOCKIO method in SCST.
>>> I meant on initiator(s), not on the target.
>>>
>> Sorry - but yes, I always run the benchmark apps with direct IO
>
> Then, there's one more reason why we should find out the cause of such
> a big variation between runs. Can you repeat all the tests with the
> latest SCST SVN trunk/ including SRPT driver with each run for at
> least few minutes?

From a little testing, the updated SCST tree doesn't work with the
OFED-1.3.1 SRP stack, though I have gotten it working with the
infiniband drivers in the normal distribution kernel. Shall I use those
modules?
Also, as I mentioned before, my time is going to be fairly limited for a
while, but I'll try to squeeze this in and will make sure I run for
longer periods of time. I'll also try to calculate an exact end number
based on iop/runtime.

>
> Which backstorage do you use for BLOCKIO?

Fusion IO ioDrive. I generally use 1 or 2, but sometimes up to 4 at a time.

From chu11 at llnl.gov Wed Oct 22 10:23:33 2008
From: chu11 at llnl.gov (Al Chu)
Date: Wed, 22 Oct 2008 10:23:33 -0700
Subject: [ofa-general] Re: [PATCH] opensm/scripts: handling opensm config file
In-Reply-To: <20081022171807.GB20450@sashak.voltaire.com>
References: <48FF3F70.8000905@dev.mellanox.co.il>
 <20081022152246.GU20450@sashak.voltaire.com>
 <48FF51EE.4020804@dev.mellanox.co.il>
 <20081022162416.GY20450@sashak.voltaire.com>
 <1224694469.1197.367.camel@cardanus.llnl.gov>
 <20081022171807.GB20450@sashak.voltaire.com>
Message-ID: <1224696213.1197.371.camel@cardanus.llnl.gov>

On Wed, 2008-10-22 at 19:18 +0200, Sasha Khapyorsky wrote:
> On 09:54 Wed 22 Oct , Al Chu wrote:
> > On Wed, 2008-10-22 at 18:24 +0200, Sasha Khapyorsky wrote:
> > > On 18:16 Wed 22 Oct , Yevgeny Kliteynik wrote:
> > > >
> > > > I'd like to remove it too. I know that people do use it, but
> > > > this usage should be replaced by /etc/opensm/opensm.conf anyway.
> >
> > You cannot specify an alternate config file (-F/--config)
> > in /etc/opensm/opensm.conf??? :-)
>
> Sure, but we clearly shouldn't replace the default unconditionally :)

Absolutely. My point being that if you wanted to specify an alternate
config path (w/o editing the init script), I think specify OPTIONS in
the sysconfig is the only way??

> > You may wish to ping Redhat/Suse for what they think (b/c if it doesn't
> > meet their requirements, they will just add it back),
>
> That would be fine - we will learn better then.
>
> > but I don't think
> > it should be removed.
>
> Do you use it (/etc/sysconfig/opensm*)?

We don't personally.
But I do know that the system administrators do use sysconfig
for a variety of other daemons. So I think keeping the sysconfig OPTIONS
variable available as an option is important.

Al

> Sasha
--
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

From chu11 at llnl.gov Wed Oct 22 10:27:52 2008
From: chu11 at llnl.gov (Al Chu)
Date: Wed, 22 Oct 2008 10:27:52 -0700
Subject: [ofa-general] Re: opensm as service - cfg files
In-Reply-To: <20081022131018.GN20450@sashak.voltaire.com>
References: <48FF22FC.6000606@dev.mellanox.co.il>
 <20081022131018.GN20450@sashak.voltaire.com>
Message-ID: <1224696472.1197.376.camel@cardanus.llnl.gov>

Hey Sasha, Yevgeny,

> > 4. Logrotate:
> > opensm/scripts/opensm.spec.in installs logrotate file as follows:
> > install -m 644 scripts/opensm.logrotate $etc/logrotate.d/opensm
> > I may be off here, but should the installed file name be opensmd
> > to match the service name?
>
> I think rather the service name should be renamed to 'opensm' and not
> 'opensmd'. "d" at end is pure OFED convention, most distros are not using
> this.

I decided to look this up in rpmlint to see what they say ...

'''Your logrotate file should be named /etc/logrotate.d/.''',

So I think keeping it 'opensm' is a good idea.
Al

> Sasha
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http:// openib.org/mailman/listinfo/openib-general
>
--
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

From bart.vanassche at gmail.com Wed Oct 22 10:34:21 2008
From: bart.vanassche at gmail.com (Bart Van Assche)
Date: Wed, 22 Oct 2008 19:34:21 +0200
Subject: ***SPAM*** Re: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <48FF60D3.9020809@harr.org>
References: <48E386F6.5040502@fusionio.com> <48EBE6B6.4060804@mellanox.com>
 <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org>
 <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org>
 <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org>
 <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org>
Message-ID:

On Wed, Oct 22, 2008 at 7:20 PM, Cameron Harr wrote:
> From a little testing, the updated SCST tree doesn't work with the
> OFED-1.3.1 SRP stack, though I have gotten it working with the infiniband
> drivers in the normal distribution kernel. Shall I use those modules?

During all tests I ran myself with SCST and SRPT I have used a vanilla
Linux kernel instead of the OFED stack.

Bart.
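The "iop/runtime" figure mentioned earlier in this SRP thread, and the run-to-run variation under discussion, can be computed with a small helper. What follows is a minimal sketch, assuming each benchmark run reports a total count of completed I/O operations and its runtime in seconds; the function names iops and spread and the sample numbers in the note below are illustrative, not measurements from this thread.

```shell
#!/bin/sh
# iops OPS SECONDS
# Print completed I/O operations per second with two decimals.
iops() {
    awk -v ops="$1" -v secs="$2" 'BEGIN {
        if (secs <= 0) { print "runtime must be > 0" > "/dev/stderr"; exit 1 }
        printf "%.2f\n", ops / secs
    }'
}

# spread IOPS1 IOPS2 ...
# Print the minimum, the maximum, and (max - min) as a percentage of the
# mean, as a rough measure of variation between runs.
spread() {
    printf '%s\n' "$@" | awk '
        { v = $1 + 0; sum += v }
        NR == 1 { min = max = v }
        v < min { min = v }
        v > max { max = v }
        END {
            if (NR == 0 || sum == 0) exit 1
            printf "min=%.2f max=%.2f spread=%.1f%%\n", min, max, 100 * (max - min) / (sum / NR)
        }'
}
```

For example, "iops 3600000 180" prints 20000.00 for a three-minute run, and "spread 20000 18000 22000" prints min=18000.00 max=22000.00 spread=20.0%. Longer runs, as requested in the thread, make the per-run figure less sensitive to start-up effects.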
From vst at vlnb.net Wed Oct 22 10:37:50 2008
From: vst at vlnb.net (Vladislav Bolkhovitin)
Date: Wed, 22 Oct 2008 21:37:50 +0400
Subject: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <48FF60D3.9020809@harr.org>
References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org>
 <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org>
 <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org>
 <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org>
 <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net>
 <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org>
 <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com>
 <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org>
 <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org>
 <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org>
 <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org>
Message-ID: <48FF64EE.5050102@vlnb.net>

Cameron Harr wrote:
>>>>> Vladislav Bolkhovitin wrote:
>>>>>> I guess, you use a regular caching IO? The lowest packet size it
>>>>>> can produce is a PAGE_SIZE (4K). Target can't change it. You can
>>>>>> have lower packets only with O_DIRECT or sg interface. But I'm not
>>>>>> sure it will be performance effective.
>>>>> I do everything with Direct IO, which is automatic when using the
>>>>> BLOCKIO method in SCST.
>>>> I meant on initiator(s), not on the target.
>>>>
>>> Sorry - but yes, I always run the benchmark apps with direct IO
>> Then, there's one more reason why we should find out the cause of such
>> a big variation between runs. Can you repeat all the tests with the
>> latest SCST SVN trunk/ including SRPT driver with each run for at
>> least few minutes?
>
> From a little testing, the updated SCST tree doesn't work with the
> OFED-1.3.1 SRP stack, though I have gotten it working with the
> infiniband drivers in the normal distribution kernel. Shall I use those
> modules?

I think, only Vu can answer it.
> Also, as I mentioned before, my time is going to be fairly limited for a > while, but I'll try to squeeze this in and will make sure I run for > longer periods of time. I'll also try to calculate an exact end number > based on iop/runtime. >> Which backstorage do you use for BLOCKIO? > Fusion IO ioDrive. I generally use 1 or 2, but somtimes up to 4 at a time. It might be a reason of not stable results. Can you try with NULLIO to narrow things down a bit? From John.Marshall at ec.gc.ca Wed Oct 22 11:16:26 2008 From: John.Marshall at ec.gc.ca (John Marshall) Date: Wed, 22 Oct 2008 18:16:26 +0000 Subject: [ofa-general] OOM problem with ib_ipoib? Message-ID: <48FF6DFA.9080409@ec.gc.ca> Hi, Summary: I believe I have been having an OOM problem caused by the ib_ipoib module. I do not see the problem until it is loaded. The problem manifests itself when the kernel cache (grep Cached /proc/meminfo) containing file data is maxed out. Normally, the cached data should be written out and released by pdflush. In this case, it is not. Notes: 1) it is NOT necessary for the ib interfaces to actually be used or up! 2) I am using ofed 1.3.2 which I have built on my own machine. 3) I have similar weird behavior when using 1.4-rc3 and a 2.6.26 kernel. 
---------- System info: root# lsmod | grep ib ib_ipoib 77512 0 ib_cm 33260 1 ib_ipoib ib_sa 36628 2 ib_ipoib,ib_cm ib_mthca 124832 0 ib_umad 16232 0 ib_uverbs 38792 0 ib_mad 35188 4 ib_cm,ib_sa,ib_mthca,ib_umad ib_core 54304 7 ib_ipoib,ib_cm,ib_sa,ib_mthca,ib_umad,ib_uverbs,ib_mad ipv6 242980 29 ib_ipoib libata 145584 1 ata_generic scsi_mod 142316 6 sg,sr_mod,usb_storage,sd_mod,megaraid_sas,libata root# uname -r 2.6.24-etchnhalf.1-686-bigmem root# cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 15 model : 65 model name : Dual-Core AMD Opteron(tm) Processor 8220 stepping : 3 cpu MHz : 2793.163 cache size : 1024 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 2 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy ts fid vid ttp tm stc bogomips : 5589.70 clflush size : 64 ***** 7 more similar entries (2 cpu, 4-core each) **** root# cat /proc/meminfo cat /proc/meminfo MemTotal: 33274492 kB MemFree: 147716 kB Buffers: 840 kB Cached: 32532792 kB SwapCached: 0 kB Active: 19956 kB Inactive: 32524692 kB HighTotal: 32635808 kB HighFree: 77008 kB LowTotal: 638684 kB LowFree: 70708 kB SwapTotal: 16386260 kB SwapFree: 16386168 kB Dirty: 88 kB Writeback: 0 kB AnonPages: 11032 kB Mapped: 7940 kB Slab: 537012 kB SReclaimable: 487100 kB SUnreclaim: 49912 kB PageTables: 656 kB NFS_Unstable: 0 kB Bounce: 0 kB CommitLimit: 33023504 kB Committed_AS: 61360 kB VmallocTotal: 118776 kB VmallocUsed: 96800 kB VmallocChunk: 13112 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB # dpkg -l |grep ofed ii libibcm 1.0.2-1 ofed-1.3.2: libibcm ii libibcommon 1.0.8-1 ofed-1.3.2: libibcommon ii libibmad 1.1.6-1 ofed-1.3.2: libibmad ii libibumad 1.1.7-1 
ofed-1.3.2: libibumad ii libibverbs 1.1.1-1 ofed-1.3.2: libibverbs ii libipathverbs 1.1-1 ofed-1.3.2: libipathverbs ii libmlx4 1.0-1 ofed-1.3.2: libmlx ii libmthca 1.0.4-1 ofed-1.3.2: libmthca ii librdmacm 1.0.7-1 ofed-1.3.2: librdmacm ii libsdp 1.1.99-1 ofed-1.3.2: libsdp ii ofa-kernel 1.3.2-2.6.24-etchnhalf.1-686-bigmem-1 ofed-1.3.2: ofa_kernel ---------- How to provoke #1 (prior to loading ib_ipoib): non-root$ dd if=/dev/zero of=/tmp/50G bs=1M count=50000 root# modprobe ib_ipoib Output from dmesg: modprobe: page allocation failure. order:1, mode:0x20 Pid: 6839, comm: modprobe Not tainted 2.6.24-etchnhalf.1-686-bigmem #1 [] __alloc_pages+0x2c4/0x2d5 [] cache_alloc_refill+0x299/0x4b1 [] __kmalloc+0x75/0xbc [] __alloc_skb+0x49/0xf5 [] ipoib_cm_alloc_rx_skb+0x31/0x218 [ib_ipoib] [] ipoib_cm_dev_init+0x50c/0x552 [ib_ipoib] [] dma_pool_free+0xb0/0x18c [] ipoib_transport_dev_init+0xd2/0x3d1 [ib_ipoib] [] ipoib_ib_dev_init+0x2c/0x6e [ib_ipoib] [] ipoib_dev_init+0xab/0xd0 [ib_ipoib] [] ipoib_add_one+0x220/0x3cf [ib_ipoib] [] resched_task+0x52/0x54 [] ib_register_client+0x48/0x6c [ib_core] [] ipoib_init_module+0xd2/0xf8 [ib_ipoib] [] sys_init_module+0x15e3/0x16fb [] vma_prio_tree_insert+0x17/0x2a [] __kmalloc+0x0/0xbc [] syscall_call+0x7/0xb ======================= Mem-info: DMA per-cpu: CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0 CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0 CPU 2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0 CPU 3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0 CPU 4: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0 CPU 5: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0 CPU 6: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0 CPU 7: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0 Normal per-cpu: CPU 0: Hot: hi: 186, btch: 31 usd: 121 Cold: hi: 62, btch: 15 usd: 58 CPU 1: Hot: hi: 186, btch: 31 usd: 42 Cold: hi: 62, btch: 15 usd: 26 CPU 2: Hot: hi: 186, btch: 31 usd: 152 Cold: 
hi: 62, btch: 15 usd: 57 CPU 3: Hot: hi: 186, btch: 31 usd: 63 Cold: hi: 62, btch: 15 usd: 59 CPU 4: Hot: hi: 186, btch: 31 usd: 72 Cold: hi: 62, btch: 15 usd: 55 CPU 5: Hot: hi: 186, btch: 31 usd: 174 Cold: hi: 62, btch: 15 usd: 61 CPU 6: Hot: hi: 186, btch: 31 usd: 66 Cold: hi: 62, btch: 15 usd: 48 CPU 7: Hot: hi: 186, btch: 31 usd: 35 Cold: hi: 62, btch: 15 usd: 54 HighMem per-cpu: CPU 0: Hot: hi: 186, btch: 31 usd: 31 Cold: hi: 62, btch: 15 usd: 9 CPU 1: Hot: hi: 186, btch: 31 usd: 30 Cold: hi: 62, btch: 15 usd: 5 CPU 2: Hot: hi: 186, btch: 31 usd: 93 Cold: hi: 62, btch: 15 usd: 8 CPU 3: Hot: hi: 186, btch: 31 usd: 3 Cold: hi: 62, btch: 15 usd: 14 CPU 4: Hot: hi: 186, btch: 31 usd: 37 Cold: hi: 62, btch: 15 usd: 53 CPU 5: Hot: hi: 186, btch: 31 usd: 67 Cold: hi: 62, btch: 15 usd: 49 CPU 6: Hot: hi: 186, btch: 31 usd: 15 Cold: hi: 62, btch: 15 usd: 30 CPU 7: Hot: hi: 186, btch: 31 usd: 138 Cold: hi: 62, btch: 15 usd: 61 Active:5136 inactive:8135705 dirty:12 writeback:0 unstable:0 free:15715 slab:136280 mapped:2348 pagetables:164 bounce:0 DMA free:3524kB min:68kB low:84kB high:100kB active:0kB inactive:0kB present:16256kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 873 34020 34020 Normal free:1368kB min:3744kB low:4680kB high:5616kB active:288kB inactive:252kB present:894080kB pages_scanned:32 all_unreclaimable? no lowmem_reserve[]: 0 0 265176 265176 HighMem free:59588kB min:512kB low:36080kB high:71652kB active:20256kB inactive:32541032kB present:33942528kB pages_scanned:32 all_unreclaimable? 
no lowmem_reserve[]: 0 0 0 0 DMA: 2*4kB 4*8kB 4*16kB 4*32kB 5*64kB 1*128kB 3*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3496kB Normal: 0*4kB 0*8kB 1*16kB 0*32kB 1*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 1232kB HighMem: 34*4kB 23*8kB 28*16kB 2*32kB 4*64kB 1*128kB 4*256kB 3*512kB 2*1024kB 5*2048kB 11*4096kB = 61120kB Swap cache: add 27, delete 27, find 1/2, race 0+0 Free swap = 16386168kB Total swap = 16386260kB Free swap: 16386168kB 8781824 pages of RAM 8552448 pages of HIGHMEM 463201 reserved pages 8140201 pages shared 0 pages swap cached 12 pages dirty 0 pages writeback 2382 pages mapped 136255 pages slab 167 pages pagetables ib%d: failed to allocate receive buffer 144 ---------- How to provoke #2 (with ib_ipoib loaded): non-root$ dd if=/dev/zero of=/tmp/50G bs=1M count=50000 This results in an OOM triggering the OOM-killer which starts killing processes. ---------- Any help would be appreciated, as well as confirmation of the same sort of behavior. Thanks, John From sashak at voltaire.com Wed Oct 22 12:11:37 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 22 Oct 2008 21:11:37 +0200 Subject: [ofa-general] Re: [PATCH] opensm/scripts: handling opensm config file In-Reply-To: <1224696213.1197.371.camel@cardanus.llnl.gov> References: <48FF3F70.8000905@dev.mellanox.co.il> <20081022152246.GU20450@sashak.voltaire.com> <48FF51EE.4020804@dev.mellanox.co.il> <20081022162416.GY20450@sashak.voltaire.com> <1224694469.1197.367.camel@cardanus.llnl.gov> <20081022171807.GB20450@sashak.voltaire.com> <1224696213.1197.371.camel@cardanus.llnl.gov> Message-ID: <20081022191137.GB28713@sashak.voltaire.com> On 10:23 Wed 22 Oct , Al Chu wrote: > > Absolutely. My point being that if you wanted to specify an alternate > config path (w/o editting the init script), I think specify OPTIONS in > the sysconfig is the only way?? Ok. I see your point. So let's just unify things. 
I would suggest at least to call the file in
'/etc/sysconfig/opensm' and not 'opensm.conf' to prevent confusions with
OpenSM config. The patch is below, any objections, comments?

Sasha


>From d2771cd83bc69e3fed00be0eecea943d53d5b0eb Mon Sep 17 00:00:00 2001
From: Sasha Khapyorsky
Date: Wed, 22 Oct 2008 20:56:37 +0200
Subject: [PATCH] opensm/scripts: unify scripts' config

Startup scripts will use /etc/sysconfig/opensm file as its config
scripts where various environment variables could be defined.

Signed-off-by: Sasha Khapyorsky
---
 opensm/scripts/opensm.init.in        |    8 +++++---
 opensm/scripts/redhat-opensm.init.in |    6 +++---
 opensm/scripts/sldd.sh.in            |   14 ++++----------
 3 files changed, 12 insertions(+), 16 deletions(-)

diff --git a/opensm/scripts/opensm.init.in b/opensm/scripts/opensm.init.in
index c31f017..52293eb 100644
--- a/opensm/scripts/opensm.init.in
+++ b/opensm/scripts/opensm.init.in
@@ -53,13 +53,15 @@ if [[ -s /etc/rc.status ]]; then
     failure() { rc_status -v; }
     success() { rc_status -v; }
 fi
-if [[ -s /etc/sysconfig/opensm ]]; then
-    . /etc/sysconfig/opensm
+
+CONFIG=@sysconfdir@/sysconfig/opensm
+if [[ -s $CONFIG ]]; then
+    . $CONFIG
 fi
 
 start () {
     echo -n "Starting opensm: "
-    @sbindir@/opensm -B $OPTIONS > /dev/null
+    @sbindir@/opensm --daemon $OPTIONS > /dev/null
     if [[ $RETVAL -eq 0 ]]; then
         touch /var/lock/subsys/opensm
         success
diff --git a/opensm/scripts/redhat-opensm.init.in b/opensm/scripts/redhat-opensm.init.in
index aad783b..9c22275 100755
--- a/opensm/scripts/redhat-opensm.init.in
+++ b/opensm/scripts/redhat-opensm.init.in
@@ -39,7 +39,7 @@
 # $Id: openib-1.0-opensm.init,v 1.5 2006/08/02 18:18:23 dledford Exp $
 #
 # processname: @sbindir@/opensm
-# config: @sysconfdir@/sysconfig/opensm.conf
+# config: @sysconfdir@/sysconfig/opensm
 # pidfile: /var/run/opensm.pid
 
 prefix=@prefix@
@@ -47,7 +47,7 @@ exec_prefix=@exec_prefix@
 
 . /etc/rc.d/init.d/functions
 
-CONFIG=@sysconfdir@/sysconfig/opensm.conf
+CONFIG=@sysconfdir@/sysconfig/opensm
 if [ -f $CONFIG ]; then
     . $CONFIG
 fi
@@ -147,7 +147,7 @@ start()
 
     # Start opensm
     echo -n "Starting IB Subnet Manager"
-    $prog --daemon ${HONORE_GUID2LID} > /dev/null
+    $prog --daemon ${HONORE_GUID2LID} ${OPTIONS} > /dev/null
     cnt=0; alive=0
     while [ $cnt -lt 6 -a $alive -ne 1 ]; do
         echo -n ".";
diff --git a/opensm/scripts/sldd.sh.in b/opensm/scripts/sldd.sh.in
index edb1454..f7635fe 100755
--- a/opensm/scripts/sldd.sh.in
+++ b/opensm/scripts/sldd.sh.in
@@ -42,18 +42,12 @@
 prefix=@prefix@
 exec_prefix=@exec_prefix@
 
-# config: @sysconfdir@/sysconfig/opensm.conf
-
-[ -f @sysconfdir@/sysconfig/opensm.conf ] && CONFIG=@sysconfdir@/sysconfig/opensm.conf
-
-SLDD_DEBUG=${SLDD_DEBUG:-0}
-
-if [ -z "$CONFIG" ]; then
-    [ $SLDD_DEBUG -eq 1 ] && echo "Config file not found."
-    exit 0
+CONFIG=@sysconfdir@/sysconfig/opensm
+if [ -f $CONFIG ]; then
+    . $CONFIG
 fi
 
-. ${CONFIG}
+SLDD_DEBUG=${SLDD_DEBUG:-0}
 
 CACHE_FILE=${CACHE_FILE:-/var/cache/opensm/guid2lid}
 CACHE_DIR=$(dirname ${CACHE_FILE})
-- 
1.6.0.1.196.g01914

From kliteyn at dev.mellanox.co.il Wed Oct 22 12:17:05 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 22 Oct 2008 21:17:05 +0200
Subject: [ofa-general] Re: [PATCH] opensm/scripts: handling opensm config file
In-Reply-To: <20081022191137.GB28713@sashak.voltaire.com>
References: <48FF3F70.8000905@dev.mellanox.co.il> <20081022152246.GU20450@sashak.voltaire.com>
 <48FF51EE.4020804@dev.mellanox.co.il> <20081022162416.GY20450@sashak.voltaire.com>
 <1224694469.1197.367.camel@cardanus.llnl.gov> <20081022171807.GB20450@sashak.voltaire.com>
 <1224696213.1197.371.camel@cardanus.llnl.gov> <20081022191137.GB28713@sashak.voltaire.com>
Message-ID: <48FF7C31.8030502@dev.mellanox.co.il>

Sasha Khapyorsky wrote:
> On 10:23 Wed 22 Oct , Al Chu wrote:
>> Absolutely.
My point being that if you wanted to specify an alternate >> config path (w/o editting the init script), I think specify OPTIONS in >> the sysconfig is the only way?? > > Ok. I see your point. > > So let's just unify things. I would suggest at least to call the file in > '/etc/sysconfig/opensm' and not 'opensm.conf' to prevent confusions with > OpenSM config. The patch is below, any objections, comments? I'm all for it. -- Yevgeny > Sasha > > >>From d2771cd83bc69e3fed00be0eecea943d53d5b0eb Mon Sep 17 00:00:00 2001 > From: Sasha Khapyorsky > Date: Wed, 22 Oct 2008 20:56:37 +0200 > Subject: [PATCH] opensm/scripts: unify scripts' config > > Startup scripts will use /etc/sysconfig/opensm file as its config > scripts where various environment variables could be defined. > > Signed-off-by: Sasha Khapyorsky > --- > opensm/scripts/opensm.init.in | 8 +++++--- > opensm/scripts/redhat-opensm.init.in | 6 +++--- > opensm/scripts/sldd.sh.in | 14 ++++---------- > 3 files changed, 12 insertions(+), 16 deletions(-) > > diff --git a/opensm/scripts/opensm.init.in b/opensm/scripts/opensm.init.in > index c31f017..52293eb 100644 > --- a/opensm/scripts/opensm.init.in > +++ b/opensm/scripts/opensm.init.in > @@ -53,13 +53,15 @@ if [[ -s /etc/rc.status ]]; then > failure() { rc_status -v; } > success() { rc_status -v; } > fi > -if [[ -s /etc/sysconfig/opensm ]]; then > - . /etc/sysconfig/opensm > + > +CONFIG=@sysconfdir@/sysconfig/opensm > +if [[ -s $CONFIG ]]; then > + . 
$CONFIG > fi > > start () { > echo -n "Starting opensm: " > - @sbindir@/opensm -B $OPTIONS > /dev/null > + @sbindir@/opensm --daemon $OPTIONS > /dev/null > if [[ $RETVAL -eq 0 ]]; then > touch /var/lock/subsys/opensm > success > diff --git a/opensm/scripts/redhat-opensm.init.in b/opensm/scripts/redhat-opensm.init.in > index aad783b..9c22275 100755 > --- a/opensm/scripts/redhat-opensm.init.in > +++ b/opensm/scripts/redhat-opensm.init.in > @@ -39,7 +39,7 @@ > # $Id: openib-1.0-opensm.init,v 1.5 2006/08/02 18:18:23 dledford Exp $ > # > # processname: @sbindir@/opensm > -# config: @sysconfdir@/sysconfig/opensm.conf > +# config: @sysconfdir@/sysconfig/opensm > # pidfile: /var/run/opensm.pid > > prefix=@prefix@ > @@ -47,7 +47,7 @@ exec_prefix=@exec_prefix@ > > . /etc/rc.d/init.d/functions > > -CONFIG=@sysconfdir@/sysconfig/opensm.conf > +CONFIG=@sysconfdir@/sysconfig/opensm > if [ -f $CONFIG ]; then > . $CONFIG > fi > @@ -147,7 +147,7 @@ start() > > # Start opensm > echo -n "Starting IB Subnet Manager" > - $prog --daemon ${HONORE_GUID2LID} > /dev/null > + $prog --daemon ${HONORE_GUID2LID} ${OPTIONS} > /dev/null > cnt=0; alive=0 > while [ $cnt -lt 6 -a $alive -ne 1 ]; do > echo -n "."; > diff --git a/opensm/scripts/sldd.sh.in b/opensm/scripts/sldd.sh.in > index edb1454..f7635fe 100755 > --- a/opensm/scripts/sldd.sh.in > +++ b/opensm/scripts/sldd.sh.in > @@ -42,18 +42,12 @@ > prefix=@prefix@ > exec_prefix=@exec_prefix@ > > -# config: @sysconfdir@/sysconfig/opensm.conf > - > -[ -f @sysconfdir@/sysconfig/opensm.conf ] && CONFIG=@sysconfdir@/sysconfig/opensm.conf > - > -SLDD_DEBUG=${SLDD_DEBUG:-0} > - > -if [ -z "$CONFIG" ]; then > - [ $SLDD_DEBUG -eq 1 ] && echo "Config file not found." > - exit 0 > +CONFIG=@sysconfdir@/sysconfig/opensm > +if [ -f $CONFIG ]; then > + . $CONFIG > fi > > -. 
${CONFIG} > +SLDD_DEBUG=${SLDD_DEBUG:-0} > > CACHE_FILE=${CACHE_FILE:-/var/cache/opensm/guid2lid} > CACHE_DIR=$(dirname ${CACHE_FILE}) From chu11 at llnl.gov Wed Oct 22 13:30:37 2008 From: chu11 at llnl.gov (Al Chu) Date: Wed, 22 Oct 2008 13:30:37 -0700 Subject: [ofa-general] Re: [PATCH] opensm/scripts: handling opensm config file In-Reply-To: <20081022191137.GB28713@sashak.voltaire.com> References: <48FF3F70.8000905@dev.mellanox.co.il> <20081022152246.GU20450@sashak.voltaire.com> <48FF51EE.4020804@dev.mellanox.co.il> <20081022162416.GY20450@sashak.voltaire.com> <1224694469.1197.367.camel@cardanus.llnl.gov> <20081022171807.GB20450@sashak.voltaire.com> <1224696213.1197.371.camel@cardanus.llnl.gov> <20081022191137.GB28713@sashak.voltaire.com> Message-ID: <1224707437.1197.378.camel@cardanus.llnl.gov> On Wed, 2008-10-22 at 21:11 +0200, Sasha Khapyorsky wrote: > On 10:23 Wed 22 Oct , Al Chu wrote: > > > > Absolutely. My point being that if you wanted to specify an alternate > > config path (w/o editting the init script), I think specify OPTIONS in > > the sysconfig is the only way?? > > Ok. I see your point. > > So let's just unify things. I would suggest at least to call the file in > '/etc/sysconfig/opensm' and not 'opensm.conf' to prevent confusions with > OpenSM config. The patch is below, any objections, comments? Looks good to me. Al > Sasha > > > >From d2771cd83bc69e3fed00be0eecea943d53d5b0eb Mon Sep 17 00:00:00 2001 > From: Sasha Khapyorsky > Date: Wed, 22 Oct 2008 20:56:37 +0200 > Subject: [PATCH] opensm/scripts: unify scripts' config > > Startup scripts will use /etc/sysconfig/opensm file as its config > scripts where various environment variables could be defined. 
> > Signed-off-by: Sasha Khapyorsky > --- > opensm/scripts/opensm.init.in | 8 +++++--- > opensm/scripts/redhat-opensm.init.in | 6 +++--- > opensm/scripts/sldd.sh.in | 14 ++++---------- > 3 files changed, 12 insertions(+), 16 deletions(-) > > diff --git a/opensm/scripts/opensm.init.in b/opensm/scripts/opensm.init.in > index c31f017..52293eb 100644 > --- a/opensm/scripts/opensm.init.in > +++ b/opensm/scripts/opensm.init.in > @@ -53,13 +53,15 @@ if [[ -s /etc/rc.status ]]; then > failure() { rc_status -v; } > success() { rc_status -v; } > fi > -if [[ -s /etc/sysconfig/opensm ]]; then > - . /etc/sysconfig/opensm > + > +CONFIG=@sysconfdir@/sysconfig/opensm > +if [[ -s $CONFIG ]]; then > + . $CONFIG > fi > > start () { > echo -n "Starting opensm: " > - @sbindir@/opensm -B $OPTIONS > /dev/null > + @sbindir@/opensm --daemon $OPTIONS > /dev/null > if [[ $RETVAL -eq 0 ]]; then > touch /var/lock/subsys/opensm > success > diff --git a/opensm/scripts/redhat-opensm.init.in b/opensm/scripts/redhat-opensm.init.in > index aad783b..9c22275 100755 > --- a/opensm/scripts/redhat-opensm.init.in > +++ b/opensm/scripts/redhat-opensm.init.in > @@ -39,7 +39,7 @@ > # $Id: openib-1.0-opensm.init,v 1.5 2006/08/02 18:18:23 dledford Exp $ > # > # processname: @sbindir@/opensm > -# config: @sysconfdir@/sysconfig/opensm.conf > +# config: @sysconfdir@/sysconfig/opensm > # pidfile: /var/run/opensm.pid > > prefix=@prefix@ > @@ -47,7 +47,7 @@ exec_prefix=@exec_prefix@ > > . /etc/rc.d/init.d/functions > > -CONFIG=@sysconfdir@/sysconfig/opensm.conf > +CONFIG=@sysconfdir@/sysconfig/opensm > if [ -f $CONFIG ]; then > . 
$CONFIG > fi > @@ -147,7 +147,7 @@ start() > > # Start opensm > echo -n "Starting IB Subnet Manager" > - $prog --daemon ${HONORE_GUID2LID} > /dev/null > + $prog --daemon ${HONORE_GUID2LID} ${OPTIONS} > /dev/null > cnt=0; alive=0 > while [ $cnt -lt 6 -a $alive -ne 1 ]; do > echo -n "."; > diff --git a/opensm/scripts/sldd.sh.in b/opensm/scripts/sldd.sh.in > index edb1454..f7635fe 100755 > --- a/opensm/scripts/sldd.sh.in > +++ b/opensm/scripts/sldd.sh.in > @@ -42,18 +42,12 @@ > prefix=@prefix@ > exec_prefix=@exec_prefix@ > > -# config: @sysconfdir@/sysconfig/opensm.conf > - > -[ -f @sysconfdir@/sysconfig/opensm.conf ] && CONFIG=@sysconfdir@/sysconfig/opensm.conf > - > -SLDD_DEBUG=${SLDD_DEBUG:-0} > - > -if [ -z "$CONFIG" ]; then > - [ $SLDD_DEBUG -eq 1 ] && echo "Config file not found." > - exit 0 > +CONFIG=@sysconfdir@/sysconfig/opensm > +if [ -f $CONFIG ]; then > + . $CONFIG > fi > > -. ${CONFIG} > +SLDD_DEBUG=${SLDD_DEBUG:-0} > > CACHE_FILE=${CACHE_FILE:-/var/cache/opensm/guid2lid} > CACHE_DIR=$(dirname ${CACHE_FILE}) -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From vuhuong at mellanox.com Wed Oct 22 13:41:20 2008 From: vuhuong at mellanox.com (Vu Pham) Date: Wed, 22 Oct 2008 13:41:20 -0700 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48FF64EE.5050102@vlnb.net> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> 
 <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <48FF64EE.5050102@vlnb.net>
Message-ID: <48FF8FF0.2050308@mellanox.com>

Vladislav Bolkhovitin wrote:
> Cameron Harr wrote:
>>>>>> Vladislav Bolkhovitin wrote:
>>>>>>> I guess, you use a regular caching IO? The lowest packet size it
>>>>>>> can produce is a PAGE_SIZE (4K). Target can't change it. You can
>>>>>>> have lower packets only with O_DIRECT or sg interface. But I'm
>>>>>>> not sure it will be performance effective.
>>>>>> I do everything with Direct IO, which is automatic when using the
>>>>>> BLOCKIO method in SCST.
>>>>> I meant on initiator(s), not on the target.
>>>>>
>>>> Sorry - but yes, I always run the benchmark apps with direct IO
>>> Then, there's one more reason why we should find out the cause of
>>> such a big variation between runs. Can you repeat all the tests with
>>> the latest SCST SVN trunk/ including SRPT driver with each run for
>>> at least few minutes?
>>
>> From a little testing, the updated SCST tree doesn't work with the
>> OFED-1.3.1 SRP stack, though I have gotten it working with the
>> infiniband drivers in the normal distribution kernel. Shall I use
>> those modules?
>
> I think, only Vu can answer it.

ofed-1.3.1 srpt only works with scst-1.0.0
You can run 2.6.26 + its IB stack + latest SVN scst and srpt

-vu

>
>> Also, as I mentioned before, my time is going to be fairly limited
>> for a while, but I'll try to squeeze this in and will make sure I run
>> for longer periods of time. I'll also try to calculate an exact end
>> number based on iop/runtime.
>>> Which backstorage do you use for BLOCKIO?
>> Fusion IO ioDrive. I generally use 1 or 2, but sometimes up to 4 at a
>> time.
>
> It might be a reason of not stable results. Can you try with NULLIO to
> narrow things down a bit?
From cameron at harr.org Wed Oct 22 13:49:14 2008
From: cameron at harr.org (Cameron Harr)
Date: Wed, 22 Oct 2008 14:49:14 -0600
Subject: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <48FF8FF0.2050308@mellanox.com>
References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org>
 <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org>
 <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org>
 <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org>
 <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net>
 <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org>
 <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com>
 <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org>
 <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org>
 <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org>
 <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org>
 <48FF64EE.5050102@vlnb.net> <48FF8FF0.2050308@mellanox.com>
Message-ID: <48FF91CA.1060603@harr.org>

Vu Pham wrote:
> Vladislav Bolkhovitin wrote:
>> Cameron Harr wrote:
>>>>>>> Vladislav Bolkhovitin wrote:
>>>>>>>> I guess, you use a regular caching IO? The lowest packet size
>>>>>>>> it can produce is a PAGE_SIZE (4K). Target can't change it. You
>>>>>>>> can have lower packets only with O_DIRECT or sg interface. But
>>>>>>>> I'm not sure it will be performance effective.
>>>>>>> I do everything with Direct IO, which is automatic when using
>>>>>>> the BLOCKIO method in SCST.
>>>>>> I meant on initiator(s), not on the target.
>>>>>>
>>>>> Sorry - but yes, I always run the benchmark apps with direct IO
>>>> Then, there's one more reason why we should find out the cause of
>>>> such a big variation between runs. Can you repeat all the tests
>>>> with the latest SCST SVN trunk/ including SRPT driver with each run
>>>> for at least few minutes?
>>>
>>> From a little testing, the updated SCST tree doesn't work with the
>>> OFED-1.3.1 SRP stack, though I have gotten it working with the
>>> infiniband drivers in the normal distribution kernel. Shall I use
>>> those modules?
>>
>> I think, only Vu can answer it.
>
> ofed-1.3.1 srpt only works with scst-1.0.0
> You can run 2.6.26 + its IB stack + latest SVN scst and srpt
>
> -vu

I think I'm stuck with the default CentOS kernel: 2.6.18-92.1.13.el5,
but I can use that.

>
>>
>>> Also, as I mentioned before, my time is going to be fairly limited
>>> for a while, but I'll try to squeeze this in and will make sure I
>>> run for longer periods of time. I'll also try to calculate an exact
>>> end number based on iop/runtime.
>>>> Which backstorage do you use for BLOCKIO?
>>> Fusion IO ioDrive. I generally use 1 or 2, but sometimes up to 4 at a
>>> time.
>>
>> It might be a reason of not stable results. Can you try with NULLIO
>> to narrow things down a bit?

I actually tested NULLIO a while back at Vu's suggestion, but I believe
I forgot to report on it. As I recall, NULLIO gave me pretty consistent
block sizes of 512B, which is the BS I used in the benchmark.
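
The "exact end number based on iop/runtime" mentioned above is simply
completed I/Os divided by elapsed seconds. A minimal sketch follows; the
helper name iops and the sample figures are made up for illustration and do
not come from the runs discussed in this thread.

```shell
#!/bin/sh
# Sketch only: average IOPS from a completed-I/O count and a runtime.
# iops is a hypothetical helper name, not part of any benchmark tool.
iops() {
    # $1 = total completed I/Os, $2 = runtime in seconds
    # Exits non-zero if the runtime is not positive.
    awk -v n="$1" -v t="$2" 'BEGIN { if (t <= 0) exit 1; printf "%.0f\n", n / t }'
}

# Illustrative figures: 6,000,000 I/Os completed in 120 seconds.
iops 6000000 120   # prints 50000
```

At the 512B block size used in the benchmark, a hypothetical 50000 IOPS
would correspond to 25.6 MB/s (50000 x 512 bytes per second), which is a
quick sanity check when comparing runs of different lengths.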
From sashak at voltaire.com Wed Oct 22 14:14:10 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 22 Oct 2008 23:14:10 +0200 Subject: [ofa-general] Re: [PATCH] opensm/scripts: handling opensm config file In-Reply-To: <1224707437.1197.378.camel@cardanus.llnl.gov> References: <48FF3F70.8000905@dev.mellanox.co.il> <20081022152246.GU20450@sashak.voltaire.com> <48FF51EE.4020804@dev.mellanox.co.il> <20081022162416.GY20450@sashak.voltaire.com> <1224694469.1197.367.camel@cardanus.llnl.gov> <20081022171807.GB20450@sashak.voltaire.com> <1224696213.1197.371.camel@cardanus.llnl.gov> <20081022191137.GB28713@sashak.voltaire.com> <1224707437.1197.378.camel@cardanus.llnl.gov> Message-ID: <20081022211410.GC28713@sashak.voltaire.com> On 13:30 Wed 22 Oct , Al Chu wrote: > On Wed, 2008-10-22 at 21:11 +0200, Sasha Khapyorsky wrote: > > On 10:23 Wed 22 Oct , Al Chu wrote: > > > > > > Absolutely. My point being that if you wanted to specify an alternate > > > config path (w/o editting the init script), I think specify OPTIONS in > > > the sysconfig is the only way?? > > > > Ok. I see your point. > > > > So let's just unify things. I would suggest at least to call the file in > > '/etc/sysconfig/opensm' and not 'opensm.conf' to prevent confusions with > > OpenSM config. The patch is below, any objections, comments? > > Looks good to me. Applied. Sasha From John.Marshall at ec.gc.ca Wed Oct 22 15:16:13 2008 From: John.Marshall at ec.gc.ca (John Marshall) Date: Wed, 22 Oct 2008 22:16:13 +0000 Subject: [ofa-general] OOM problem with ib_ipoib? In-Reply-To: <48FF6DFA.9080409@ec.gc.ca> References: <48FF6DFA.9080409@ec.gc.ca> Message-ID: <48FFA62D.3030305@ec.gc.ca> John Marshall wrote: > Hi, > > Summary: I believe I have been having an OOM problem caused by the > ib_ipoib module. I do not see the problem until it is > loaded. The problem manifests itself when the kernel cache > (grep Cached /proc/meminfo) containing file data is maxed > out. 
Normally, the cached data should be written out and > released by pdflush. In this case, it is not. > > Notes: > 1) it is NOT necessary for the ib interfaces to actually > be used or up! > 2) I am using ofed 1.3.2 which I have built on my own > machine. > 3) I have similar weird behavior when using 1.4-rc3 > and a 2.6.26 kernel. An additional item: when rebuilt for the same 2.6.24 kernel as mentioned below, but without BIGMEM, I do not encounter the same problem. > > ---------- > > System info: > > root# lsmod | grep ib > ib_ipoib 77512 0 > ib_cm 33260 1 ib_ipoib > ib_sa 36628 2 ib_ipoib,ib_cm > ib_mthca 124832 0 > ib_umad 16232 0 > ib_uverbs 38792 0 > ib_mad 35188 4 ib_cm,ib_sa,ib_mthca,ib_umad > ib_core 54304 7 > ib_ipoib,ib_cm,ib_sa,ib_mthca,ib_umad,ib_uverbs,ib_mad > ipv6 242980 29 ib_ipoib > libata 145584 1 ata_generic > scsi_mod 142316 6 > sg,sr_mod,usb_storage,sd_mod,megaraid_sas,libata > > root# uname -r > 2.6.24-etchnhalf.1-686-bigmem > > root# cat /proc/cpuinfo > processor : 0 > vendor_id : AuthenticAMD > cpu family : 15 > model : 65 > model name : Dual-Core AMD Opteron(tm) Processor 8220 > stepping : 3 > cpu MHz : 2793.163 > cache size : 1024 KB > physical id : 0 > siblings : 2 > core id : 0 > cpu cores : 2 > fdiv_bug : no > hlt_bug : no > f00f_bug : no > coma_bug : no > fpu : yes > fpu_exception : yes > cpuid level : 1 > wp : yes > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge > mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext > fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm > extapic cr8_legacy ts fid vid ttp tm stc > bogomips : 5589.70 > clflush size : 64 > > ***** 7 more similar entries (2 cpu, 4-core each) **** > > root# cat /proc/meminfo > cat /proc/meminfo > MemTotal: 33274492 kB > MemFree: 147716 kB > Buffers: 840 kB > Cached: 32532792 kB > SwapCached: 0 kB > Active: 19956 kB > Inactive: 32524692 kB > HighTotal: 32635808 kB > HighFree: 77008 kB > LowTotal: 638684 kB > LowFree: 70708 kB > 
SwapTotal: 16386260 kB > SwapFree: 16386168 kB > Dirty: 88 kB > Writeback: 0 kB > AnonPages: 11032 kB > Mapped: 7940 kB > Slab: 537012 kB > SReclaimable: 487100 kB > SUnreclaim: 49912 kB > PageTables: 656 kB > NFS_Unstable: 0 kB > Bounce: 0 kB > CommitLimit: 33023504 kB > Committed_AS: 61360 kB > VmallocTotal: 118776 kB > VmallocUsed: 96800 kB > VmallocChunk: 13112 kB > HugePages_Total: 0 > HugePages_Free: 0 > HugePages_Rsvd: 0 > HugePages_Surp: 0 > Hugepagesize: 2048 kB > > # dpkg -l |grep ofed > ii libibcm > 1.0.2-1 ofed-1.3.2: libibcm > ii libibcommon > 1.0.8-1 ofed-1.3.2: libibcommon > ii libibmad > 1.1.6-1 ofed-1.3.2: libibmad > ii libibumad > 1.1.7-1 ofed-1.3.2: libibumad > ii libibverbs > 1.1.1-1 ofed-1.3.2: libibverbs > ii libipathverbs > 1.1-1 ofed-1.3.2: libipathverbs > ii libmlx4 > 1.0-1 ofed-1.3.2: libmlx > ii libmthca > 1.0.4-1 ofed-1.3.2: libmthca > ii librdmacm > 1.0.7-1 ofed-1.3.2: librdmacm > ii libsdp > 1.1.99-1 ofed-1.3.2: libsdp > ii ofa-kernel > 1.3.2-2.6.24-etchnhalf.1-686-bigmem-1 ofed-1.3.2: ofa_kernel > > ---------- > > How to provoke #1 (prior to loading ib_ipoib): > > non-root$ dd if=/dev/zero of=/tmp/50G bs=1M count=50000 > > root# modprobe ib_ipoib > > Output from dmesg: > > modprobe: page allocation failure. 
order:1, mode:0x20 > Pid: 6839, comm: modprobe Not tainted 2.6.24-etchnhalf.1-686-bigmem #1 > [] __alloc_pages+0x2c4/0x2d5 > [] cache_alloc_refill+0x299/0x4b1 > [] __kmalloc+0x75/0xbc > [] __alloc_skb+0x49/0xf5 > [] ipoib_cm_alloc_rx_skb+0x31/0x218 [ib_ipoib] > [] ipoib_cm_dev_init+0x50c/0x552 [ib_ipoib] > [] dma_pool_free+0xb0/0x18c > [] ipoib_transport_dev_init+0xd2/0x3d1 [ib_ipoib] > [] ipoib_ib_dev_init+0x2c/0x6e [ib_ipoib] > [] ipoib_dev_init+0xab/0xd0 [ib_ipoib] > [] ipoib_add_one+0x220/0x3cf [ib_ipoib] > [] resched_task+0x52/0x54 > [] ib_register_client+0x48/0x6c [ib_core] > [] ipoib_init_module+0xd2/0xf8 [ib_ipoib] > [] sys_init_module+0x15e3/0x16fb > [] vma_prio_tree_insert+0x17/0x2a > [] __kmalloc+0x0/0xbc > [] syscall_call+0x7/0xb > ======================= > Mem-info: > DMA per-cpu: > CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: > 1 usd: 0 > CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: > 1 usd: 0 > CPU 2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: > 1 usd: 0 > CPU 3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: > 1 usd: 0 > CPU 4: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: > 1 usd: 0 > CPU 5: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: > 1 usd: 0 > CPU 6: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: > 1 usd: 0 > CPU 7: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: > 1 usd: 0 > Normal per-cpu: > CPU 0: Hot: hi: 186, btch: 31 usd: 121 Cold: hi: 62, btch: > 15 usd: 58 > CPU 1: Hot: hi: 186, btch: 31 usd: 42 Cold: hi: 62, btch: > 15 usd: 26 > CPU 2: Hot: hi: 186, btch: 31 usd: 152 Cold: hi: 62, btch: > 15 usd: 57 > CPU 3: Hot: hi: 186, btch: 31 usd: 63 Cold: hi: 62, btch: > 15 usd: 59 > CPU 4: Hot: hi: 186, btch: 31 usd: 72 Cold: hi: 62, btch: > 15 usd: 55 > CPU 5: Hot: hi: 186, btch: 31 usd: 174 Cold: hi: 62, btch: > 15 usd: 61 > CPU 6: Hot: hi: 186, btch: 31 usd: 66 Cold: hi: 62, btch: > 15 usd: 48 > CPU 7: Hot: hi: 186, btch: 31 usd: 35 Cold: hi: 62, btch: > 15 usd: 54 > HighMem per-cpu: > CPU 0: Hot: hi: 186, btch: 31 usd: 31 
Cold: hi: 62, btch: > 15 usd: 9 > CPU 1: Hot: hi: 186, btch: 31 usd: 30 Cold: hi: 62, btch: > 15 usd: 5 > CPU 2: Hot: hi: 186, btch: 31 usd: 93 Cold: hi: 62, btch: > 15 usd: 8 > CPU 3: Hot: hi: 186, btch: 31 usd: 3 Cold: hi: 62, btch: > 15 usd: 14 > CPU 4: Hot: hi: 186, btch: 31 usd: 37 Cold: hi: 62, btch: > 15 usd: 53 > CPU 5: Hot: hi: 186, btch: 31 usd: 67 Cold: hi: 62, btch: > 15 usd: 49 > CPU 6: Hot: hi: 186, btch: 31 usd: 15 Cold: hi: 62, btch: > 15 usd: 30 > CPU 7: Hot: hi: 186, btch: 31 usd: 138 Cold: hi: 62, btch: > 15 usd: 61 > Active:5136 inactive:8135705 dirty:12 writeback:0 unstable:0 > free:15715 slab:136280 mapped:2348 pagetables:164 bounce:0 > DMA free:3524kB min:68kB low:84kB high:100kB active:0kB inactive:0kB > present:16256kB pages_scanned:0 all_unreclaimable? yes > lowmem_reserve[]: 0 873 34020 34020 > Normal free:1368kB min:3744kB low:4680kB high:5616kB active:288kB > inactive:252kB present:894080kB pages_scanned:32 all_unreclaimable? no > lowmem_reserve[]: 0 0 265176 265176 > HighMem free:59588kB min:512kB low:36080kB high:71652kB active:20256kB > inactive:32541032kB present:33942528kB pages_scanned:32 > all_unreclaimable? 
no > lowmem_reserve[]: 0 0 0 0 > DMA: 2*4kB 4*8kB 4*16kB 4*32kB 5*64kB 1*128kB 3*256kB 0*512kB 0*1024kB > 1*2048kB 0*4096kB = 3496kB > Normal: 0*4kB 0*8kB 1*16kB 0*32kB 1*64kB 1*128kB 0*256kB 0*512kB > 1*1024kB 0*2048kB 0*4096kB = 1232kB > HighMem: 34*4kB 23*8kB 28*16kB 2*32kB 4*64kB 1*128kB 4*256kB 3*512kB > 2*1024kB 5*2048kB 11*4096kB = 61120kB > Swap cache: add 27, delete 27, find 1/2, race 0+0 > Free swap = 16386168kB > Total swap = 16386260kB > Free swap: 16386168kB > 8781824 pages of RAM > 8552448 pages of HIGHMEM > 463201 reserved pages > 8140201 pages shared > 0 pages swap cached > 12 pages dirty > 0 pages writeback > 2382 pages mapped > 136255 pages slab > 167 pages pagetables > ib%d: failed to allocate receive buffer 144 > > ---------- > > How to provoke #2 (with ib_ipoib loaded): > > non-root$ dd if=/dev/zero of=/tmp/50G bs=1M count=50000 > > This results in an OOM triggering the OOM-killer which starts killing > processes. > > ---------- > > Any help would be appreciated, as well as confirmation of the same > sort of behavior. 
> > Thanks, > John > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Wed Oct 22 15:50:23 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 22 Oct 2008 15:50:23 -0700 Subject: [ofa-general] Re: [PATCH] IB/ipoib: enhance ethtool support In-Reply-To: (Or Gerlitz's message of "Thu, 16 Oct 2008 15:13:04 +0200 (IST)") References: Message-ID: thanks, applied From rdreier at cisco.com Wed Oct 22 15:50:27 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 22 Oct 2008 15:50:27 -0700 Subject: [ofa-general] Re: [PATCH v2] IB/ipoib: fix netdev offload features support for child (VLAN) devices In-Reply-To: (Or Gerlitz's message of "Thu, 16 Oct 2008 15:20:50 +0200 (IST)") References: Message-ID: thanks, applied. From mhagen at iol.unh.edu Wed Oct 22 07:09:12 2008 From: mhagen at iol.unh.edu (Mikkel Hagen) Date: Wed, 22 Oct 2008 10:09:12 -0400 Subject: [ofa-general] Re: [Ofalab] [ewg] Update from September OpenFabrics Interoperability Event at UNH-IOL In-Reply-To: <48FEE7BF.2020604@voltaire.com> References: <48FB2C81.3080301@mellanox.co.il> <4D511C95BE7F4D8E92B2BAE9E0D67AA0@annapurna> <48FE05F0.8070608@iol.unh.edu> <48FEE7BF.2020604@voltaire.com> Message-ID: <48FF3408.40600@iol.unh.edu> An HTML attachment was scrubbed... URL: From rdreier at cisco.com Wed Oct 22 15:54:53 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 22 Oct 2008 15:54:53 -0700 Subject: [ofa-general] Re: [PATCH]IB/ehca:reject dynamic memory add/remove In-Reply-To: <200810131310.34413.ossrosch@linux.vnet.ibm.com> (Stefan Roscher's message of "Mon, 13 Oct 2008 13:10:32 +0200") References: <200810131310.34413.ossrosch@linux.vnet.ibm.com> Message-ID: thanks, applied with a slightly expanded changelog. 
From wangwhao at cn.ibm.com Wed Oct 22 17:49:16 2008 From: wangwhao at cn.ibm.com (Wen Hao Wang) Date: Thu, 23 Oct 2008 08:49:16 +0800 Subject: [ofa-general] ibsysstat cpu output is incomplete In-Reply-To: <20081022104851.GJ20450@sashak.voltaire.com> Message-ID: > On 11:13 Wed 22 Oct , Wen Hao Wang wrote: >> >> Can ibsysstat use RMPP by default to give complete message? > > Yes, I thought about this - by default or when message size is bigger > than regular payload size. > > Sasha OK. I will keep bug 1237 open on the OpenIB website and wait to see which version includes the update. Thanks Wen Hao Wang Email: wangwhao at cn.ibm.com From rdreier at cisco.com Wed Oct 22 21:37:31 2008 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 22 Oct 2008 21:37:31 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get the last batch of merges for the 2.6.28 window. As the dirstat shows, this is almost entirely the new mlx4_en driver, which is a 10G Ethernet driver for Mellanox ConnectX adapters: 1.2% drivers/infiniband/ 98.6% drivers/net/mlx4/ Jeff Garzik and I agreed that this new driver should go through my tree because it depended on changes to the existing mlx4_ib InfiniBand driver that I wanted to review and apply through my tree. Other than mlx4_en, there are just miscellaneous fixes of the usual type (low-level hardware drivers and IPoIB).
Hoang-Nam Nguyen (1): IB/ehca: Don't allow creating UC QP with SRQ Julien Brunel (1): RDMA/ucma: Test ucma_alloc_multicast() return against NULL, not with IS_ERR() Or Gerlitz (2): IPoIB: Clean up ethtool support IPoIB: Set netdev offload features properly for child (VLAN) interfaces Roland Dreier (4): IPoIB: Always initialize poll_timer to avoid crash on unload IB/mad: Use krealloc() to resize snoop table Update NetEffect maintainer emails to Intel emails Merge branches 'cma', 'cxgb3', 'ehca', 'ipoib', 'mad', 'mlx4' and 'nes' into for-next Stefan Roscher (2): IB/ehca: Fix reported max number of QPs and CQs in systems with >1 adapter IB/ehca: Reject dynamic memory add/remove when ehca adapter is present Steve Wise (1): RDMA/cxgb3: Remove cmid reference on tid allocation failures Yevgeny Petrilin (7): mlx4_core: Add QP range reservation support mlx4_core: Support multiple pre-reserved QP regions mlx4_core: Get ethernet MTU and default address from firmware mlx4_core: Ethernet MAC/VLAN management mlx4_core: Multiple port type support mlx4_en: Add driver for Mellanox ConnectX 10GbE NIC mlx4_core: Add Ethernet PCI device IDs MAINTAINERS | 4 +- drivers/infiniband/core/mad.c | 14 +- drivers/infiniband/core/ucma.c | 4 +- drivers/infiniband/hw/cxgb3/iwch_cm.c | 1 + drivers/infiniband/hw/ehca/ehca_classes.h | 2 + drivers/infiniband/hw/ehca/ehca_cq.c | 4 +- drivers/infiniband/hw/ehca/ehca_main.c | 83 ++- drivers/infiniband/hw/ehca/ehca_qp.c | 10 +- drivers/infiniband/hw/mlx4/mad.c | 6 +- drivers/infiniband/hw/mlx4/main.c | 11 +- drivers/infiniband/hw/mlx4/mlx4_ib.h | 1 + drivers/infiniband/hw/mlx4/qp.c | 21 +- drivers/infiniband/ulp/ipoib/ipoib.h | 1 + drivers/infiniband/ulp/ipoib/ipoib_ethtool.c | 9 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 7 +- drivers/infiniband/ulp/ipoib/ipoib_main.c | 67 +- drivers/infiniband/ulp/ipoib/ipoib_vlan.c | 4 + drivers/net/Kconfig | 9 + drivers/net/mlx4/Makefile | 7 +- drivers/net/mlx4/alloc.c | 97 +++- drivers/net/mlx4/cq.c | 2 +- 
drivers/net/mlx4/en_cq.c | 146 ++++ drivers/net/mlx4/en_main.c | 254 ++++++ drivers/net/mlx4/en_netdev.c | 1088 ++++++++++++++++++++++++++ drivers/net/mlx4/en_params.c | 480 ++++++++++++ drivers/net/mlx4/en_port.c | 261 ++++++ drivers/net/mlx4/en_port.h | 570 ++++++++++++++ drivers/net/mlx4/en_resources.c | 96 +++ drivers/net/mlx4/en_rx.c | 1080 +++++++++++++++++++++++++ drivers/net/mlx4/en_tx.c | 820 +++++++++++++++++++ drivers/net/mlx4/eq.c | 2 +- drivers/net/mlx4/fw.c | 20 +- drivers/net/mlx4/fw.h | 7 +- drivers/net/mlx4/main.c | 287 +++++++- drivers/net/mlx4/mcg.c | 4 +- drivers/net/mlx4/mlx4.h | 45 +- drivers/net/mlx4/mlx4_en.h | 561 +++++++++++++ drivers/net/mlx4/mr.c | 2 +- drivers/net/mlx4/pd.c | 4 +- drivers/net/mlx4/port.c | 282 +++++++ drivers/net/mlx4/qp.c | 81 ++- drivers/net/mlx4/srq.c | 2 +- include/linux/mlx4/cmd.h | 9 + include/linux/mlx4/device.h | 54 ++- 44 files changed, 6394 insertions(+), 125 deletions(-) create mode 100644 drivers/net/mlx4/en_cq.c create mode 100644 drivers/net/mlx4/en_main.c create mode 100644 drivers/net/mlx4/en_netdev.c create mode 100644 drivers/net/mlx4/en_params.c create mode 100644 drivers/net/mlx4/en_port.c create mode 100644 drivers/net/mlx4/en_port.h create mode 100644 drivers/net/mlx4/en_resources.c create mode 100644 drivers/net/mlx4/en_rx.c create mode 100644 drivers/net/mlx4/en_tx.c create mode 100644 drivers/net/mlx4/mlx4_en.h create mode 100644 drivers/net/mlx4/port.c From eli at dev.mellanox.co.il Thu Oct 23 03:09:46 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Thu, 23 Oct 2008 12:09:46 +0200 Subject: [ofa-general] Re: [PATCH] ib_core: Use weak ordering for data registered memory In-Reply-To: <200810221641.17172.arnd@arndb.de> References: <20081020141255.GA23619@mtls03> <20081022133838.GA1294@mtls03> <200810221641.17172.arnd@arndb.de> Message-ID: <20081023100946.GD3911@mtls03> On Wed, Oct 22, 2008 at 04:41:16PM +0200, Arnd Bergmann wrote: > > As a minor detail, I think we should make this module 
parameter > writable in order to allow switching the behaviour without > reloading the infiniband drivers. Although changing the value of the parameter at runtime should not cause any inconsistencies in the driver, it might affect other applications running so I personally prefer to have this parameter as read only and let the system administrator decide on that before loading the driver. > > > As mentioned before, I would personally also prefer to make > this attribute '1' by default instead of zero, but I trust > your judgement if you think the default should be '0'. > Again, I prefer not to change the default that used to be till now, so I'd rather have the default as strong ordering. From vlad at lists.openfabrics.org Thu Oct 23 03:19:44 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 23 Oct 2008 03:19:44 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20081023-0200 daily build status Message-ID: <20081023101945.35C1EE60DC2@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on 
x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From arnd at arndb.de Thu Oct 23 04:47:53 2008 From: arnd at arndb.de (Arnd Bergmann) Date: Thu, 23 Oct 2008 13:47:53 +0200 Subject: [ofa-general] Re: [PATCH] ib_core: Use weak ordering for data registered memory In-Reply-To: <20081023100946.GD3911@mtls03> References: <20081020141255.GA23619@mtls03> <200810221641.17172.arnd@arndb.de> <20081023100946.GD3911@mtls03> Message-ID: <200810231347.54268.arnd@arndb.de> On Thursday 23 October 2008, Eli Cohen wrote: > On Wed, Oct 22, 2008 at 04:41:16PM +0200, Arnd Bergmann wrote: > > > > As a minor detail, I think we should make this module parameter > > writable in order to allow switching the behaviour without > > reloading the infiniband drivers. > > Although changing the value of the parameter at runtime should not > cause any inconsistencies in the driver, it might affect other > applications running so I personally prefer to have this parameter as > read only and let the system administrator decide on that before > loading the driver. ok, that makes sense. 
> > As mentioned before, I would personally also prefer to make > > this attribute '1' by default instead of zero, but I trust > > your judgement if you think the default should be '0'. > > > > Again, I prefer not to change the default that used to be till now, so > I'd rather have the default as strong ordering. ok. Roland, please use the version that Eli posted yesterday. Arnd <>< From ogerlitz at voltaire.com Thu Oct 23 05:17:10 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 23 Oct 2008 14:17:10 +0200 Subject: [ofa-general] Re: [PATCH v2] IB/ipoib: fix netdev offload features support for child (VLAN) devices In-Reply-To: References: Message-ID: <49006B46.3060406@voltaire.com> Roland Dreier wrote: > thanks, applied. Hi Roland, Are you OK with pushing this to -stable as well? its not a regression, but for the case of distros using that kernel, having this patch would make the performance of ipoib virtual networks much better. Or. From vst at vlnb.net Thu Oct 23 05:10:18 2008 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Thu, 23 Oct 2008 16:10:18 +0400 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48FF91CA.1060603@harr.org> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <48FF64EE.5050102@vlnb.net> <48FF8FF0.2050308@mellanox.com> <48FF91CA.1060603@harr.org> Message-ID: <490069AA.7010000@vlnb.net> Cameron Harr wrote: >>>> Also, as I 
mentioned before, my time is going to be fairly limited >>>> for a while, but I'll try to squeeze this in and will make sure I >>>> run for longer periods of time. I'll also try to calculate an exact >>>> end number based on iop/runtime. >>>>> Which backstorage do you use for BLOCKIO? >>>> Fusion IO ioDrive. I generally use 1 or 2, but somtimes up to 4 at a >>>> time. >>> It might be a reason of not stable results. Can you try with NULLIO >>> to narrow things down a bit? > > I actually tested NULLIO a while back at Vu's suggestion, but I believe > I forgot to report on it. From my recollection, NULLIO gave me pretty > consistent block sizes of 512B, which is the BS I used in the benchmark. Were the IOPS, CS and IRQ rates results stable and consistent between runs? Anyway, try with the latest code. There are noticeable changes there. BTW, since you use the solid state media, you don't need any IO scheduler. Hence, noop should be the best choice for you. From olga.shern at gmail.com Thu Oct 23 05:34:32 2008 From: olga.shern at gmail.com (Olga Shern (Voltaire)) Date: Thu, 23 Oct 2008 14:34:32 +0200 Subject: ***SPAM*** Re: [ofa-general] OFED-1.4-rc3 is available In-Reply-To: <48FC4D88.3040702@dev.mellanox.co.il> References: <48FC4D88.3040702@dev.mellanox.co.il> Message-ID: > - 27 bugs fixed (see attached for details) Hi Vlad, I don't see the attached file. Olga From sashak at voltaire.com Thu Oct 23 05:44:06 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 23 Oct 2008 14:44:06 +0200 Subject: [ofa-general] invalid OpenSM version in OFED Message-ID: <20081023124406.GH28713@sashak.voltaire.com> Hi Vlad, I noticed that OFED...tgz has OpenSM (and I guess other management packages) packaged with invalid version. For example OFED-1.4-rc3 OpenSM package instead of the real daily version 3.2.2_20081019_ad24a3e has only 3.2.2 (which actually was almost 200 commits before). 
For this reason I cannot properly detect the installed OpenSM version when a binary package was used. It makes problem investigation and debugging unnecessarily difficult. Could we fix this ASAP? Sasha From philippe.gregoire at cea.fr Thu Oct 23 05:53:20 2008 From: philippe.gregoire at cea.fr (Philippe Gregoire) Date: Thu, 23 Oct 2008 14:53:20 +0200 Subject: [ofa-general] opensm as service - cfg files In-Reply-To: <48FF22FC.6000606@dev.mellanox.co.il> References: <48FF22FC.6000606@dev.mellanox.co.il> Message-ID: <490073C0.70109@cea.fr> Hi Yevgeny, Is it possible to write this service so that it can manage multiple instances of opensm on the same node? I mean starting and stopping all instances either at the same time or separately. This would be very useful when you have several InfiniBand storage devices connected directly to one node, so that you have to run several opensm -g guid processes on this node. It is permitted to have a service that understands parameters like: service start 0x8000010232 or service start ddn12.conf Philippe Gregoire CEA/DAM. Yevgeny Kliteynik wrote: > Hi Sasha, > > I was just trying to put some order in my head regarding > the use of opensm as service, and I have couple of questions. > Some of them might be dumb, so please bear with me... :) > > 1. OpenSM config file. > Do we still need opensm/scripts/opensm.conf? > I think it's not used any more. > > 2. From opensm/scripts/opensm.init.in: > @sbindir@/opensm -B $OPTIONS > /dev/null > Is someone setting the $OPTIONS variable? I think it was > set in the config file in the past, but not now. > > 3. From opensm/scripts/redhat-opensm.init.in: > CONFIG=@sysconfdir@/sysconfig/opensm.conf > if [ -f $CONFIG ]; then > . $CONFIG > fi > > From opensm/scripts/opensm.init.in: > if [[ -s /etc/sysconfig/opensm ]]; then > . /etc/sysconfig/opensm > fi > > If it's not some naming convention, perhaps we should use > opensm.conf in both cases? > > 4.
Logrotate: > opensm/scripts/opensm.spec.in installs logrotate file as follows: > install -m 644 scripts/opensm.logrotate $etc/logrotate.d/opensm > I may be off here, but should the installed file name be opensmd > to match the service name? > > -- Yevgeny From cameron at harr.org Thu Oct 23 06:34:37 2008 From: cameron at harr.org (Cameron Harr) Date: Thu, 23 Oct 2008 07:34:37 -0600 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <490069AA.7010000@vlnb.net> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <48FF64EE.5050102@vlnb.net> <48FF8FF0.2050308@mellanox.com> <48FF91CA.1060603@harr.org> <490069AA.7010000@vlnb.net> Message-ID: <49007D6D.1020100@harr.org> Vladislav Bolkhovitin wrote: > Cameron Harr wrote: >>>>> Also, as I mentioned before, my time is going to be fairly limited >>>>> for a while, but I'll try to squeeze this in and will make sure I >>>>> run for longer periods of time. I'll also try to calculate an >>>>> exact end number based on iop/runtime. >>>>>> Which backstorage do you use for BLOCKIO? >>>>> Fusion IO ioDrive. I generally use 1 or 2, but somtimes up to 4 at >>>>> a time.
>>>> It might be a reason of not stable results. Can you try with NULLIO >>>> to narrow things down a bit? >> >> I actually tested NULLIO a while back at Vu's suggestion, but I >> believe I forgot to report on it. From my recollection, NULLIO gave >> me pretty consistent block sizes of 512B, which is the BS I used in >> the benchmark. > > Were the IOPS, CS and IRQ rates results stable and consistent between > runs? I don't remember about the IOPs, but CS and IRQ rates were stable. However, they've also become stable in my later runs with scst_threads=[123]. > > Anyway, try with the latest code. There are noticeable changes there. > > BTW, since you use the solid state media, you don't need any IO > scheduler. Hence, noop should be the best choice for you. noop is what I've been running for the past while, before switching to deadline yesterday. From John.Marshall at ec.gc.ca Thu Oct 23 07:01:52 2008 From: John.Marshall at ec.gc.ca (John Marshall) Date: Thu, 23 Oct 2008 14:01:52 +0000 Subject: [ofa-general] OOM problem with ib_ipoib? In-Reply-To: <48FFA62D.3030305@ec.gc.ca> References: <48FF6DFA.9080409@ec.gc.ca> <48FFA62D.3030305@ec.gc.ca> Message-ID: <490083D0.5000807@ec.gc.ca> Is this the right list to be reporting this sort of issue? Thanks, John John Marshall wrote: > John Marshall wrote: >> Hi, >> >> Summary: I believe I have been having an OOM problem caused by the >> ib_ipoib module. I do not see the problem until it is >> loaded. The problem manifests itself when the kernel cache >> (grep Cached /proc/meminfo) containing file data is maxed >> out. Normally, the cached data should be written out and >> released by pdflush. In this case, it is not. >> >> Notes: >> 1) it is NOT necessary for the ib interfaces to actually >> be used or up! >> 2) I am using ofed 1.3.2 which I have built on my own >> machine. >> 3) I have similar weird behavior when using 1.4-rc3 >> and a 2.6.26 kernel. 
> An additional item: when rebuilt for the same 2.6.24 kernel > as mentioned below, but without BIGMEM, I do not encounter > the same problem. >> >> ---------- >> >> System info: >> >> root# lsmod | grep ib >> ib_ipoib 77512 0 >> ib_cm 33260 1 ib_ipoib >> ib_sa 36628 2 ib_ipoib,ib_cm >> ib_mthca 124832 0 >> ib_umad 16232 0 >> ib_uverbs 38792 0 >> ib_mad 35188 4 ib_cm,ib_sa,ib_mthca,ib_umad >> ib_core 54304 7 >> ib_ipoib,ib_cm,ib_sa,ib_mthca,ib_umad,ib_uverbs,ib_mad >> ipv6 242980 29 ib_ipoib >> libata 145584 1 ata_generic >> scsi_mod 142316 6 >> sg,sr_mod,usb_storage,sd_mod,megaraid_sas,libata >> >> root# uname -r >> 2.6.24-etchnhalf.1-686-bigmem >> >> root# cat /proc/cpuinfo >> processor : 0 >> vendor_id : AuthenticAMD >> cpu family : 15 >> model : 65 >> model name : Dual-Core AMD Opteron(tm) Processor 8220 >> stepping : 3 >> cpu MHz : 2793.163 >> cache size : 1024 KB >> physical id : 0 >> siblings : 2 >> core id : 0 >> cpu cores : 2 >> fdiv_bug : no >> hlt_bug : no >> f00f_bug : no >> coma_bug : no >> fpu : yes >> fpu_exception : yes >> cpuid level : 1 >> wp : yes >> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr >> pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext >> fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm >> extapic cr8_legacy ts fid vid ttp tm stc >> bogomips : 5589.70 >> clflush size : 64 >> >> ***** 7 more similar entries (2 cpu, 4-core each) **** >> >> root# cat /proc/meminfo >> cat /proc/meminfo >> MemTotal: 33274492 kB >> MemFree: 147716 kB >> Buffers: 840 kB >> Cached: 32532792 kB >> SwapCached: 0 kB >> Active: 19956 kB >> Inactive: 32524692 kB >> HighTotal: 32635808 kB >> HighFree: 77008 kB >> LowTotal: 638684 kB >> LowFree: 70708 kB >> SwapTotal: 16386260 kB >> SwapFree: 16386168 kB >> Dirty: 88 kB >> Writeback: 0 kB >> AnonPages: 11032 kB >> Mapped: 7940 kB >> Slab: 537012 kB >> SReclaimable: 487100 kB >> SUnreclaim: 49912 kB >> PageTables: 656 kB >> NFS_Unstable: 0 kB >> Bounce: 0 kB >> 
CommitLimit: 33023504 kB >> Committed_AS: 61360 kB >> VmallocTotal: 118776 kB >> VmallocUsed: 96800 kB >> VmallocChunk: 13112 kB >> HugePages_Total: 0 >> HugePages_Free: 0 >> HugePages_Rsvd: 0 >> HugePages_Surp: 0 >> Hugepagesize: 2048 kB >> >> # dpkg -l |grep ofed >> ii libibcm >> 1.0.2-1 ofed-1.3.2: libibcm >> ii libibcommon >> 1.0.8-1 ofed-1.3.2: libibcommon >> ii libibmad >> 1.1.6-1 ofed-1.3.2: libibmad >> ii libibumad >> 1.1.7-1 ofed-1.3.2: libibumad >> ii libibverbs >> 1.1.1-1 ofed-1.3.2: libibverbs >> ii libipathverbs >> 1.1-1 ofed-1.3.2: libipathverbs >> ii libmlx4 >> 1.0-1 ofed-1.3.2: libmlx >> ii libmthca >> 1.0.4-1 ofed-1.3.2: libmthca >> ii librdmacm >> 1.0.7-1 ofed-1.3.2: librdmacm >> ii libsdp >> 1.1.99-1 ofed-1.3.2: libsdp >> ii ofa-kernel >> 1.3.2-2.6.24-etchnhalf.1-686-bigmem-1 ofed-1.3.2: ofa_kernel >> >> ---------- >> >> How to provoke #1 (prior to loading ib_ipoib): >> >> non-root$ dd if=/dev/zero of=/tmp/50G bs=1M count=50000 >> >> root# modprobe ib_ipoib >> >> Output from dmesg: >> >> modprobe: page allocation failure. 
order:1, mode:0x20 >> Pid: 6839, comm: modprobe Not tainted 2.6.24-etchnhalf.1-686-bigmem #1 >> [] __alloc_pages+0x2c4/0x2d5 >> [] cache_alloc_refill+0x299/0x4b1 >> [] __kmalloc+0x75/0xbc >> [] __alloc_skb+0x49/0xf5 >> [] ipoib_cm_alloc_rx_skb+0x31/0x218 [ib_ipoib] >> [] ipoib_cm_dev_init+0x50c/0x552 [ib_ipoib] >> [] dma_pool_free+0xb0/0x18c >> [] ipoib_transport_dev_init+0xd2/0x3d1 [ib_ipoib] >> [] ipoib_ib_dev_init+0x2c/0x6e [ib_ipoib] >> [] ipoib_dev_init+0xab/0xd0 [ib_ipoib] >> [] ipoib_add_one+0x220/0x3cf [ib_ipoib] >> [] resched_task+0x52/0x54 >> [] ib_register_client+0x48/0x6c [ib_core] >> [] ipoib_init_module+0xd2/0xf8 [ib_ipoib] >> [] sys_init_module+0x15e3/0x16fb >> [] vma_prio_tree_insert+0x17/0x2a >> [] __kmalloc+0x0/0xbc >> [] syscall_call+0x7/0xb >> ======================= >> Mem-info: >> DMA per-cpu: >> CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: >> 1 usd: 0 >> CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: >> 1 usd: 0 >> CPU 2: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: >> 1 usd: 0 >> CPU 3: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: >> 1 usd: 0 >> CPU 4: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: >> 1 usd: 0 >> CPU 5: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: >> 1 usd: 0 >> CPU 6: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: >> 1 usd: 0 >> CPU 7: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: >> 1 usd: 0 >> Normal per-cpu: >> CPU 0: Hot: hi: 186, btch: 31 usd: 121 Cold: hi: 62, btch: >> 15 usd: 58 >> CPU 1: Hot: hi: 186, btch: 31 usd: 42 Cold: hi: 62, btch: >> 15 usd: 26 >> CPU 2: Hot: hi: 186, btch: 31 usd: 152 Cold: hi: 62, btch: >> 15 usd: 57 >> CPU 3: Hot: hi: 186, btch: 31 usd: 63 Cold: hi: 62, btch: >> 15 usd: 59 >> CPU 4: Hot: hi: 186, btch: 31 usd: 72 Cold: hi: 62, btch: >> 15 usd: 55 >> CPU 5: Hot: hi: 186, btch: 31 usd: 174 Cold: hi: 62, btch: >> 15 usd: 61 >> CPU 6: Hot: hi: 186, btch: 31 usd: 66 Cold: hi: 62, btch: >> 15 usd: 48 >> CPU 7: Hot: hi: 186, btch: 31 usd: 35 Cold: hi: 62, btch: >> 15 usd: 54 >> 
HighMem per-cpu: >> CPU 0: Hot: hi: 186, btch: 31 usd: 31 Cold: hi: 62, btch: >> 15 usd: 9 >> CPU 1: Hot: hi: 186, btch: 31 usd: 30 Cold: hi: 62, btch: >> 15 usd: 5 >> CPU 2: Hot: hi: 186, btch: 31 usd: 93 Cold: hi: 62, btch: >> 15 usd: 8 >> CPU 3: Hot: hi: 186, btch: 31 usd: 3 Cold: hi: 62, btch: >> 15 usd: 14 >> CPU 4: Hot: hi: 186, btch: 31 usd: 37 Cold: hi: 62, btch: >> 15 usd: 53 >> CPU 5: Hot: hi: 186, btch: 31 usd: 67 Cold: hi: 62, btch: >> 15 usd: 49 >> CPU 6: Hot: hi: 186, btch: 31 usd: 15 Cold: hi: 62, btch: >> 15 usd: 30 >> CPU 7: Hot: hi: 186, btch: 31 usd: 138 Cold: hi: 62, btch: >> 15 usd: 61 >> Active:5136 inactive:8135705 dirty:12 writeback:0 unstable:0 >> free:15715 slab:136280 mapped:2348 pagetables:164 bounce:0 >> DMA free:3524kB min:68kB low:84kB high:100kB active:0kB inactive:0kB >> present:16256kB pages_scanned:0 all_unreclaimable? yes >> lowmem_reserve[]: 0 873 34020 34020 >> Normal free:1368kB min:3744kB low:4680kB high:5616kB active:288kB >> inactive:252kB present:894080kB pages_scanned:32 all_unreclaimable? no >> lowmem_reserve[]: 0 0 265176 265176 >> HighMem free:59588kB min:512kB low:36080kB high:71652kB >> active:20256kB inactive:32541032kB present:33942528kB >> pages_scanned:32 all_unreclaimable? 
no >> lowmem_reserve[]: 0 0 0 0 >> DMA: 2*4kB 4*8kB 4*16kB 4*32kB 5*64kB 1*128kB 3*256kB 0*512kB >> 0*1024kB 1*2048kB 0*4096kB = 3496kB >> Normal: 0*4kB 0*8kB 1*16kB 0*32kB 1*64kB 1*128kB 0*256kB 0*512kB >> 1*1024kB 0*2048kB 0*4096kB = 1232kB >> HighMem: 34*4kB 23*8kB 28*16kB 2*32kB 4*64kB 1*128kB 4*256kB 3*512kB >> 2*1024kB 5*2048kB 11*4096kB = 61120kB >> Swap cache: add 27, delete 27, find 1/2, race 0+0 >> Free swap = 16386168kB >> Total swap = 16386260kB >> Free swap: 16386168kB >> 8781824 pages of RAM >> 8552448 pages of HIGHMEM >> 463201 reserved pages >> 8140201 pages shared >> 0 pages swap cached >> 12 pages dirty >> 0 pages writeback >> 2382 pages mapped >> 136255 pages slab >> 167 pages pagetables >> ib%d: failed to allocate receive buffer 144 >> >> ---------- >> >> How to provoke #2 (with ib_ipoib loaded): >> >> non-root$ dd if=/dev/zero of=/tmp/50G bs=1M count=50000 >> >> This results in an OOM triggering the OOM-killer which starts killing >> processes. >> >> ---------- >> >> Any help would be appreciated, as well as confirmation of the same >> sort of behavior. 
>> >> Thanks, >> John From tziporet at dev.mellanox.co.il Thu Oct 23 07:29:14 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 23 Oct 2008 16:29:14 +0200 Subject: [ofa-general] [GIT PULL] please pull infiniband.git In-Reply-To: References: Message-ID: <49008A3A.4090206@mellanox.co.il> Roland Dreier wrote: > This will get the last batch of merges for the 2.6.28 window. As the > dirstat shows, this is almost entirely the new mlx4_en driver, which > is a 10G Ethernet driver for Mellanox ConnectX adapters: > > 1.2% drivers/infiniband/ > What about XRC? I thought it was queued for 2.6.28? > 98.6% drivers/net/mlx4/ > > Jeff Garzik and I agreed that this new driver should go through my > tree because it depended on changes to the existing mlx4_ib InfiniBand > driver that I wanted to review and apply through my tree. > > Thanks for driving this to the kernel Tziporet From rdreier at cisco.com Thu Oct 23 07:29:55 2008 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 23 Oct 2008 07:29:55 -0700 Subject: [ofa-general] Re: [PATCH v2] IB/ipoib: fix netdev offload features support for child (VLAN) devices In-Reply-To: <49006B46.3060406@voltaire.com> (Or Gerlitz's message of "Thu, 23 Oct 2008 14:17:10 +0200") References: <49006B46.3060406@voltaire.com> Message-ID: > Are you OK with pushing this to -stable as well?
its not a regression, > but for the case of distros using that kernel, having this patch would > make the performance of ipoib virtual networks much better. No, I don't see how this meets any of the criteria for -stable. From vlad at mellanox.co.il Thu Oct 23 07:32:21 2008 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Thu, 23 Oct 2008 16:32:21 +0200 Subject: [ofa-general] OFED-1.4-rc3 is available In-Reply-To: References: <48FC4D88.3040702@dev.mellanox.co.il> Message-ID: <49008AF5.9060908@mellanox.co.il> Olga Shern (Voltaire) wrote: >> - 27 bugs fixed (see attached for details) > > Hi Vlad, > > I don't see the attached file. > > Olga Hi Olga, Here are the missing files: The list of the fixed bugs and ofed kernel changes between rc2 and rc3. Thanks, Regards, Vladimir -------------- next part -------------- A non-text attachment was scrubbed... Name: ofed-1.4-rc3-fixed-bugs.csv Type: text/csv Size: 3116 bytes Desc: not available URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: ofed_kernel-1.4-rc2_rc3.log URL: From rdreier at cisco.com Thu Oct 23 07:31:54 2008 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 23 Oct 2008 07:31:54 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git In-Reply-To: <49008A3A.4090206@mellanox.co.il> (Tziporet Koren's message of "Thu, 23 Oct 2008 16:29:14 +0200") References: <49008A3A.4090206@mellanox.co.il> Message-ID: > What about XRC? I thought it was queued for 2.6.28? 
No, we never got through the review of everything, and there are still bugs to fix too (the issue with a process releasing an XRCD with QPs/SRQs still attached that Jack has a hacky fix for, but which the right fix is still not implemented) From tziporet at dev.mellanox.co.il Thu Oct 23 07:49:08 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 23 Oct 2008 16:49:08 +0200 Subject: [ofa-general] [GIT PULL] please pull infiniband.git In-Reply-To: References: <49008A3A.4090206@mellanox.co.il> Message-ID: <49008EE4.3000200@mellanox.co.il> Roland Dreier wrote: > > What about XRC? I thought it was queued for 2.6.28? > > No, we never got through the review of everything, and there are still > bugs to fix too (the issue with a process releasing an XRCD with > QPs/SRQs still attached that Jack has a hacky fix for, but which the > right fix is still not implemented) > > So, what should we do so that it will be part of 2.6.29? If you can send us everything that must be fixed, we can make it (at last). Tziporet From amar.mudrankit at qlogic.com Thu Oct 23 07:50:53 2008 From: amar.mudrankit at qlogic.com (Amar Mudrankit) Date: Thu, 23 Oct 2008 20:20:53 +0530 Subject: [ofa-general] ***SPAM*** NFS-RDMA compilation problem Message-ID: While I was trying to install OFED-1.4-rc3 on SLES 10 SP 2 with NFS-RDMA selected for installation, I got the following error message: nfs-utils-1.1.1 rpm is required to build kernel-ib I downloaded and successfully installed the nfs-utils-1.1.4 **source .tgz** from http://www.kernel.org/pub/linux/utils/nfs, but I still hit the same error message. I was not able to find an nfs-utils rpm that would install on SLES 10 SP 2. Can anybody please point me to the location of the rpm? Why is the OFED installation unable to detect the latest installation of nfs-utils compiled from source, and why does it depend entirely on the rpm installation?
Regards, Amar From ogerlitz at voltaire.com Thu Oct 23 08:13:03 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 23 Oct 2008 17:13:03 +0200 Subject: [ofa-general] Re: [PATCH v2] IB/ipoib: fix netdev offload features support for child (VLAN) devices In-Reply-To: References: <49006B46.3060406@voltaire.com> Message-ID: <4900947F.8050403@voltaire.com> Roland Dreier wrote: > > Are you OK with pushing this to -stable as well? its not a regression, > > but for the case of distros using that kernel, having this patch would > > make the performance of ipoib virtual networks much better. > > No, I don't see how this meets any of the criteria for -stable. > got it. So I would need to push that directly to the distros... Or. From ogerlitz at voltaire.com Thu Oct 23 08:14:34 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 23 Oct 2008 17:14:34 +0200 (IST) Subject: [ofa-general] fwd: [PATCH] IPoIB: Set netdev offload features properly for child (VLAN) interfaces Message-ID: Hi, this patch may be of much use for distros using pre 2.6.28 kernels, in case it meets the next version of yours, please consider for inclusion. Or. commit 83bb63f62bda28be88b21216fbb59838a10f2348 Author: Or Gerlitz Date: Wed Oct 22 15:49:49 2008 -0700 IPoIB: Set netdev offload features properly for child (VLAN) interfaces Child devices were created without any offload features set, fix this by moving the code that computes the features into generic function which is now called through non-child and child device creation. 
Signed-off-by: Or Gerlitz
--
v1 has a bug where the 'result' flag in ipoib_vlan_add may be used uninitialized

Signed-off-by: Roland Dreier

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 68ba5c3..e0c7dfa 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -507,6 +507,7 @@ int ipoib_pkey_dev_delay_open(struct net_device *dev);
 void ipoib_drain_cq(struct net_device *dev);
 void ipoib_set_ethtool_ops(struct net_device *dev);
+int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca);
 #ifdef CONFIG_INFINIBAND_IPOIB_CM
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index c0ee514..fddded7 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1173,11 +1173,48 @@ int ipoib_add_pkey_attr(struct net_device *dev)
 	return device_create_file(&dev->dev, &dev_attr_pkey);
 }
+int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca)
+{
+	struct ib_device_attr *device_attr;
+	int result = -ENOMEM;
+
+	device_attr = kmalloc(sizeof *device_attr, GFP_KERNEL);
+	if (!device_attr) {
+		printk(KERN_WARNING "%s: allocation of %zu bytes failed\n",
+		       hca->name, sizeof *device_attr);
+		return result;
+	}
+
+	result = ib_query_device(hca, device_attr);
+	if (result) {
+		printk(KERN_WARNING "%s: ib_query_device failed (ret = %d)\n",
+		       hca->name, result);
+		kfree(device_attr);
+		return result;
+	}
+	priv->hca_caps = device_attr->device_cap_flags;
+
+	kfree(device_attr);
+
+	if (priv->hca_caps & IB_DEVICE_UD_IP_CSUM) {
+		set_bit(IPOIB_FLAG_CSUM, &priv->flags);
+		priv->dev->features |= NETIF_F_SG | NETIF_F_IP_CSUM;
+	}
+
+	if (lro)
+		priv->dev->features |= NETIF_F_LRO;
+
+	if (priv->dev->features & NETIF_F_SG && priv->hca_caps & IB_DEVICE_UD_TSO)
+		priv->dev->features |= NETIF_F_TSO;
+
+	return 0;
+}
+
+
 static struct net_device *ipoib_add_port(const char *format,
 					 struct ib_device *hca, u8 port)
 {
 	struct ipoib_dev_priv *priv;
-	struct ib_device_attr *device_attr;
 	struct ib_port_attr attr;
 	int result = -ENOMEM;
@@ -1206,31 +1243,8 @@ static struct net_device *ipoib_add_port(const char *format,
 		goto device_init_failed;
 	}
-	device_attr = kmalloc(sizeof *device_attr, GFP_KERNEL);
-	if (!device_attr) {
-		printk(KERN_WARNING "%s: allocation of %zu bytes failed\n",
-		       hca->name, sizeof *device_attr);
+	if (ipoib_set_dev_features(priv, hca))
 		goto device_init_failed;
-	}
-
-	result = ib_query_device(hca, device_attr);
-	if (result) {
-		printk(KERN_WARNING "%s: ib_query_device failed (ret = %d)\n",
-		       hca->name, result);
-		kfree(device_attr);
-		goto device_init_failed;
-	}
-	priv->hca_caps = device_attr->device_cap_flags;
-
-	kfree(device_attr);
-
-	if (priv->hca_caps & IB_DEVICE_UD_IP_CSUM) {
-		set_bit(IPOIB_FLAG_CSUM, &priv->flags);
-		priv->dev->features |= NETIF_F_SG | NETIF_F_IP_CSUM;
-	}
-
-	if (lro)
-		priv->dev->features |= NETIF_F_LRO;
 	/*
 	 * Set the full membership bit, so that we join the right
@@ -1266,9 +1280,6 @@ static struct net_device *ipoib_add_port(const char *format,
 		goto event_failed;
 	}
-	if (priv->dev->features & NETIF_F_SG && priv->hca_caps & IB_DEVICE_UD_TSO)
-		priv->dev->features |= NETIF_F_TSO;
-
 	result = register_netdev(priv->dev);
 	if (result) {
 		printk(KERN_WARNING "%s: couldn't register ipoib port %d; error %d\n",
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
index b08eb56..2cf1a40 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
@@ -93,6 +93,10 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey)
 	priv->mcast_mtu = priv->admin_mtu = priv->dev->mtu;
 	set_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags);
+	result = ipoib_set_dev_features(priv, ppriv->ca);
+	if (result)
+		goto device_init_failed;
+
 	priv->pkey = pkey;
 	memcpy(priv->dev->dev_addr, ppriv->dev->dev_addr, INFINIBAND_ALEN);

From Thomas.Talpey
at netapp.com Thu Oct 23 08:15:37 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Thu, 23 Oct 2008 11:15:37 -0400 Subject: [ofa-general] NFS-RDMA compilation problem In-Reply-To: References: Message-ID: At 01:32 AM 10/23/2008, Amar Mudrankit wrote: >While I was trying to install OFED-1.4-rc3 over SLES 10 SP 2 with >NFS-RDMA selected for installation, I got the following error message: > >nfs-utils-1.1.1 rpm is required to build kernel-ib Do you need to run on SLES10? Because NFS/RDMA is part of mainline kernel.org since 2.6.24. A more recent SuSE kernel, or any distro updated since (roughly) early this year should support NFS/RDMA out of the box. BTW, there are some significant updates in the coming 2.6.28 rc's, many server improvements and the client now supports Fastreg (FRMR). These changes are queued in Linus' tree currently. I'll let Jeff Becker speak to the OFED packaging, which is what your dependency error seems to be stemming from. Tom. > >I have downloaded and installed successfully, the nfs-utils-1.1.4 >**source .tgz** from http://www.kernel.org/pub/linux/utils/nfs, >still I was hit with the same error message. > >I was not able to find out nfs-utils rpm that would install over SLES >10 SP 2. Can anybody please point me to the location of rpm? Why is >OFED installation unable to detect the latest installation of nfs >utils compiled from source and is fully dependent upon the rpm >installation? 
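The error quoted above asks for nfs-utils 1.1.1, while the poster had built 1.1.4 from source. A check that looks at a version string, rather than at the RPM database, reduces to a dotted-version comparison. What follows is only a minimal sketch in C under that assumption, not the OFED installer's actual logic: version_cmp and satisfies_minimum are hypothetical helper names, and how the installed version string would be obtained is deliberately left out.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of OFED or nfs-utils): compare two dotted
 * numeric version strings such as "1.1.4" and "1.1.1".
 * Returns a negative value, zero, or a positive value, like strcmp().
 * A missing trailing component counts as 0, so "1.1" equals "1.1.0".
 */
int version_cmp(const char *a, const char *b)
{
	assert(a != NULL && b != NULL);

	while (*a != '\0' || *b != '\0') {
		char *end_a;
		char *end_b;
		long na = strtol(a, &end_a, 10);
		long nb = strtol(b, &end_b, 10);

		if (na != nb)
			return (na < nb) ? -1 : 1;

		/* Neither string advanced: only non-numeric text is left. */
		if (end_a == a && end_b == b)
			break;

		/* Step over the '.' separator, if there is one. */
		a = (*end_a == '.') ? end_a + 1 : end_a;
		b = (*end_b == '.') ? end_b + 1 : end_b;
	}
	return 0;
}

/* Hypothetical helper: nonzero if 'installed' is at least 'required'. */
int satisfies_minimum(const char *installed, const char *required)
{
	return version_cmp(installed, required) >= 0;
}
```

With the versions from this thread, satisfies_minimum("1.1.4", "1.1.1") is nonzero, so a check of this kind would accept the source build that the RPM lookup rejects.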
From Jeffrey.C.Becker at nasa.gov Thu Oct 23 10:20:16 2008 From: Jeffrey.C.Becker at nasa.gov (Jeff Becker) Date: Thu, 23 Oct 2008 10:20:16 -0700 Subject: [ofa-general] NFS-RDMA compilation problem In-Reply-To: References: Message-ID: <4900B250.4090909@nasa.gov> Talpey, Thomas wrote: > At 01:32 AM 10/23/2008, Amar Mudrankit wrote: > >> While I was trying to install OFED-1.4-rc3 over SLES 10 SP 2 with >> NFS-RDMA selected for installation, I got the following error message: >> >> nfs-utils-1.1.1 rpm is required to build kernel-ib >> > > Do you need to run on SLES10? Because NFS/RDMA is part of mainline > kernel.org since 2.6.24. A more recent SuSE kernel, or any distro > updated since (roughly) early this year should support NFS/RDMA out > of the box. > > BTW, there are some significant updates in the coming 2.6.28 rc's, > many server improvements and the client now supports Fastreg (FRMR). > These changes are queued in Linus' tree currently. > > I'll let Jeff Becker speak to the OFED packaging, which is what your > dependency error seems to be stemming from. > The reason for the error message is that the dependency check is looking for an RPM with the right version, irrespective of the version you installed (which RPM doesn't know about). I plan to fix the dependency check to look at the version of the installed nfs-utils commands. -jeff > Tom. > > >> I have downloaded and installed successfully, the nfs-utils-1.1.4 >> **source .tgz** from http://www.kernel.org/pub/linux/utils/nfs, >> still I was hit with the same error message. >> >> I was not able to find out nfs-utils rpm that would install over SLES >> 10 SP 2. Can anybody please point me to the location of rpm? Why is >> OFED installation unable to detect the latest installation of nfs >> utils compiled from source and is fully dependent upon the rpm >> installation? 
>> > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From vst at vlnb.net Thu Oct 23 10:55:47 2008 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Thu, 23 Oct 2008 21:55:47 +0400 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <49007D6D.1020100@harr.org> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <48FF64EE.5050102@vlnb.net> <48FF8FF0.2050308@mellanox.com> <48FF91CA.1060603@harr.org> <490069AA.7010000@vlnb.net> <49007D6D.1020100@harr.org> Message-ID: <4900BAA3.4070601@vlnb.net> Cameron Harr wrote: > Vladislav Bolkhovitin wrote: >> Cameron Harr wrote: >>>>>> Also, as I mentioned before, my time is going to be fairly limited >>>>>> for a while, but I'll try to squeeze this in and will make sure I >>>>>> run for longer periods of time. I'll also try to calculate an >>>>>> exact end number based on iop/runtime. >>>>>>> Which backstorage do you use for BLOCKIO? >>>>>> Fusion IO ioDrive. I generally use 1 or 2, but somtimes up to 4 at >>>>>> a time. >>>>> It might be a reason of not stable results. Can you try with NULLIO >>>>> to narrow things down a bit? 
>>> I actually tested NULLIO a while back at Vu's suggestion, but I >>> believe I forgot to report on it. From my recollection, NULLIO gave >>> me pretty consistent block sizes of 512B, which is the BS I used in >>> the benchmark. >> Were the IOPS, CS and IRQ rates results stable and consistent between >> runs? > > I don't remember about the IOPs, but CS and IRQ rates were stable. > However, they've also become stable in my later runs with > scst_threads=[123]. But with srpt_thread=0 - not, right? >> Anyway, try with the latest code. There are noticeable changes there. >> >> BTW, since you use the solid state media, you don't need any IO >> scheduler. Hence, noop should be the best choice for you. > noop is what I've been running for the past while, before switching to > deadline yesterday. > From linux at celticblues.com Thu Oct 23 11:23:19 2008 From: linux at celticblues.com (linux at celticblues.com) Date: Thu, 23 Oct 2008 12:23:19 -0600 Subject: [ofa-general] Where to get started? Message-ID: <20081023122319.yyhzkbtw0840g4sg@celticblues.com> I am new, completely new, to the whole OpenFabrics thing. Can someone point me to some reading material on all this... Something explaining what is OpenSM, what is OpenIB, etc. How does all this stuff work together... What is necessary to get a linux system up and running. etc. I visited the OpenFabrics website, but did not find anything like this. Ed From sashak at voltaire.com Thu Oct 23 11:30:49 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 23 Oct 2008 20:30:49 +0200 Subject: [ofa-general] [PATCH] opensm/opens.spec: add -D option for logrotate file install command Message-ID: <20081023183049.GD25831@sashak.voltaire.com> This addresses bug #1294. 'install' doesn't create paths automatically on Ubuntu. 
Signed-off-by: Sasha Khapyorsky
---
 opensm/opensm.spec.in |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in
index 325e35c..f8cecf1 100644
--- a/opensm/opensm.spec.in
+++ b/opensm/opensm.spec.in
@@ -96,7 +96,7 @@ else
 fi
 mkdir -p $etc/{init.d,logrotate.d} $etc/@OPENSM_CONFIG_SUB_DIR@
 install -m 755 scripts/${REDHAT}opensm.init $etc/init.d/opensmd
-install -m 644 scripts/opensm.logrotate $etc/logrotate.d/opensm
+install -D -m 644 scripts/opensm.logrotate $etc/logrotate.d/opensm
 install -m 755 scripts/sldd.sh $RPM_BUILD_ROOT%{_sbindir}/sldd.sh
 %clean
--
1.6.0.1.196.g01914

From chu11 at llnl.gov Thu Oct 23 11:32:12 2008
From: chu11 at llnl.gov (Al Chu)
Date: Thu, 23 Oct 2008 11:32:12 -0700
Subject: [ofa-general] opensm as service - cfg files
In-Reply-To: <490073C0.70109@cea.fr>
References: <48FF22FC.6000606@dev.mellanox.co.il> <490073C0.70109@cea.fr>
Message-ID: <1224786733.1197.398.camel@cardanus.llnl.gov>

On Thu, 2008-10-23 at 14:53 +0200, Philippe Gregoire wrote:
> Hi Yevgeny,
>
> Is it possible to write this service so it will be able to manage multiple instances of opensm on the same node, I mean start and stop all instances at the same time or separately.
> This will be very useful when you have several Infiniband storage devices connected directly to one node,
> so you have to run several opensm -g guid processes on this node.
>
> It is authorized to have a service that understands parameters like:
> service start 0x8000010232
> or
> service start ddn12.conf

This doesn't sound like that bad of an idea, although "what does the user
expect" is a concern. My co-worker brought up the simple issue of the log
files. Do you automatically pick a different log file to store to, or does
it store to the same log, or is it the user's responsibility to pick a
reasonable different log file name in the .conf file? I have no idea what
other daemons/init scripts do.

Al

> Philippe Gregoire
> CEA/DAM.
> > Yevgeny Kliteynik a écrit : > > Hi Sasha, > > > > I was just trying to put some order in my head regarding > > the use of opensm as service, and I have couple of questions. > > Some of them might be dumb, so please bear with me... :) > > > > 1. OpenSM config file. > > Do we still need opensm/scripts/opensm.conf? > > I think it's not used any more. > > > > 2. From opensm/scripts/opensm.init.in: > > @sbindir@/opensm -B $OPTIONS > /dev/null > > Is someone setting the $OPTIONS variable? I think it was > > set in the config file in the past, but not now. > > > > 3. From opensm/scripts/redhat-opensm.init.in: > > CONFIG=@sysconfdir@/sysconfig/opensm.conf > > if [ -f $CONFIG ]; then > > . $CONFIG > > fi > > > > From opensm/scripts/opensm.init.in: > > if [[ -s /etc/sysconfig/opensm ]]; then > > . /etc/sysconfig/opensm > > fi > > > > If it's not some naming convention, perhaps we should use > > opensm.conf in both cases? > > > > 4. Logrotate: > > opensm/scripts/opensm.spec.in installs logrotate file as follows: > > install -m 644 scripts/opensm.logrotate $etc/logrotate.d/opensm > > I may be off here, but should the installed file name be opensmd > > to match the service name? 
> > > > -- Yevgeny > > > -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From dledford at redhat.com Thu Oct 23 11:58:01 2008 From: dledford at redhat.com (Doug Ledford) Date: Thu, 23 Oct 2008 14:58:01 -0400 Subject: [ofa-general] Re: [PATCH] opensm/scripts: handling opensm config file In-Reply-To: <1224694469.1197.367.camel@cardanus.llnl.gov> References: <48FF3F70.8000905@dev.mellanox.co.il> <20081022152246.GU20450@sashak.voltaire.com> <48FF51EE.4020804@dev.mellanox.co.il> <20081022162416.GY20450@sashak.voltaire.com> <1224694469.1197.367.camel@cardanus.llnl.gov> Message-ID: <1224788281.8879.100.camel@firewall.xsintricity.com> On Wed, 2008-10-22 at 09:54 -0700, Al Chu wrote: > You may wish to ping Redhat/Suse for what they think (b/c if it doesn't > meet their requirements, they will just add it back), but I don't think > it should be removed. Well, I don't use the OFED scripts anyway. They aren't LSB compliant in so many ways it's not worth discussing. Plus they do things that that should not be done in a production environment, or things that should be handled via other scripts. So, it's of little importance to me. However, if you guys have moved the opensm stuff to /etc/opensm just to have a single opensm.conf file in there, then I have to wonder why bother with an /etc/opensm directory? Are there other files in there by default now?
You don't save any /etc/ directory namespace pollution if you create a subdirectory for a single file. Oh well, that doesn't matter too much to me either. All the packages I maintain use /etc/ofed for rhel4 and rhel5, and will use /etc/rdma for fedora and rhel6 and later. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part URL: From sashak at voltaire.com Thu Oct 23 12:34:50 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 23 Oct 2008 21:34:50 +0200 Subject: [ofa-general] Re: [PATCH] opensm/scripts: handling opensm config file In-Reply-To: <1224788281.8879.100.camel@firewall.xsintricity.com> References: <48FF3F70.8000905@dev.mellanox.co.il> <20081022152246.GU20450@sashak.voltaire.com> <48FF51EE.4020804@dev.mellanox.co.il> <20081022162416.GY20450@sashak.voltaire.com> <1224694469.1197.367.camel@cardanus.llnl.gov> <1224788281.8879.100.camel@firewall.xsintricity.com> Message-ID: <20081023193450.GF25831@sashak.voltaire.com> Hi Doug, On 14:58 Thu 23 Oct , Doug Ledford wrote: > > Well, I don't use the OFED scripts anyway. They aren't LSB compliant in > so many ways it's not worth discussing. Plus they do things that that > should not be done in a production environment, or things that should be > handled via other scripts. So, it's of little importance to me. Why to not help us to make it in a proper way? We are discussing this right now. > However, if you guys have moved the opensm stuff to /etc/opensm just to > have a single opensm.conf file in there, then I have to wonder why > bother with an /etc/opensm directory? It is not just a single file. 
In addition to opensm.conf, OpenSM by default will look in this directory
(directory name is configurable btw) for partition, route prefixes
configuration and QoS policy files.

> Are there other files in there by
> default now?

There is nothing by default, but the user may put files there.

> You don't save any /etc/ directory namespace pollution if
> you create a subdirectory for a single file. Oh well, that doesn't
> matter too much to me either. All the packages I maintain use /etc/ofed
> for rhel4 and rhel5, and will use /etc/rdma for fedora and rhel6 and
> later.

The directory name can be configured if somebody cares. However, OpenSM
itself does not require OFED or RDMA to be installed; we are fine running
in stand-alone mode. So '/etc/opensm' (or actually ${sysconfdir}/opensm)
looks fine to me.

Sasha

From ralph.campbell at qlogic.com Thu Oct 23 12:50:01 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 23 Oct 2008 12:50:01 -0700
Subject: [ofa-general] [PATCH 0/3] IB/ipath -- fixes for 2.6.28
Message-ID: <20081023195001.10020.96260.stgit@eng-46.mv.qlogic.com>

Following this message are three recent fixes for the QLogic IB driver:

IB/ipath - fix the length returned in loopback UD completion queue entries
IB/ipath - fix RDMA write with immediate copy of last packet
IB/ipath - improve UD loopback performance by allocating temp array once

From ralph.campbell at qlogic.com Thu Oct 23 12:50:07 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 23 Oct 2008 12:50:07 -0700
Subject: [ofa-general] [PATCH 1/3] IB/ipath - fix the length returned in loopback UD completion queue entries
In-Reply-To: <20081023195001.10020.96260.stgit@eng-46.mv.qlogic.com>
References: <20081023195001.10020.96260.stgit@eng-46.mv.qlogic.com>
Message-ID: <20081023195006.10020.16845.stgit@eng-46.mv.qlogic.com>

UD packets sent to the local IB port (loopback) have a zero length
reported in the send work request completion entry.
This fixes it by using a copy of the WQE to copy the data.

Signed-off-by: Ralph Campbell
---
 drivers/infiniband/hw/ipath/ipath_ud.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_ud.c b/drivers/infiniband/hw/ipath/ipath_ud.c
index 729446f..136dc4c 100644
--- a/drivers/infiniband/hw/ipath/ipath_ud.c
+++ b/drivers/infiniband/hw/ipath/ipath_ud.c
@@ -54,6 +54,7 @@ static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_swqe *swqe)
 	unsigned long flags;
 	struct ipath_rq *rq;
 	struct ipath_srq *srq;
+	struct ipath_sge_state ssge;
 	struct ipath_sge_state rsge;
 	struct ipath_sge *sge;
 	struct ipath_rwq *wq;
@@ -196,7 +197,10 @@ static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_swqe *swqe)
 		wc.wc_flags |= IB_WC_GRH;
 	} else
 		ipath_skip_sge(&rsge, sizeof(struct ib_grh));
-	sge = swqe->sg_list;
+	ssge.sg_list = swqe->sg_list + 1;
+	ssge.sge = *swqe->sg_list;
+	ssge.num_sge = swqe->wr.num_sge;
+	sge = &ssge.sge;
 	while (length) {
 		u32 len = sge->length;
@@ -210,8 +214,8 @@ static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_swqe *swqe)
 		sge->length -= len;
 		sge->sge_length -= len;
 		if (sge->sge_length == 0) {
-			if (--swqe->wr.num_sge)
-				sge++;
+			if (--ssge.num_sge)
+				*sge = *ssge.sg_list++;
 		} else if (sge->length == 0 && sge->mr != NULL) {
 			if (++sge->n >= IPATH_SEGSZ) {
 				if (++sge->m >= sge->mr->mapsz)

From ralph.campbell at qlogic.com Thu Oct 23 12:50:12 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 23 Oct 2008 12:50:12 -0700
Subject: [ofa-general] [PATCH 2/3] IB/ipath - fix RDMA write with immediate copy of last packet
In-Reply-To: <20081023195001.10020.96260.stgit@eng-46.mv.qlogic.com>
References: <20081023195001.10020.96260.stgit@eng-46.mv.qlogic.com>
Message-ID: <20081023195012.10020.18967.stgit@eng-46.mv.qlogic.com>

When the last packet of a RDMA write with immediate is received,
the next receive work queue entry ID should be used to generate
a completion entry.
The code was incorrectly resetting part of the state used to copy the
last packet.

Signed-off-by: Ralph Campbell
---
 drivers/infiniband/hw/ipath/ipath_ruc.c |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_ruc.c b/drivers/infiniband/hw/ipath/ipath_ruc.c
index fc0f6d9..2296832 100644
--- a/drivers/infiniband/hw/ipath/ipath_ruc.c
+++ b/drivers/infiniband/hw/ipath/ipath_ruc.c
@@ -156,7 +156,7 @@ bail:
 /**
  * ipath_get_rwqe - copy the next RWQE into the QP's RWQE
  * @qp: the QP
- * @wr_id_only: update wr_id only, not SGEs
+ * @wr_id_only: update qp->r_wr_id only, not qp->r_sge
  *
  * Return 0 if no RWQE is available, otherwise return 1.
  *
@@ -173,8 +173,6 @@ int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only)
 	u32 tail;
 	int ret;
-	qp->r_sge.sg_list = qp->r_sg_list;
-
 	if (qp->ibqp.srq) {
 		srq = to_isrq(qp->ibqp.srq);
 		handler = srq->ibsrq.event_handler;
@@ -206,8 +204,10 @@ int ipath_get_rwqe(struct ipath_qp *qp, int wr_id_only)
 		wqe = get_rwqe_ptr(rq, tail);
 		if (++tail >= rq->size)
 			tail = 0;
-	} while (!wr_id_only && !ipath_init_sge(qp, wqe, &qp->r_len,
-						 &qp->r_sge));
+		if (wr_id_only)
+			break;
+		qp->r_sge.sg_list = qp->r_sg_list;
+	} while (!ipath_init_sge(qp, wqe, &qp->r_len, &qp->r_sge));
 	qp->r_wr_id = wqe->wr_id;
 	wq->tail = tail;

From ralph.campbell at qlogic.com Thu Oct 23 12:50:17 2008
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 23 Oct 2008 12:50:17 -0700
Subject: [ofa-general] [PATCH 3/3] IB/ipath - improve UD loopback performance by allocating temp array once
In-Reply-To: <20081023195001.10020.96260.stgit@eng-46.mv.qlogic.com>
References: <20081023195001.10020.96260.stgit@eng-46.mv.qlogic.com>
Message-ID: <20081023195017.10020.33878.stgit@eng-46.mv.qlogic.com>

Receive work queue entries are checked for LKey validity, and pointers
to the memory region structure are saved in an allocated structure.
For UD loopback packets, this structure is allocated and freed for
each packet.
This patch changes that to allocate/free during QP creation and
destruction.

Signed-off-by: Ralph Campbell
---
 drivers/infiniband/hw/ipath/ipath_qp.c    |   32 ++++++++++++++++++++++-------
 drivers/infiniband/hw/ipath/ipath_ud.c    |   19 +----------------
 drivers/infiniband/hw/ipath/ipath_verbs.h |    1 +
 3 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c
index 4715911..3a5a89b 100644
--- a/drivers/infiniband/hw/ipath/ipath_qp.c
+++ b/drivers/infiniband/hw/ipath/ipath_qp.c
@@ -745,6 +745,7 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd,
 	struct ipath_swqe *swq = NULL;
 	struct ipath_ibdev *dev;
 	size_t sz;
+	size_t sg_list_sz;
 	struct ib_qp *ret;
 	if (init_attr->create_flags) {
@@ -789,19 +790,31 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd,
 			goto bail;
 		}
 		sz = sizeof(*qp);
+		sg_list_sz = 0;
 		if (init_attr->srq) {
 			struct ipath_srq *srq = to_isrq(init_attr->srq);
-			sz += sizeof(*qp->r_sg_list) *
-				srq->rq.max_sge;
-		} else
-			sz += sizeof(*qp->r_sg_list) *
-				init_attr->cap.max_recv_sge;
-		qp = kmalloc(sz, GFP_KERNEL);
+			if (srq->rq.max_sge > 1)
+				sg_list_sz = sizeof(*qp->r_sg_list) *
+					(srq->rq.max_sge - 1);
+		} else if (init_attr->cap.max_recv_sge > 1)
+			sg_list_sz = sizeof(*qp->r_sg_list) *
+				(init_attr->cap.max_recv_sge - 1);
+		qp = kmalloc(sz + sg_list_sz, GFP_KERNEL);
 		if (!qp) {
 			ret = ERR_PTR(-ENOMEM);
 			goto bail_swq;
 		}
+		if (sg_list_sz && (init_attr->qp_type == IB_QPT_UD ||
+		    init_attr->qp_type == IB_QPT_SMI ||
+		    init_attr->qp_type == IB_QPT_GSI)) {
+			qp->r_ud_sg_list = kmalloc(sg_list_sz, GFP_KERNEL);
+			if (!qp->r_ud_sg_list) {
+				ret = ERR_PTR(-ENOMEM);
+				goto bail_qp;
+			}
+		} else
+			qp->r_ud_sg_list = NULL;
 		if (init_attr->srq) {
 			sz = 0;
 			qp->r_rq.size = 0;
@@ -818,7 +831,7 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd,
 					      qp->r_rq.size * sz);
 			if (!qp->r_rq.wq) {
 				ret = ERR_PTR(-ENOMEM);
-				goto bail_qp;
+				goto bail_sg_list;
 			}
 		}
@@ -848,7 +861,7 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd,
 		if (err) {
 			ret = ERR_PTR(err);
 			vfree(qp->r_rq.wq);
-			goto bail_qp;
+			goto bail_sg_list;
 		}
 		qp->ip = NULL;
 		qp->s_tx = NULL;
@@ -925,6 +938,8 @@ bail_ip:
 		vfree(qp->r_rq.wq);
 	ipath_free_qp(&dev->qp_table, qp);
 	free_qpn(&dev->qp_table, qp->ibqp.qp_num);
+bail_sg_list:
+	kfree(qp->r_ud_sg_list);
 bail_qp:
 	kfree(qp);
 bail_swq:
@@ -989,6 +1004,7 @@ int ipath_destroy_qp(struct ib_qp *ibqp)
 		kref_put(&qp->ip->ref, ipath_release_mmap_info);
 	else
 		vfree(qp->r_rq.wq);
+	kfree(qp->r_ud_sg_list);
 	vfree(qp->s_wq);
 	kfree(qp);
 	return 0;
diff --git a/drivers/infiniband/hw/ipath/ipath_ud.c b/drivers/infiniband/hw/ipath/ipath_ud.c
index 136dc4c..048fbd0 100644
--- a/drivers/infiniband/hw/ipath/ipath_ud.c
+++ b/drivers/infiniband/hw/ipath/ipath_ud.c
@@ -71,8 +71,6 @@ static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_swqe *swqe)
 		goto done;
 	}
-	rsge.sg_list = NULL;
-
 	/*
 	 * Check that the qkey matches (except for QP0, see 9.6.1.4.1).
 	 * Qkeys with the high order bit set mean use the
@@ -116,21 +114,6 @@ static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_swqe *swqe)
 		rq = &qp->r_rq;
 	}
-	if (rq->max_sge > 1) {
-		/*
-		 * XXX We could use GFP_KERNEL if ipath_do_send()
-		 * was always called from the tasklet instead of
-		 * from ipath_post_send().
-		 */
-		rsge.sg_list = kmalloc((rq->max_sge - 1) *
-					sizeof(struct ipath_sge),
-				       GFP_ATOMIC);
-		if (!rsge.sg_list) {
-			dev->n_pkt_drops++;
-			goto drop;
-		}
-	}
-
 	/*
 	 * Get the next work request entry to find where to put the data.
 	 * Note that it is safe to drop the lock after changing rq->tail
@@ -148,6 +131,7 @@ static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_swqe *swqe)
 		goto drop;
 	}
 	wqe = get_rwqe_ptr(rq, tail);
+	rsge.sg_list = qp->r_ud_sg_list;
 	if (!ipath_init_sge(qp, wqe, &rlen, &rsge)) {
 		spin_unlock_irqrestore(&rq->lock, flags);
 		dev->n_pkt_drops++;
@@ -246,7 +230,6 @@ static void ipath_ud_loopback(struct ipath_qp *sqp, struct ipath_swqe *swqe)
 	ipath_cq_enter(to_icq(qp->ibqp.recv_cq), &wc,
 		       swqe->wr.send_flags & IB_SEND_SOLICITED);
 drop:
-	kfree(rsge.sg_list);
 	if (atomic_dec_and_test(&qp->refcount))
 		wake_up(&qp->wait);
 done:;
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h
index 9d12ae8..11e3f61 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.h
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.h
@@ -431,6 +431,7 @@ struct ipath_qp {
 	u32 s_lsn;		/* limit sequence number (credit) */
 	struct ipath_swqe *s_wq;	/* send work queue */
 	struct ipath_swqe *s_wqe;
+	struct ipath_sge *r_ud_sg_list;
 	struct ipath_rq r_rq;		/* receive work queue */
 	struct ipath_sge r_sg_list[0];	/* verified SGEs */
 };

From vlad at lists.openfabrics.org Fri Oct 24 03:18:24 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri, 24 Oct 2008 03:18:24 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20081024-0200 daily build status
Message-ID: <20081024101824.5F262E60BA4@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18-8.el5

Failed:

From kelly at tradebotsystems.com Fri Oct 24 07:00:32 2008
From: kelly at tradebotsystems.com (Kelly Burkhart)
Date: Fri, 24 Oct 2008 09:00:32 -0500
Subject: [ofa-general] ibverbs help
Message-ID: <98B0CDCB28A5EE4CB3678CD99406644E343445@tbmail2.tradebot.com>

Hello,

I have written a simple program as my first foray into verbs
programming. The intent is to create a connection and do a single send.
The logic (ripped primarily from rc_pingpong) is this: get device ibv_open_device ibv_alloc_pd allocate buffer ibv_reg_mr ibv_create_cq ibv_create_qp ibv_modify_qp (set to INIT) ibv_query_port (exchange lid/qp_num with peer) ibv_modify_qp (set to RTR, set remote qpnum, lid) ibv_modify_qp (set to RTS) SERVER: ibv_post_recv ibv_poll_cq (loop) CLIENT: ibv_post_send ibv_poll_cq (loop) My assumption is that when I call ibv_post_send or ibv_post_recv, that I should be able to receive notification of the completion of that call by spinning on ibv_poll_cq until the work completion is available. However, my program spins on ibv_poll_cq indefinitely. Is there an error in my understanding? Or an error in my program (attached). To run: test_ibv_simple (run as server) test_ibv_simple asdf (any arg will cause client) Thanks, -Kelly -------------- next part -------------- A non-text attachment was scrubbed... Name: test_ibv_simple.cpp Type: application/octet-stream Size: 15690 bytes Desc: test_ibv_simple.cpp URL: From rdreier at cisco.com Fri Oct 24 08:40:15 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 24 Oct 2008 08:40:15 -0700 Subject: [ofa-general] ibverbs help In-Reply-To: <98B0CDCB28A5EE4CB3678CD99406644E343445@tbmail2.tradebot.com> (Kelly Burkhart's message of "Fri, 24 Oct 2008 09:00:32 -0500") References: <98B0CDCB28A5EE4CB3678CD99406644E343445@tbmail2.tradebot.com> Message-ID: > My assumption is that when I call ibv_post_send or ibv_post_recv, that I > should be able to receive notification of the completion of that call by > spinning on ibv_poll_cq until the work completion is available. > However, my program spins on ibv_poll_cq indefinitely. Is there an > error in my understanding? Or an error in my program (attached). You are correct that every send request should generate a completion queue entry (as long as you request signaling, and your code does send IBV_SEND_SIGNALED). 
Assuming that the tests like ibv_rc_pingpong work on your system, then a bug in your code is by far the most likely explanation. - R. From rdreier at cisco.com Fri Oct 24 08:41:07 2008 From: rdreier at cisco.com (Roland Dreier) Date: Fri, 24 Oct 2008 08:41:07 -0700 Subject: [ofa-general] Re: [PATCH] maintainers: moderated mailing list In-Reply-To: <20081023183805.92bc39ab.randy.dunlap@oracle.com> (Randy Dunlap's message of "Thu, 23 Oct 2008 18:38:05 -0700") References: <20081023183805.92bc39ab.randy.dunlap@oracle.com> Message-ID: > I got the "list is moderated message," so add it here. ugh, our list isn't supposed to be moderated. Anyone know who's in charge of the mail server now? Can we get this fixed? From dotanba at gmail.com Fri Oct 24 09:00:40 2008 From: dotanba at gmail.com (Dotan Barak) Date: Fri, 24 Oct 2008 18:00:40 +0200 Subject: [ofa-general] ibverbs help In-Reply-To: <98B0CDCB28A5EE4CB3678CD99406644E343445@tbmail2.tradebot.com> References: <98B0CDCB28A5EE4CB3678CD99406644E343445@tbmail2.tradebot.com> Message-ID: <4901F128.3050402@gmail.com> Kelly Burkhart wrote: > Hello, > > I have written a simple program as my first foray into verbs > programming. The intent is to create a connection and do a single send. > The logic (ripped primarily from rc_pingpong) is this: > > get device > ibv_open_device > ibv_alloc_pd > allocate buffer > ibv_reg_mr > ibv_create_cq > ibv_create_qp > ibv_modify_qp (set to INIT) > ibv_query_port (exchange lid/qp_num with peer) > ibv_modify_qp (set to RTR, set remote qpnum, lid) > ibv_modify_qp (set to RTS) > > SERVER: > ibv_post_recv > ibv_poll_cq (loop) > > CLIENT: > ibv_post_send > ibv_poll_cq (loop) > > My assumption is that when I call ibv_post_send or ibv_post_recv, that I > should be able to receive notification of the completion of that call by > spinning on ibv_poll_cq until the work completion is available. > However, my program spins on ibv_poll_cq indefinitely. Is there an > error in my understanding? 
Or an error in my program (attached). > > To run: > > test_ibv_simple (run as server) > test_ibv_simple asdf (any arg will cause client) > I think that i found a bug in you code: while( (ne = ibv_poll_cq( cq_, 1, &wc )) < 1 ); if (wc.status != IBV_WC_SUCCESS) throw std::runtime_error( "Channel::read -- ibv_poll_cq revealed failed recv" ); If there is a problem in the CQ and the return value is negative, you won't notice it ... I'm quite sure that this is not the only issue in your code. Dotan From cameron at harr.org Fri Oct 24 09:01:16 2008 From: cameron at harr.org (Cameron Harr) Date: Fri, 24 Oct 2008 10:01:16 -0600 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48FF60D3.9020809@harr.org> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> Message-ID: <4901F14C.6000006@harr.org> Cameron Harr wrote: > > > Vladislav Bolkhovitin wrote: >> Cameron Harr wrote: >>> Vladislav Bolkhovitin wrote: >>>> Cameron Harr wrote: >>>>> Vladislav Bolkhovitin wrote: >>>>>> I guess, you use a regular caching IO? The lowest packet size it >>>>>> can produce is a PAGE_SIZE (4K). Target can't change it. You can >>>>>> have lower packets only with O_DIRECT or sg interface. But I'm >>>>>> not sure it will be performance effective. >>>>> I do everything with Direct IO, which is automatic when using the >>>>> BLOCKIO method in SCST. >>>> I meant on initiator(s), not on the target. 
>>>> >>> Sorry - but yes, I always run the benchmark apps with direct IO >> >> Then, there's one more reason why we should find out the cause of >> such a big variation between runs. Can you repeat all the tests with >> the latest SCST SVN trunk/ including SRPT driver with each run for at >> least few minutes? > > From a little testing, the updated SCST tree doesn't work with the > OFED-1.3.1 SRP stack, though I have gotten it working with the > infiniband drivers in the normal distribution kernel. Shall I use > those modules? Ok, I've done some testing with elevator=noop, with scst_threads=[123] and srpt thread=[01]. I ran with both 4k blocks and 512B blocks, random writes with 60s per test. Unfortunately, it looks like I can't seem to reproduce the numbers I had before - I believe the reporting mechanism I used earlier (script that uses /proc/diskstats) gave me invalid results. This time I have calculated iops straight from the FIO results. One interesting note is that in almost every case srpt thread=1 gives better performance. 
type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 iops=51134.20
type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 iops=63461.86
type=randwrite bs=4k drives=1 scst_threads=2 srptthread=0 iops=52383.10
type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 iops=54065.52
type=randwrite bs=4k drives=1 scst_threads=3 srptthread=0 iops=48827.27
type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 iops=52703.82
type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 iops=64619.11
type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 iops=62605.09
type=randwrite bs=4k drives=2 scst_threads=2 srptthread=0 iops=67961.56
type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 iops=78884.72
type=randwrite bs=4k drives=2 scst_threads=3 srptthread=0 iops=70340.04
type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 iops=76253.60
type=randwrite bs=4k drives=3 scst_threads=1 srptthread=0 iops=53777.02
type=randwrite bs=4k drives=3 scst_threads=1 srptthread=1 iops=64661.21
type=randwrite bs=4k drives=3 scst_threads=2 srptthread=0 iops=91073.05
type=randwrite bs=4k drives=3 scst_threads=2 srptthread=1 iops=90127.98
type=randwrite bs=4k drives=3 scst_threads=3 srptthread=0 iops=92012.13
type=randwrite bs=4k drives=3 scst_threads=3 srptthread=1 iops=96848.61
type=randwrite bs=512 drives=1 scst_threads=1 srptthread=0 iops=55040.20
type=randwrite bs=512 drives=1 scst_threads=1 srptthread=1 iops=62057.33
type=randwrite bs=512 drives=1 scst_threads=2 srptthread=0 iops=60237.05
type=randwrite bs=512 drives=1 scst_threads=2 srptthread=1 iops=63465.54
type=randwrite bs=512 drives=1 scst_threads=3 srptthread=0 iops=58716.01
type=randwrite bs=512 drives=1 scst_threads=3 srptthread=1 iops=60089.11
type=randwrite bs=512 drives=2 scst_threads=1 srptthread=0 iops=64978.41
type=randwrite bs=512 drives=2 scst_threads=1 srptthread=1 iops=64018.47
type=randwrite bs=512 drives=2 scst_threads=2 srptthread=0 iops=78128.56
type=randwrite bs=512 drives=2 scst_threads=2
srptthread=1 iops=94561.47 type=randwrite bs=512 drives=2 scst_threads=3 srptthread=0 iops=82526.52 type=randwrite bs=512 drives=2 scst_threads=3 srptthread=1 iops=105874.51 type=randwrite bs=512 drives=3 scst_threads=1 srptthread=0 iops=56730.70 type=randwrite bs=512 drives=3 scst_threads=1 srptthread=1 iops=62147.04 type=randwrite bs=512 drives=3 scst_threads=2 srptthread=0 iops=87507.15 type=randwrite bs=512 drives=3 scst_threads=2 srptthread=1 iops=95781.40 type=randwrite bs=512 drives=3 scst_threads=3 srptthread=0 iops=91645.99 type=randwrite bs=512 drives=3 scst_threads=3 srptthread=1 iops=114164.39 From Jeffrey.C.Becker at nasa.gov Fri Oct 24 09:05:29 2008 From: Jeffrey.C.Becker at nasa.gov (Jeff Becker) Date: Fri, 24 Oct 2008 09:05:29 -0700 Subject: [ofa-general] Re: [PATCH] maintainers: moderated mailing list In-Reply-To: References: <20081023183805.92bc39ab.randy.dunlap@oracle.com> Message-ID: <4901F249.1030206@nasa.gov> Hi Roland Roland Dreier wrote: > > I got the "list is moderated message," so add it here. > > ugh, our list isn't supposed to be moderated. Anyone know who's in > charge of the mail server now? Can we get this fixed? > Currently, I run the mail server. We switched to moderated lists in mid-August due to a huge increase in spam. 
-jeff > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From cameron at harr.org Fri Oct 24 09:04:04 2008 From: cameron at harr.org (Cameron Harr) Date: Fri, 24 Oct 2008 10:04:04 -0600 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <4900BAA3.4070601@vlnb.net> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <48FF64EE.5050102@vlnb.net> <48FF8FF0.2050308@mellanox.com> <48FF91CA.1060603@harr.org> <490069AA.7010000@vlnb.net> <49007D6D.1020100@harr.org> <4900BAA3.4070601@vlnb.net> Message-ID: <4901F1F4.4000609@harr.org> Vladislav Bolkhovitin wrote: >>> Were the IOPS, CS and IRQ rates results stable and consistent >>> between runs? >> >> I don't remember about the IOPs, but CS and IRQ rates were stable. >> However, they've also become stable in my later runs with >> scst_threads=[123]. > > But with srpt_thread=0 - not, right? I don't remember unfortunately. And during the running of my last tests, I had a lot of badness on the initiator that is making reporting and running things difficult right now (main drive has symptoms of being full and it doesn't show up in df and it shouldn't be close to full). 
From kelly at tradebotsystems.com Fri Oct 24 09:17:04 2008 From: kelly at tradebotsystems.com (Kelly Burkhart) Date: Fri, 24 Oct 2008 11:17:04 -0500 Subject: [ofa-general] ibverbs help Message-ID: <98B0CDCB28A5EE4CB3678CD99406644E34344F@tbmail2.tradebot.com> > -----Original Message----- > From: Dotan Barak [mailto:dotanba at gmail.com] > I think that i found a bug in you code: > > while( (ne = ibv_poll_cq( cq_, 1, &wc )) < 1 ); > if (wc.status != IBV_WC_SUCCESS) > throw std::runtime_error( "Channel::read -- > ibv_poll_cq revealed > failed recv" ); > > > If there is a problem in the CQ and the return value is negative, you > won't notice it ... > Thanks for the reply Dotan, I've changed it to: for(;;) { ne = ibv_poll_cq( cq_, 5, wc ); if (ne < 0) throw std::runtime_error(global.errmsg("ibv_poll_cq failed: ")); if (ne == 0) continue; for( int idx = 0; idx < ne; ++idx ) { if (wc[idx].status != IBV_WC_SUCCESS) throw std::runtime_error( "Channel::write -- ibv_poll_cq revealed failed send" ); } break; } Same result; now we know that ibv_poll_cq is not failing. > I'm quite sure that this is not the only issue in your code. Since I've only been at this a few days, I'm quite certain you're right. Unfortunately I don't know what to look for to diagnose the problem. I'll continue to bang my head against this, in the mean time, if any of you could suggest possible places for me to look it would be much appreciated. Thanks, -Kelly From dotanba at gmail.com Fri Oct 24 10:06:23 2008 From: dotanba at gmail.com (Dotan Barak) Date: Fri, 24 Oct 2008 19:06:23 +0200 Subject: [ofa-general] ibverbs help In-Reply-To: <98B0CDCB28A5EE4CB3678CD99406644E34344F@tbmail2.tradebot.com> References: <98B0CDCB28A5EE4CB3678CD99406644E34344F@tbmail2.tradebot.com> Message-ID: <4902008F.3070707@gmail.com> > Since I've only been at this a few days, I'm quite certain you're right. > Unfortunately I don't know what to look for to diagnose the problem. 
> I'll continue to bang my head against this, in the mean time, if any of > you could suggest possible places for me to look it would be much > appreciated. > What exactly is the symptom of your problem? Dotan From dledford at redhat.com Fri Oct 24 10:09:53 2008 From: dledford at redhat.com (Doug Ledford) Date: Fri, 24 Oct 2008 13:09:53 -0400 Subject: [ofa-general] Re: [PATCH] opensm/scripts: handling opensm config file In-Reply-To: <20081023193450.GF25831@sashak.voltaire.com> References: <48FF3F70.8000905@dev.mellanox.co.il> <20081022152246.GU20450@sashak.voltaire.com> <48FF51EE.4020804@dev.mellanox.co.il> <20081022162416.GY20450@sashak.voltaire.com> <1224694469.1197.367.camel@cardanus.llnl.gov> <1224788281.8879.100.camel@firewall.xsintricity.com> <20081023193450.GF25831@sashak.voltaire.com> Message-ID: <1224868193.4845.16.camel@firewall.xsintricity.com> On Thu, 2008-10-23 at 21:34 +0200, Sasha Khapyorsky wrote: > Hi Doug, > > On 14:58 Thu 23 Oct , Doug Ledford wrote: > > > > Well, I don't use the OFED scripts anyway. They aren't LSB compliant in > > so many ways it's not worth discussing. Plus they do things that that > > should not be done in a production environment, or things that should be > > handled via other scripts. So, it's of little importance to me. > > Why to not help us to make it in a proper way? We are discussing this > right now. Well, for the most part, because your idea of the "proper way" and my idea of that are usually two different things. All the ofed scripts throw everything, including the kitchen sink, into one big script that does way more than it should (in fairness, opensm is much better than the openibd script). 
However, as far as requirements I have for the opensmd init script, they are basically to be LSB compliant (function names and return codes, as well as operation of the status function and return codes from the status function), to be clean and easy to read and understand (as much as is possible anyway), and to not assume things that aren't safe to assume (the whole copy around the guid2lid file using ssh is just a broken assumption and gets stripped from my init scripts). > > However, if you guys have moved the opensm stuff to /etc/opensm just to > > have a single opensm.conf file in there, then I have to wonder why > > bother with an /etc/opensm directory? > > It is not just a single file. In addition to opensm.conf OpenSM by > default will look in this directory (directory name is configurable btw) > for partition, route prefixes configuration and QoS policy files. > > > Are there other files in there by > > default now? > > There are nothing by default, but user may put files there. > > > You don't save any /etc/ directory namespace pollution if > > you create a subdirectory for a single file. Oh well, that doesn't > > matter too much to me either. All the packages I maintain use /etc/ofed > > for rhel4 and rhel5, and will use /etc/rdma for fedora and rhel6 and > > later. > > Directory name can be configured if somebody cares. However OpenSM itself > does not required OFED or RDMA to be installed, we are fine to run in > stand-alone mode. Then '/etc/opensm' (or actually ${sysconfdir}/opensm) > looks fine for me. Of course you require RDMA. You don't require libibverbs, but libibverbs != rdma. InfiniBand/iWARP = rdma hardware. The point of the /etc/rdma directory is to be a focal area for rdma hardware related files, and all of opensm's files certainly qualify as that. After all, it's not like opensm manages Cisco VLAN subnets on gigabit ethernet...it manages IB fabrics and that's it, and they are always RDMA fabrics. 
--
Doug Ledford
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 197 bytes
Desc: This is a digitally signed message part
URL:

From kelly at tradebotsystems.com Fri Oct 24 10:31:38 2008
From: kelly at tradebotsystems.com (Kelly Burkhart)
Date: Fri, 24 Oct 2008 12:31:38 -0500
Subject: [ofa-general] ibverbs help
Message-ID: <98B0CDCB28A5EE4CB3678CD99406644E343452@tbmail2.tradebot.com>

> -----Original Message-----
> From: Dotan Barak [mailto:dotanba at gmail.com]
>
> > Since I've only been at this a few days, I'm quite certain you're right.
> > Unfortunately I don't know what to look for to diagnose the problem.
> > I'll continue to bang my head against this, in the mean time, if any of
> > you could suggest possible places for me to look it would be much
> > appreciated.
> >
> What exactly is the symptom of your problem?

I call ibv_post_send or ibv_post_recv, then loop calling ibv_poll_cq waiting for the completion event from the post. It never comes and I loop forever. None of the calls return an error status.
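One way to narrow down a hang like the one described above is to bound the spin, so that a completion that never arrives shows up as a timeout instead of an infinite loop. The following is a minimal C++ sketch under stated assumptions, not code from the attached program: `poll_until`, `PollResult`, and the `poll_once` callable are hypothetical names, and `poll_once` merely stands in for a call such as `ibv_poll_cq(cq, 1, &wc)`, which returns a negative value on failure, 0 when the CQ is empty, and the number of completions otherwise.

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>

// Outcome of a bounded polling attempt (hypothetical type for this sketch).
enum class PollResult { Completed, TimedOut };

// poll_once is a hypothetical stand-in for a call such as ibv_poll_cq():
// negative on error, 0 when nothing has completed yet, and a positive
// count of completions otherwise.
PollResult poll_until(const std::function<int()>& poll_once,
                      std::chrono::milliseconds timeout)
{
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        const int ne = poll_once();
        if (ne < 0)
            throw std::runtime_error("poll failed");  // the poll call itself reported an error
        if (ne > 0)
            return PollResult::Completed;             // at least one completion arrived
        if (std::chrono::steady_clock::now() >= deadline)
            return PollResult::TimedOut;              // still nothing: give up instead of spinning forever
    }
}
```

In a real program the callable would wrap the actual `ibv_poll_cq` call, and the caller would still need to check `wc.status` for each completion returned. A timeout here would only say that no completion was generated in time; it does not by itself identify the cause.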
-K From vst at vlnb.net Fri Oct 24 11:16:14 2008 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Fri, 24 Oct 2008 22:16:14 +0400 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <4901F14C.6000006@harr.org> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> Message-ID: <490210EE.2070000@vlnb.net> Cameron Harr wrote: > Cameron Harr wrote: >> >> Vladislav Bolkhovitin wrote: >>> Cameron Harr wrote: >>>> Vladislav Bolkhovitin wrote: >>>>> Cameron Harr wrote: >>>>>> Vladislav Bolkhovitin wrote: >>>>>>> I guess, you use a regular caching IO? The lowest packet size it >>>>>>> can produce is a PAGE_SIZE (4K). Target can't change it. You can >>>>>>> have lower packets only with O_DIRECT or sg interface. But I'm >>>>>>> not sure it will be performance effective. >>>>>> I do everything with Direct IO, which is automatic when using the >>>>>> BLOCKIO method in SCST. >>>>> I meant on initiator(s), not on the target. >>>>> >>>> Sorry - but yes, I always run the benchmark apps with direct IO >>> Then, there's one more reason why we should find out the cause of >>> such a big variation between runs. Can you repeat all the tests with >>> the latest SCST SVN trunk/ including SRPT driver with each run for at >>> least few minutes? 
>> From a little testing, the updated SCST tree doesn't work with the >> OFED-1.3.1 SRP stack, though I have gotten it working with the >> infiniband drivers in the normal distribution kernel. Shall I use >> those modules? > > Ok, I've done some testing with elevator=noop, with scst_threads=[123] > and srpt thread=[01]. I ran with both 4k blocks and 512B blocks, random > writes with 60s per test. Unfortunately, it looks like I can't seem to > reproduce the numbers I had before - I believe the reporting mechanism I > used earlier (script that uses /proc/diskstats) gave me invalid results. > This time I have calculated iops straight from the FIO results. One > interesting note is that in almost every case srpt thread=1 gives better > performance. Strange, indeed. Do you use the latest SVN trunk? Did you use the real drives or NULLIO? What is your FIO script? How do you calculate IOPS rate? It would be interesting to know "vmstat 1" and "top d1" output during runs. Top should show stats for all CPUs, not only aggregate value. 
> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 iops=51134.20
> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 iops=63461.86
> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=0 iops=52383.10
> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 iops=54065.52
> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=0 iops=48827.27
> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 iops=52703.82
> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 iops=64619.11
> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 iops=62605.09
> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=0 iops=67961.56
> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 iops=78884.72
> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=0 iops=70340.04
> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 iops=76253.60
> type=randwrite bs=4k drives=3 scst_threads=1 srptthread=0 iops=53777.02
> type=randwrite bs=4k drives=3 scst_threads=1 srptthread=1 iops=64661.21
> type=randwrite bs=4k drives=3 scst_threads=2 srptthread=0 iops=91073.05
> type=randwrite bs=4k drives=3 scst_threads=2 srptthread=1 iops=90127.98
> type=randwrite bs=4k drives=3 scst_threads=3 srptthread=0 iops=92012.13
> type=randwrite bs=4k drives=3 scst_threads=3 srptthread=1 iops=96848.61
> type=randwrite bs=512 drives=1 scst_threads=1 srptthread=0 iops=55040.20
> type=randwrite bs=512 drives=1 scst_threads=1 srptthread=1 iops=62057.33
> type=randwrite bs=512 drives=1 scst_threads=2 srptthread=0 iops=60237.05
> type=randwrite bs=512 drives=1 scst_threads=2 srptthread=1 iops=63465.54
> type=randwrite bs=512 drives=1 scst_threads=3 srptthread=0 iops=58716.01
> type=randwrite bs=512 drives=1 scst_threads=3 srptthread=1 iops=60089.11
> type=randwrite bs=512 drives=2 scst_threads=1 srptthread=0 iops=64978.41
> type=randwrite bs=512 drives=2 scst_threads=1 srptthread=1 iops=64018.47
> type=randwrite bs=512 drives=2 scst_threads=2 srptthread=0
iops=78128.56 > type=randwrite bs=512 drives=2 scst_threads=2 srptthread=1 iops=94561.47 > type=randwrite bs=512 drives=2 scst_threads=3 srptthread=0 iops=82526.52 > type=randwrite bs=512 drives=2 scst_threads=3 srptthread=1 iops=105874.51 > type=randwrite bs=512 drives=3 scst_threads=1 srptthread=0 iops=56730.70 > type=randwrite bs=512 drives=3 scst_threads=1 srptthread=1 iops=62147.04 > type=randwrite bs=512 drives=3 scst_threads=2 srptthread=0 iops=87507.15 > type=randwrite bs=512 drives=3 scst_threads=2 srptthread=1 iops=95781.40 > type=randwrite bs=512 drives=3 scst_threads=3 srptthread=0 iops=91645.99 > type=randwrite bs=512 drives=3 scst_threads=3 srptthread=1 iops=114164.39 > > > From a.beregalov at gmail.com Thu Oct 23 16:32:55 2008 From: a.beregalov at gmail.com (Alexander Beregalov) Date: Fri, 24 Oct 2008 03:32:55 +0400 Subject: [ofa-general] ***SPAM*** [PATCH] mlx4/profile.c: fix warning res_name defined but not used Message-ID: <20081023233255.GB14519@orion> Signed-off-by: Alexander Beregalov --- drivers/net/mlx4/profile.c | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/drivers/net/mlx4/profile.c b/drivers/net/mlx4/profile.c index 9ca42b2..ec9383d 100644 --- a/drivers/net/mlx4/profile.c +++ b/drivers/net/mlx4/profile.c @@ -52,6 +52,7 @@ enum { MLX4_RES_NUM }; +#ifdef CONFIG_MLX4_DEBUG static const char *res_name[] = { [MLX4_RES_QP] = "QP", [MLX4_RES_RDMARC] = "RDMARC", @@ -65,6 +66,7 @@ static const char *res_name[] = { [MLX4_RES_MTT] = "MTT", [MLX4_RES_MCG] = "MCG", }; +#endif u64 mlx4_make_profile(struct mlx4_dev *dev, struct mlx4_profile *request, From randy.dunlap at oracle.com Thu Oct 23 18:04:12 2008 From: randy.dunlap at oracle.com (Randy Dunlap) Date: Thu, 23 Oct 2008 18:04:12 -0700 Subject: [ofa-general] Re: linux-next: Tree for October 23 In-Reply-To: <20081023213637.eff9b414.sfr@canb.auug.org.au> References: <20081023213637.eff9b414.sfr@canb.auug.org.au> Message-ID: <20081023180412.394d40c2.randy.dunlap@oracle.com> On 
Thu, 23 Oct 2008 21:36:37 +1100 Stephen Rothwell wrote:

> Hi all,

Building with CONFIG_INFINIBAND=m, kconfig allows CONFIG_NET_9P_RDMA=m, so one module wants symbols from the other (net/9p wants symbols from rdma_*).

ERROR: "rdma_destroy_id" [net/9p/9pnet_rdma.ko] undefined!
ERROR: "rdma_connect" [net/9p/9pnet_rdma.ko] undefined!
ERROR: "rdma_create_id" [net/9p/9pnet_rdma.ko] undefined!
ERROR: "rdma_create_qp" [net/9p/9pnet_rdma.ko] undefined!
ERROR: "rdma_resolve_route" [net/9p/9pnet_rdma.ko] undefined!
ERROR: "rdma_disconnect" [net/9p/9pnet_rdma.ko] undefined!
ERROR: "rdma_resolve_addr" [net/9p/9pnet_rdma.ko] undefined!

Is this supposed to be allowed/possible? Otherwise NET_9P_RDMA might have to depend on INFINIBAND=y...

---
~Randy

From randy.dunlap at oracle.com Thu Oct 23 18:38:05 2008
From: randy.dunlap at oracle.com (Randy Dunlap)
Date: Thu, 23 Oct 2008 18:38:05 -0700
Subject: [ofa-general] [PATCH] maintainers: moderated mailing list
Message-ID: <20081023183805.92bc39ab.randy.dunlap@oracle.com>

From: Randy Dunlap

I got the "list is moderated message," so add it here.
Signed-off-by: Randy Dunlap --- MAINTAINERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- linux-next-20081023.orig/MAINTAINERS +++ linux-next-20081023/MAINTAINERS @@ -2144,7 +2144,7 @@ P: Sean Hefty M: sean.hefty at intel.com P: Hal Rosenstock M: hal.rosenstock at gmail.com -L: general at lists.openfabrics.org +L: general at lists.openfabrics.org (moderated for non-subscribers) W: http://www.openib.org/ T: git kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git S: Supported From cameron at harr.org Fri Oct 24 12:38:32 2008 From: cameron at harr.org (Cameron Harr) Date: Fri, 24 Oct 2008 13:38:32 -0600 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <48F79CA9.8090806@vlnb.net> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48F79CA9.8090806@vlnb.net> Message-ID: <49022438.9030903@harr.org> Vladislav Bolkhovitin wrote: >> ** Sometimes the benchmark "zombied" (process doing no work, but >> process can't be killed) after running a certain amount of time. >> However, it wasn't repeatable in a reliable way, so I mark that this >> particular run has zombied before. > > That means that there is a bug somewhere. Usually such bugs are found > in few hours of code auditing (srpt driver is pretty simple) or by > using kernel debug facilities (example diff to .config attached). I > personally always prefer put my effort on fixing real things, not > inventing various workarounds, like srpt_thread in this case. > > So I would: > > 1. Completely remove srpt thread and all related code. 
It doesn't do
> anything, which can't be done in SIRQ context (tasklet)
>
> 2. Audit the code to check if it does any action, which it shouldn't
> do on SIRQ and fix it. This step isn't required, but usually it saves
> a lot of time of puzzled debugging in the future.
>
> 3. Change in srpt_handle_rdma_comp() and srpt_handle_new_iu()
> SCST_CONTEXT_THREAD to SCST_CONTEXT_DIRECT_ATOMIC.

I also changed it in srpt_handle_err_comp()

>
> Then I would run the problematic tests (heavy tpc-h workload, e.g.) on
> debug kernel and fix found problems.
>
> Anyway, Cameron, can you get the latest code from SCST trunk and try
> with it? It was recently updated. Also please add the case with
> changes from (3) above.

This is all with version 1.0.1 of SCST (v532). In my fio test, I do runs
with srpt thread=1 and then =0. When it was set to zero during the test,
I got many errors printed out by FIO, and the target eventually crashed.
This is the first part of a long call trace.

NMI Watchdog detected LOCKUP on CPU 0
CPU 0
Modules linked in: ib_srpt(U) scst_vdisk(U) scst(U) fio_driver(PU) fio_port(PU) autofs4 hidp rfcomm l2cap bluetooth sunrpc ib_ipoib mlx4_ib ib_cm ib_sa ib_mad ib_core ipv6 xfrm_nalgo crypto_api nls_utf8 hfsplus dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport i2c_i801 shpchp i2c_core e1000e mlx4_core i5000_edac edac_mc pcspkr ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 25732, comm: scsi_tgt0 Tainted: P 2.6.18-92.1.13.el5 #1
RIP: 0010:[] [] .text.lock.spinlock+0x29/0x30
RSP: 0018:ffffffff80418a88 EFLAGS: 00000086
RAX: ffff810785307fd8 RBX: ffffffff884e68a0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffff884e68a0
RBP: ffffffff884e62a0 R08: ffff810790926900 R09: ffff8107909268e8
R10: 0000000000000018 R11: ffffffff884fcab3 R12: 0000000000000001
R13: 0000000000000001 R14: 0000000000000000 R15: ffff8107f0f374c0
FS: 0000000000000000(0000) GS:ffffffff803a0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000037bc0986d0 CR3: 0000000000201000 CR4: 00000000000006e0
Process scsi_tgt0 (pid: 25732, threadinfo ffff810785306000, task ffff810810852100)
Stack: 0000000000000000 ffffffff884c509d ffff8107909268e8 ffff810790926900
 00000002071dd688 0000020000000220 0000000000000200 00000000da984c08
 0000000000000000 ffff8107909267f0 ffff810806ceee20 0000000000000001
Call Trace:
 [] :scst:sgv_pool_alloc+0x10c/0x5d3
 [] :scst:scst_alloc_space+0x5b/0x106
 [] :scst:scst_process_active_cmd+0x4fc/0x131c
 [] :scst:scst_cmd_init_done+0x17f/0x3ef
 [] :ib_srpt:srpt_handle_new_iu+0x281/0x4e7
 [] :mlx4_ib:mlx4_ib_free_srq_wqe+0x27/0x4f
 [] :mlx4_ib:get_sw_cqe+0x12/0x30
 [] :mlx4_ib:mlx4_ib_poll_cq+0x432/0x48f
 [] :ib_srpt:srpt_completion+0x190/0x250
 [] :mlx4_core:mlx4_eq_int+0x3b/0x26f
 [] :mlx4_core:mlx4_msi_x_interrupt+0xf/0x17

From cameron at harr.org Fri Oct 24 12:43:15 2008
From: cameron at harr.org (Cameron Harr)
Date: Fri, 24 Oct 2008 13:43:15 -0600
Subject: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <490210EE.2070000@vlnb.net>
References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org>
	<48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org>
	<48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org>
	<48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org>
	<48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net>
	<48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org>
	<48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com>
	<48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org>
	<48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org>
	<48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org>
	<48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org>
	<4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net>
Message-ID: <49022553.1020804@harr.org>

Vladislav Bolkhovitin wrote:
> Cameron Harr wrote:
>>
>> Ok, I've done some testing with elevator=noop, with
>> scst_threads=[123] and srpt thread=[01]. I ran with both 4k blocks
>> and 512B blocks, random writes with 60s per test. Unfortunately, it
>> looks like I can't seem to reproduce the numbers I had before - I
>> believe the reporting mechanism I used earlier (script that uses
>> /proc/diskstats) gave me invalid results. This time I have calculated
>> iops straight from the FIO results. One interesting note is that in
>> almost every case srpt thread=1 gives better performance.
>
> Strange, indeed.
>
> Do you use the latest SVN trunk?

Almost - it was svn rev 532.

>
> Did you use the real drives or NULLIO?

Real drives

>
> What is your FIO script?

A variation on this:
fio/fio --rw=randwrite --bs=512 --size=20G --loops=10 --name=randwrite_512_sdc --numjobs=64 --runtime=60 --direct=1 --group_reporting --randrepeat=0 --softrandommap=1 --ioengine=libaio --iodepth=16 --filename=/dev/sdb --filename=/dev/sdc

>
>
> How do you calculate IOPS rate?

I divide the sum (if more than 1) of the "ios=" from a particular test
by the runtime.

>
> It would be interesting to know "vmstat 1" and "top d1" output during
> runs. Top should show stats for all CPUs, not only aggregate value.
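The IOPS figure described above (the sum of the "ios=" counts from one test divided by the runtime) can be sketched as follows. This is only a minimal illustration, assuming one "ios=" count per drive and a runtime in seconds; the function name iops_from_ios is hypothetical and is not part of fio or of the script used for these runs.

```c
#include <stddef.h>

/*
 * Aggregate IOPS for one test: add up the per-drive "ios=" counts and
 * divide by the runtime in seconds. The name and signature are
 * illustrative assumptions, not an existing API.
 * Returns 0.0 for an empty list or a non-positive runtime.
 */
static double iops_from_ios(const unsigned long long *ios, size_t ndev,
			    double runtime_s)
{
	unsigned long long total = 0;
	size_t i;

	if (ios == NULL || ndev == 0 || runtime_s <= 0.0)
		return 0.0;

	for (i = 0; i < ndev; i++)
		total += ios[i];

	return (double)total / runtime_s;
}
```

With two drives and a 60 second run, the two counts are added first and the sum is divided by 60, so a drive that finished fewer I/Os simply contributes less to the total.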
>
>> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 iops=51134.20
>> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 iops=63461.86
>> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=0 iops=52383.10
>> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 iops=54065.52
>> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=0 iops=48827.27
>> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 iops=52703.82
>> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 iops=64619.11
>> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 iops=62605.09
>> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=0 iops=67961.56
>> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 iops=78884.72
>> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=0 iops=70340.04
>> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 iops=76253.60
>> type=randwrite bs=4k drives=3 scst_threads=1 srptthread=0 iops=53777.02
>> type=randwrite bs=4k drives=3 scst_threads=1 srptthread=1 iops=64661.21
>> type=randwrite bs=4k drives=3 scst_threads=2 srptthread=0 iops=91073.05
>> type=randwrite bs=4k drives=3 scst_threads=2 srptthread=1 iops=90127.98
>> type=randwrite bs=4k drives=3 scst_threads=3 srptthread=0 iops=92012.13
>> type=randwrite bs=4k drives=3 scst_threads=3 srptthread=1 iops=96848.61
>> type=randwrite bs=512 drives=1 scst_threads=1 srptthread=0 iops=55040.20
>> type=randwrite bs=512 drives=1 scst_threads=1 srptthread=1 iops=62057.33
>> type=randwrite bs=512 drives=1 scst_threads=2 srptthread=0 iops=60237.05
>> type=randwrite bs=512 drives=1 scst_threads=2 srptthread=1 iops=63465.54
>> type=randwrite bs=512 drives=1 scst_threads=3 srptthread=0 iops=58716.01
>> type=randwrite bs=512 drives=1 scst_threads=3 srptthread=1 iops=60089.11
>> type=randwrite bs=512 drives=2 scst_threads=1 srptthread=0 iops=64978.41
>> type=randwrite bs=512 drives=2 scst_threads=1 srptthread=1 iops=64018.47
>> type=randwrite bs=512 drives=2 scst_threads=2 srptthread=0 iops=78128.56
>> type=randwrite bs=512 drives=2 scst_threads=2 srptthread=1 iops=94561.47
>> type=randwrite bs=512 drives=2 scst_threads=3 srptthread=0 iops=82526.52
>> type=randwrite bs=512 drives=2 scst_threads=3 srptthread=1 iops=105874.51
>> type=randwrite bs=512 drives=3 scst_threads=1 srptthread=0 iops=56730.70
>> type=randwrite bs=512 drives=3 scst_threads=1 srptthread=1 iops=62147.04
>> type=randwrite bs=512 drives=3 scst_threads=2 srptthread=0 iops=87507.15
>> type=randwrite bs=512 drives=3 scst_threads=2 srptthread=1 iops=95781.40
>> type=randwrite bs=512 drives=3 scst_threads=3 srptthread=0 iops=91645.99
>> type=randwrite bs=512 drives=3 scst_threads=3 srptthread=1 iops=114164.39
>>
>>
>>
>

From ofed at kononov.ftml.net Fri Oct 24 12:43:25 2008
From: ofed at kononov.ftml.net (Roman Kononov)
Date: Fri, 24 Oct 2008 14:43:25 -0500
Subject: [ofa-general] ibverbs help
In-Reply-To: <98B0CDCB28A5EE4CB3678CD99406644E343445@tbmail2.tradebot.com>
References: <98B0CDCB28A5EE4CB3678CD99406644E343445@tbmail2.tradebot.com>
Message-ID: <4902255D.3090105@kononov.ftml.net>

On 2008-10-24 09:00, Kelly Burkhart wrote:
> However, my program spins on ibv_poll_cq indefinitely. Is there an

Your difficulty is because doServer() makes two QP:
in "Channel *ch( new Channel );"
in "Channel *cch = ch->accept();".

Then the server passes the first QP's QPN to the client.
Then the server "connects" the second QP with the client's QP.
Then the server tries to receive from the second QP.

I would recommend you to avoid creating another Channel in
Channel::accept() (and close peerSock and sock_).

Regards,
Roman

From kelly at tradebotsystems.com Fri Oct 24 13:44:38 2008
From: kelly at tradebotsystems.com (Kelly Burkhart)
Date: Fri, 24 Oct 2008 15:44:38 -0500
Subject: [ofa-general] ibverbs help
Message-ID: <98B0CDCB28A5EE4CB3678CD99406644E343456@tbmail2.tradebot.com>

> -----Original Message-----
> From: Roman Kononov [mailto:ofed at kononov.ftml.net]
>
> Your difficulty is because doServer() makes two QP:
> in "Channel *ch( new Channel );"
> in "Channel *cch = ch->accept();".

Ah, I read from my "listening" channel instead of my connected channel.
Thank you for finding my silly mistake.

-K

From rdreier at cisco.com Fri Oct 24 15:59:41 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 24 Oct 2008 15:59:41 -0700
Subject: [ofa-general] Re: [PATCH] maintainers: moderated mailing list
In-Reply-To: <4901F249.1030206@nasa.gov> (Jeff Becker's message of "Fri, 24 Oct 2008 09:05:29 -0700")
References: <20081023183805.92bc39ab.randy.dunlap@oracle.com>
	<4901F249.1030206@nasa.gov>
Message-ID:

> Currently, I run the mail server. We switched to moderated lists in
> mid-August due to a huge increase in spam.

It was just that one backscatter flood targeting the ewg list, right?

It's really annoying to people trying to report bugs etc to get the
bounce about subscribers-only lists. Would it make sense to move the
hosting of the general list to vger.kernel.org, since they have a lot
more resources/experience as mail admins and seem to be able to run open
but relatively spam-free lists?

 - R.

From Jeffrey.C.Becker at nasa.gov Fri Oct 24 16:13:17 2008
From: Jeffrey.C.Becker at nasa.gov (Jeff Becker)
Date: Fri, 24 Oct 2008 16:13:17 -0700
Subject: [ofa-general] Re: [PATCH] maintainers: moderated mailing list
In-Reply-To:
References: <20081023183805.92bc39ab.randy.dunlap@oracle.com>
	<4901F249.1030206@nasa.gov>
Message-ID: <4902568D.1090107@nasa.gov>

Roland Dreier wrote:
> > Currently, I run the mail server. We switched to moderated lists in
> > mid-August due to a huge increase in spam.
>
> It was just that one backscatter flood targeting the ewg list, right?
>
> It's really annoying to people trying to report bugs etc to get the
> bounce about subscribers-only lists. Would it make sense to move the
> hosting of the general list to vger.kernel.org, since they have a lot
> more resources/experience as mail admins and seem to be able to run open
> but relatively spam-free lists?
>

That's OK with me.

-jeff

> - R.
>

From cameron at harr.org Fri Oct 24 16:59:01 2008
From: cameron at harr.org (Cameron Harr)
Date: Fri, 24 Oct 2008 17:59:01 -0600
Subject: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <490210EE.2070000@vlnb.net>
References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org>
	<48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org>
	<48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org>
	<48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org>
	<48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net>
	<48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org>
	<48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com>
	<48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org>
	<48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org>
	<48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org>
	<48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org>
	<4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net>
Message-ID: <49026145.4050006@harr.org>

Vladislav Bolkhovitin wrote:
>
> Strange, indeed.
>
> Did you use the real drives or NULLIO?
>

Here are some results with NULLIO, but they seem to hang when srpt
thread is set to 0 (this got a few runs in). Note that to get things
even running when srpt thread=0, I had to put the ib_srpt code back to
its original state.

type=randwrite bs=512 drives=1 scst_threads=1 srptthread=1 iops=113418.36
type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 iops=107773.57
type=randwrite bs=512 drives=2 scst_threads=1 srptthread=1 iops=147188.09
type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 iops=170401.06
type=randwrite bs=512 drives=3 scst_threads=1 srptthread=1 iops=194783.09
type=randwrite bs=4k drives=3 scst_threads=1 srptthread=1 iops=112113.57
type=randwrite bs=512 drives=1 scst_threads=2 srptthread=1 iops=88371.81
type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 iops=86334.84
type=randwrite bs=512 drives=2 scst_threads=2 srptthread=1 iops=177128.90
type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 iops=105784.42
type=randwrite bs=512 drives=3 scst_threads=2 srptthread=1 iops=125456.49
type=randwrite bs=4k drives=3 scst_threads=2 srptthread=1 iops=93726.40
type=randwrite bs=512 drives=1 scst_threads=3 srptthread=1 iops=137550.91
type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 iops=90684.18
type=randwrite bs=512 drives=2 scst_threads=3 srptthread=1 iops=182657.96
type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 iops=95166.77
type=randwrite bs=512 drives=3 scst_threads=3 srptthread=1 iops=184928.53
type=randwrite bs=4k drives=3 scst_threads=3 srptthread=1 iops=84169.93
type=randwrite bs=512 drives=1 scst_threads=1 srptthread=0 iops=139561.62
type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 iops=100328.18
type=randwrite bs=512 drives=2 scst_threads=1 srptthread=0 iops=206477.91
type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 iops=99723.22

>
>> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 iops=51134.20
>> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 iops=63461.86
>> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=0 iops=52383.10
>> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 iops=54065.52
>> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=0 iops=48827.27
>> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 iops=52703.82
>> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 iops=64619.11
>> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 iops=62605.09
>> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=0 iops=67961.56
>> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 iops=78884.72
>> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=0 iops=70340.04
>> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 iops=76253.60
>> type=randwrite bs=4k drives=3 scst_threads=1 srptthread=0 iops=53777.02
>> type=randwrite bs=4k drives=3 scst_threads=1 srptthread=1 iops=64661.21
>> type=randwrite bs=4k drives=3 scst_threads=2 srptthread=0 iops=91073.05
>> type=randwrite bs=4k drives=3 scst_threads=2 srptthread=1 iops=90127.98
>> type=randwrite bs=4k drives=3 scst_threads=3 srptthread=0 iops=92012.13
>> type=randwrite bs=4k drives=3 scst_threads=3 srptthread=1 iops=96848.61
>> type=randwrite bs=512 drives=1 scst_threads=1 srptthread=0 iops=55040.20
>> type=randwrite bs=512 drives=1 scst_threads=1 srptthread=1 iops=62057.33
>> type=randwrite bs=512 drives=1 scst_threads=2 srptthread=0 iops=60237.05
>> type=randwrite bs=512 drives=1 scst_threads=2 srptthread=1 iops=63465.54
>> type=randwrite bs=512 drives=1 scst_threads=3 srptthread=0 iops=58716.01
>> type=randwrite bs=512 drives=1 scst_threads=3 srptthread=1 iops=60089.11
>> type=randwrite bs=512 drives=2 scst_threads=1 srptthread=0 iops=64978.41
>> type=randwrite bs=512 drives=2 scst_threads=1 srptthread=1 iops=64018.47
>> type=randwrite bs=512 drives=2 scst_threads=2 srptthread=0 iops=78128.56
>> type=randwrite bs=512 drives=2 scst_threads=2 srptthread=1 iops=94561.47
>> type=randwrite bs=512 drives=2 scst_threads=3 srptthread=0 iops=82526.52
>> type=randwrite bs=512 drives=2 scst_threads=3 srptthread=1 iops=105874.51
>> type=randwrite bs=512 drives=3 scst_threads=1 srptthread=0 iops=56730.70
>> type=randwrite bs=512 drives=3 scst_threads=1 srptthread=1 iops=62147.04
>> type=randwrite bs=512 drives=3 scst_threads=2 srptthread=0 iops=87507.15
>> type=randwrite bs=512 drives=3 scst_threads=2 srptthread=1 iops=95781.40
>> type=randwrite bs=512 drives=3 scst_threads=3 srptthread=0 iops=91645.99
>> type=randwrite bs=512 drives=3 scst_threads=3 srptthread=1 iops=114164.39
>>

From chu11 at llnl.gov Fri Oct 24 17:08:28 2008
From: chu11 at llnl.gov (Al Chu)
Date: Fri, 24 Oct 2008 17:08:28 -0700
Subject: [ofa-general] [opensm][trivial] fix manpage typos
Message-ID: <1224893308.1197.437.camel@cardanus.llnl.gov>

Fix manpage typos. For those of us who cut & paste goofed recently :-)

Al

-- 
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-fix-manpage-typos.patch
Type: text/x-patch
Size: 1239 bytes
Desc: not available
URL:

From vlad at lists.openfabrics.org Sat Oct 25 03:15:36 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat, 25 Oct 2008 03:15:36 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20081025-0200 daily build status
Message-ID: <20081025101537.37356E60B1D@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.19
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5

Failed:

From olga.shern at gmail.com Sat Oct 25 06:53:55 2008
From: olga.shern at gmail.com (Olga Shern (Voltaire))
Date: Sat, 25 Oct 2008 15:53:55 +0200
Subject: [ofa-general] ***SPAM*** Re: [ewg] ***SPAM*** NFS-RDMA compilation problem
In-Reply-To:
References:
Message-ID:

Hi Amar,

I suggest you open a bug in the OpenFabrics bugzilla:
https://bugs.openfabrics.org/.

Thanks
Olga

On Thu, Oct 23, 2008 at 4:50 PM, Amar Mudrankit wrote:
> While I was trying to install OFED-1.4-rc3 over SLES 10 SP 2 with
> NFS-RDMA selected for installation, I got the following error message:
>
> nfs-utils-1.1.1 rpm is required to build kernel-ib
>
> I have downloaded and installed successfully, the nfs-utils-1.1.4
> **source .tgz** from http://www.kernel.org/pub/linux/utils/nfs,
> still I was hit with the same error message.
>
> I was not able to find out nfs-utils rpm that would install over SLES
> 10 SP 2. Can anybody please point me to the location of rpm? Why is
> OFED installation unable to detect the latest installation of nfs
> utils compiled from source and is fully dependent upon the rpm
> installation?
>
> Regards,
> Amar
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
>

From sashak at voltaire.com Sat Oct 25 08:01:47 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 25 Oct 2008 17:01:47 +0200
Subject: [ofa-general] Re: [opensm][trivial] fix manpage typos
In-Reply-To: <1224893308.1197.437.camel@cardanus.llnl.gov>
References: <1224893308.1197.437.camel@cardanus.llnl.gov>
Message-ID: <20081025150147.GK28713@sashak.voltaire.com>

On 17:08 Fri 24 Oct , Al Chu wrote:
> Fix manpage typos. For those of us who cut & paste goofed recently :-)
>
> Al
>
> -- 
> Albert Chu
> chu11 at llnl.gov
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory

> From 86ff61b970b82ba49ef1cc0eec854896720fc19b Mon Sep 17 00:00:00 2001
> From: Albert Chu
> Date: Fri, 24 Oct 2008 17:04:51 -0700
> Subject: [PATCH] fix manpage typos
>
>
> Signed-off-by: Albert Chu

Applied. Thanks.

Sasha

From sashak at voltaire.com Sat Oct 25 08:04:25 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 25 Oct 2008 17:04:25 +0200
Subject: [ofa-general] [PATCH] opensm: remove update_master_sm_base_lid field in PortInfo madw context
Message-ID: <20081025150425.GL28713@sashak.voltaire.com>

remove unused (always FALSE) update_master_sm_base_lid field from
PortInfo mad wrapper context.

Signed-off-by: Sasha Khapyorsky
---
 opensm/include/opensm/osm_madw.h  |    1 -
 opensm/opensm/osm_lid_mgr.c       |    1 -
 opensm/opensm/osm_link_mgr.c      |    1 -
 opensm/opensm/osm_node_info_rcv.c |    2 --
 opensm/opensm/osm_pkey_mgr.c      |    1 -
 opensm/opensm/osm_port_info_rcv.c |   15 ---------------
 opensm/opensm/osm_state_mgr.c     |    1 -
 opensm/opensm/osm_sw_info_rcv.c   |    1 -
 opensm/opensm/osm_trap_rcv.c      |    3 ---
 9 files changed, 0 insertions(+), 26 deletions(-)

diff --git a/opensm/include/opensm/osm_madw.h b/opensm/include/opensm/osm_madw.h
index 2843736..f47142d 100644
--- a/opensm/include/opensm/osm_madw.h
+++ b/opensm/include/opensm/osm_madw.h
@@ -172,7 +172,6 @@ typedef struct osm_pi_context {
 	ib_net64_t port_guid;
 	boolean_t set_method;
 	boolean_t light_sweep;
-	boolean_t update_master_sm_base_lid;
 	boolean_t active_transition;
 } osm_pi_context_t;
 /*********/
diff --git a/opensm/opensm/osm_lid_mgr.c b/opensm/opensm/osm_lid_mgr.c
index e0a6639..0c536a8 100644
--- a/opensm/opensm/osm_lid_mgr.c
+++ b/opensm/opensm/osm_lid_mgr.c
@@ -1086,7 +1086,6 @@ __osm_lid_mgr_set_physp_pi(IN osm_lid_mgr_t * const p_mgr,
 	context.pi_context.node_guid = osm_node_get_node_guid(p_node);
 	context.pi_context.port_guid = osm_physp_get_port_guid(p_physp);
 	context.pi_context.set_method = TRUE;
-	context.pi_context.update_master_sm_base_lid = FALSE;
 	context.pi_context.light_sweep = FALSE;
 	context.pi_context.active_transition = FALSE;
 
diff --git a/opensm/opensm/osm_link_mgr.c b/opensm/opensm/osm_link_mgr.c
index d60d60e..37e3e1b 100644
--- a/opensm/opensm/osm_link_mgr.c
+++ b/opensm/opensm/osm_link_mgr.c
@@ -302,7 +302,6 @@ __osm_link_mgr_set_physp_pi(osm_sm_t * sm,
 	context.pi_context.node_guid = osm_node_get_node_guid(p_node);
 	context.pi_context.port_guid = osm_physp_get_port_guid(p_physp);
 	context.pi_context.set_method = TRUE;
-	context.pi_context.update_master_sm_base_lid = FALSE;
 	context.pi_context.light_sweep = FALSE;
 
 	/* We need to send the PortInfoSet request with the new sm_lid
diff --git a/opensm/opensm/osm_node_info_rcv.c b/opensm/opensm/osm_node_info_rcv.c
index 984a8dd..20b16d1 100644
--- a/opensm/opensm/osm_node_info_rcv.c
+++ b/opensm/opensm/osm_node_info_rcv.c
@@ -288,7 +288,6 @@ __osm_ni_rcv_process_new_node(IN osm_sm_t * sm,
 	context.pi_context.node_guid = p_ni->node_guid;
 	context.pi_context.port_guid = p_ni->port_guid;
 	context.pi_context.set_method = FALSE;
-	context.pi_context.update_master_sm_base_lid = FALSE;
 	context.pi_context.light_sweep = FALSE;
 	context.pi_context.active_transition = FALSE;
 
@@ -478,7 +477,6 @@ __osm_ni_rcv_process_existing_ca_or_router(IN osm_sm_t * sm,
 	context.pi_context.node_guid = p_ni->node_guid;
 	context.pi_context.port_guid = p_ni->port_guid;
 	context.pi_context.set_method = FALSE;
-	context.pi_context.update_master_sm_base_lid = FALSE;
 	context.pi_context.light_sweep = FALSE;
 
 	status = osm_req_get(sm, osm_physp_get_dr_path_ptr(p_physp),
diff --git a/opensm/opensm/osm_pkey_mgr.c b/opensm/opensm/osm_pkey_mgr.c
index 925c1c7..9df8c85 100644
--- a/opensm/opensm/osm_pkey_mgr.c
+++ b/opensm/opensm/osm_pkey_mgr.c
@@ -226,7 +226,6 @@ pkey_mgr_enforce_partition(IN osm_log_t * p_log, osm_sm_t * sm,
 	    osm_node_get_node_guid(osm_physp_get_node_ptr(p_physp));
 	context.pi_context.port_guid = osm_physp_get_port_guid(p_physp);
 	context.pi_context.set_method = TRUE;
-	context.pi_context.update_master_sm_base_lid = FALSE;
 	context.pi_context.light_sweep = FALSE;
 	context.pi_context.active_transition = FALSE;
 
diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c
index efb8830..47eb457 100644
--- a/opensm/opensm/osm_port_info_rcv.c
+++ b/opensm/opensm/osm_port_info_rcv.c
@@ -638,21 +638,6 @@ void osm_pi_rcv_process(IN void *context, IN void *data)
 			p_smp->hop_count, p_smp->initial_path);
 	}
 
-	/*
-	   Check if the update_sm_base_lid in the context is TRUE.
-	   If it is - then update the master_sm_base_lid of the variable
-	   in the subnet.
-	 */
-	if (p_context->update_master_sm_base_lid == TRUE) {
-		OSM_LOG(sm->p_log, OSM_LOG_VERBOSE,
-			"update_master_sm is TRUE. "
-			"Updating master_sm_base_lid to:%u\n",
-			p_pi->master_sm_base_lid);
-
-		sm->p_subn->master_sm_base_lid =
-		    p_pi->master_sm_base_lid;
-	}
-
 	/* if port just inited or reached INIT state (external reset)
 	   request update for port related tables */
 	p_physp->need_update =
diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index ce010cb..174cee6 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -179,7 +179,6 @@ __osm_state_mgr_get_remote_port_info(IN osm_sm_t * sm,
 	mad_context.pi_context.port_guid = p_physp->port_guid;
 	mad_context.pi_context.set_method = FALSE;
 	mad_context.pi_context.light_sweep = TRUE;
-	mad_context.pi_context.update_master_sm_base_lid = FALSE;
 	mad_context.pi_context.active_transition = FALSE;
 
 	/* note that with some negative logic - if the query failed it means that
diff --git a/opensm/opensm/osm_sw_info_rcv.c b/opensm/opensm/osm_sw_info_rcv.c
index 99315b2..6ee1538 100644
--- a/opensm/opensm/osm_sw_info_rcv.c
+++ b/opensm/opensm/osm_sw_info_rcv.c
@@ -94,7 +94,6 @@ __osm_si_rcv_get_port_info(IN osm_sm_t * sm,
 	context.pi_context.node_guid = osm_node_get_node_guid(p_node);
 	context.pi_context.port_guid = osm_physp_get_port_guid(p_physp);
 	context.pi_context.set_method = FALSE;
-	context.pi_context.update_master_sm_base_lid = FALSE;
 	context.pi_context.light_sweep = FALSE;
 	context.pi_context.active_transition = FALSE;
 
diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
index e9a9c22..cf5e8a5 100644
--- a/opensm/opensm/osm_trap_rcv.c
+++ b/opensm/opensm/osm_trap_rcv.c
@@ -500,9 +500,6 @@ __osm_trap_rcv_process_request(IN osm_sm_t * sm,
 					    (p_physp);
 					context.pi_context.set_method =
 					    TRUE;
-					context.pi_context.
-					    update_master_sm_base_lid =
-					    FALSE;
 					context.pi_context.light_sweep =
 					    FALSE;
 					context.pi_context.
-- 
1.6.0.3.517.g759a

From sashak at voltaire.com Sat Oct 25 08:05:09 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 25 Oct 2008 17:05:09 +0200
Subject: [ofa-general] [PATCH] libibmad/dump: print more PortInfo:CapabilityMask bits
Message-ID: <20081025150509.GM28713@sashak.voltaire.com>

Support (show) new PortInfo:CapabilityMask bits -
IsOtherLocalChangesNoticeSupported and
IsLinkSpeedWidthPairsTabaleSupported.

Signed-off-by: Sasha Khapyorsky
---
 libibmad/src/dump.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/libibmad/src/dump.c b/libibmad/src/dump.c
index 4a780a7..de05d29 100644
--- a/libibmad/src/dump.c
+++ b/libibmad/src/dump.c
@@ -529,6 +529,10 @@ mad_dump_portcapmask(char *buf, int bufsz, void *val, int valsz)
 		s += sprintf(s, "\t\t\t\tIsLinkRoundTripLatencySupported\n");
 	if (mask & (1 << 25))
 		s += sprintf(s, "\t\t\t\tIsClientRegistrationSupported\n");
+	if (mask & (1 << 26))
+		s += sprintf(s, "\t\t\t\tIsOtherLocalChangesNoticeSupported\n");
+	if (mask & (1 << 27))
+		s += sprintf(s, "\t\t\t\tIsLinkSpeedWidthPairsTabaleSupported\n");
 
 	if (s != buf)
 		*(--s) = 0;
-- 
1.6.0.3.517.g759a

From sashak at voltaire.com Sat Oct 25 08:06:15 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 25 Oct 2008 17:06:15 +0200
Subject: [ofa-general] [PATCH] opensm: support more PortInfo:CapabilityMask bits
Message-ID: <20081025150615.GN28713@sashak.voltaire.com>

Support new PortInfo:CapabilityMask bits -
IsOtherLocalChangesNoticeSupported and
IsLinkSpeedWidthPairsTabaleSupported.
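
The two bits named above take the values 0x04000000 and 0x08000000 in the accompanying diff, i.e. bits 26 and 27, and the opensm definitions wrap them in CL_HTON32 because the mask is kept in network byte order. A minimal sketch of such a test follows, assuming POSIX htonl() as a stand-in for CL_HTON32; the CAP_* names and cap_is_set() are hypothetical and are not the opensm definitions.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Host-order values of the two bits; these names are illustrative only. */
#define CAP_OTHER_LOCAL_CHANGES_NTC	UINT32_C(0x04000000)	/* bit 26 */
#define CAP_LINK_SPEED_WIDTH_PAIRS_TBL	UINT32_C(0x08000000)	/* bit 27 */

/*
 * cap_mask_be is a CapabilityMask in network byte order. Converting the
 * host-order constant with htonl() before the AND gives the same answer
 * on little-endian and big-endian hosts.
 */
static int cap_is_set(uint32_t cap_mask_be, uint32_t cap_host)
{
	return (cap_mask_be & htonl(cap_host)) != 0;
}
```

Testing a network-order mask against an unconverted host-order constant would silently check a different bit on little-endian machines, which is why the conversion is applied to the constant rather than left to each caller.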

Signed-off-by: Sasha Khapyorsky
---
 opensm/include/iba/ib_types.h |    4 ++--
 opensm/opensm/osm_helper.c    |    8 ++++----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h
index 6ca3f3e..257b19c 100644
--- a/opensm/include/iba/ib_types.h
+++ b/opensm/include/iba/ib_types.h
@@ -4470,8 +4470,8 @@ typedef struct _ib_port_info {
 #define IB_PORT_CAP_HAS_BM (CL_HTON32(0x00800000))
 #define IB_PORT_CAP_HAS_LINK_RT_LATENCY (CL_HTON32(0x01000000))
 #define IB_PORT_CAP_HAS_CLIENT_REREG (CL_HTON32(0x02000000))
-#define IB_PORT_CAP_RESV26 (CL_HTON32(0x04000000))
-#define IB_PORT_CAP_RESV27 (CL_HTON32(0x08000000))
+#define IB_PORT_CAP_HAS_OTHER_LOCAL_CHANGES_NTC (CL_HTON32(0x04000000))
+#define IB_PORT_CAP_HAS_LINK_SPEED_WIDTH_PAIRS_TBL (CL_HTON32(0x08000000))
 #define IB_PORT_CAP_RESV28 (CL_HTON32(0x10000000))
 #define IB_PORT_CAP_RESV29 (CL_HTON32(0x20000000))
 #define IB_PORT_CAP_RESV30 (CL_HTON32(0x40000000))
diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c
index 73c0462..2ed0011 100644
--- a/opensm/opensm/osm_helper.c
+++ b/opensm/opensm/osm_helper.c
@@ -727,15 +727,15 @@ dbg_get_capabilities_str(IN char *p_buf,
 				&total_len) != IB_SUCCESS)
 			return;
 	}
-	if (p_pi->capability_mask & IB_PORT_CAP_RESV26) {
+	if (p_pi->capability_mask & IB_PORT_CAP_HAS_OTHER_LOCAL_CHANGES_NTC) {
 		if (dbg_do_line(&p_local, buf_size, p_prefix_str,
-				"IB_PORT_CAP_RESV26\n",
+				"IB_PORT_CAP_HAS_OTHER_LOCAL_CHANGES_NTC\n",
 				&total_len) != IB_SUCCESS)
 			return;
 	}
-	if (p_pi->capability_mask & IB_PORT_CAP_RESV27) {
+	if (p_pi->capability_mask & IB_PORT_CAP_HAS_LINK_SPEED_WIDTH_PAIRS_TBL) {
 		if (dbg_do_line(&p_local, buf_size, p_prefix_str,
-				"IB_PORT_CAP_RESV27\n",
+				"IB_PORT_CAP_HAS_LINK_SPEED_WIDTH_PAIRS_TBL\n",
 				&total_len) != IB_SUCCESS)
 			return;
 	}
-- 
1.6.0.3.517.g759a

From hal.rosenstock at gmail.com Sat Oct 25 08:27:21 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Sat, 25 Oct 2008 11:27:21 -0400
Subject: ***SPAM*** Re: [ofa-general] [PATCH] libibmad/dump: print more PortInfo:CapabilityMask bits
In-Reply-To: <20081025150509.GM28713@sashak.voltaire.com>
References: <20081025150509.GM28713@sashak.voltaire.com>
Message-ID:

On Sat, Oct 25, 2008 at 11:05 AM, Sasha Khapyorsky wrote:
>
> Support (show) new PortInfo:CapabilityMask bits -
> IsOtherLocalChangesNoticeSupported and
> IsLinkSpeedWidthPairsTabaleSupported.
>
> Signed-off-by: Sasha Khapyorsky
> ---
> libibmad/src/dump.c | 4 ++++
> 1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/libibmad/src/dump.c b/libibmad/src/dump.c
> index 4a780a7..de05d29 100644
> --- a/libibmad/src/dump.c
> +++ b/libibmad/src/dump.c
> @@ -529,6 +529,10 @@ mad_dump_portcapmask(char *buf, int bufsz, void *val, int valsz)
>  		s += sprintf(s, "\t\t\t\tIsLinkRoundTripLatencySupported\n");
>  	if (mask & (1 << 25))
>  		s += sprintf(s, "\t\t\t\tIsClientRegistrationSupported\n");
> +	if (mask & (1 << 26))
> +		s += sprintf(s, "\t\t\t\tIsOtherLocalChangesNoticeSupported\n");
> +	if (mask & (1 << 27))
> +		s += sprintf(s, "\t\t\t\tIsLinkSpeedWidthPairsTabaleSupported\n");
   		                                              ^^^^^^^ typo
>
>  	if (s != buf)
>  		*(--s) = 0;
> --
> 1.6.0.3.517.g759a
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From sashak at voltaire.com Sat Oct 25 08:52:18 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 25 Oct 2008 17:52:18 +0200
Subject: [ofa-general] [PATCH v2] libibmad/dump: print more PortInfo:CapabilityMask bits
In-Reply-To:
References: <20081025150509.GM28713@sashak.voltaire.com>
Message-ID: <20081025155218.GO28713@sashak.voltaire.com>

Support (show) new PortInfo:CapabilityMask bits -
IsOtherLocalChangesNoticeSupported and
IsLinkSpeedWidthPairsTabaleSupported.
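
The hunk in this patch appends each capability name with sprintf() into a buffer supplied by the caller. A bounded variant of the same idea for bits 26 and 27 might look like the sketch below. It is not the libibmad code: dump_new_cap_bits() is a hypothetical name, the mask is assumed to be in host byte order, and the second string uses the "Table" spelling from the v2 diff.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Write the names of capability bits 26 and 27 that are set in a
 * host-order mask, one name per line, never writing past bufsz bytes.
 * Returns the number of characters written, excluding the final NUL.
 * Illustrative only; not an existing libibmad function.
 */
static size_t dump_new_cap_bits(char *buf, size_t bufsz, uint32_t mask)
{
	static const struct {
		unsigned int bit;
		const char *name;
	} caps[] = {
		{ 26, "IsOtherLocalChangesNoticeSupported" },
		{ 27, "IsLinkSpeedWidthPairsTableSupported" },
	};
	size_t used = 0;
	size_t i;

	if (buf == NULL || bufsz == 0)
		return 0;
	buf[0] = '\0';

	for (i = 0; i < sizeof(caps) / sizeof(caps[0]); i++) {
		int n;

		if (!(mask & (UINT32_C(1) << caps[i].bit)))
			continue;
		n = snprintf(buf + used, bufsz - used, "%s\n", caps[i].name);
		if (n < 0 || (size_t)n >= bufsz - used) {
			buf[used] = '\0';	/* drop a partially written name */
			break;
		}
		used += (size_t)n;
	}
	return used;
}
```

Keeping the names in a small table means a misspelling such as the one Hal pointed out appears in exactly one place, and the bit number and its string stay side by side.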
Signed-off-by: Sasha Khapyorsky --- Change against original version - fix typo found by Hal. libibmad/src/dump.c | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/libibmad/src/dump.c b/libibmad/src/dump.c index 4a780a7..052127f 100644 --- a/libibmad/src/dump.c +++ b/libibmad/src/dump.c @@ -529,6 +529,10 @@ mad_dump_portcapmask(char *buf, int bufsz, void *val, int valsz) s += sprintf(s, "\t\t\t\tIsLinkRoundTripLatencySupported\n"); if (mask & (1 << 25)) s += sprintf(s, "\t\t\t\tIsClientRegistrationSupported\n"); + if (mask & (1 << 26)) + s += sprintf(s, "\t\t\t\tIsOtherLocalChangesNoticeSupported\n"); + if (mask & (1 << 27)) + s += sprintf(s, "\t\t\t\tIsLinkSpeedWidthPairsTableSupported\n"); if (s != buf) *(--s) = 0; -- 1.6.0.3.517.g759a From sashak at voltaire.com Sat Oct 25 10:52:01 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 25 Oct 2008 19:52:01 +0200 Subject: [ofa-general] [PATCH] opensm: osm_send_trap144() function Message-ID: <20081025175201.GP28713@sashak.voltaire.com> Add ability to send trap 144 - osm_send_trap144() function. This can be useful when SMA doesn't support trap sending on some events, such as CapabilityMask change (ConnectX), OtherLocalChanges (no one supports this AFAIK). Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_sm.h | 23 +++++++++++++ opensm/opensm/osm_req.c | 68 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 91 insertions(+), 0 deletions(-) diff --git a/opensm/include/opensm/osm_sm.h b/opensm/include/opensm/osm_sm.h index 5d46246..bc87ea6 100644 --- a/opensm/include/opensm/osm_sm.h +++ b/opensm/include/opensm/osm_sm.h @@ -769,5 +769,28 @@ ib_api_status_t osm_sm_state_mgr_check_legality(IN osm_sm_t *sm, void osm_report_sm_state(osm_sm_t *sm); +/****f* OpenSM: SM State Manager/osm_send_trap144 +* NAME +* osm_send_trap144 +* +* DESCRIPTION +* Send trap 144 to the master SM. 
+* +* SYNOPSIS +*/ +int osm_send_trap144(osm_sm_t *sm, ib_net16_t local); +/* +* PARAMETERS +* sm +* [in] Pointer to an osm_sm_t object. +* +* local +* [in] OtherLocalChanges mask in network byte order. +* +* RETURN VALUES +* 0 on success, non-zero value otherwise. +* +*********/ + END_C_DECLS #endif /* _OSM_SM_H_ */ diff --git a/opensm/opensm/osm_req.c b/opensm/opensm/osm_req.c index 5f93551..0865ce5 100644 --- a/opensm/opensm/osm_req.c +++ b/opensm/opensm/osm_req.c @@ -210,3 +210,71 @@ Exit: OSM_LOG_EXIT(sm->p_log); return (status); } + +int osm_send_trap144(osm_sm_t *sm, ib_net16_t local) +{ + osm_madw_t *madw; + ib_smp_t *smp; + ib_mad_notice_attr_t *ntc; + osm_port_t *port; + ib_port_info_t *pi; + + port = osm_get_port_by_guid(sm->p_subn, sm->p_subn->sm_port_guid); + if (!port) { + OSM_LOG(sm->p_log, OSM_LOG_ERROR, + "ERR 1104: cannot find SM port by guid 0x%" PRIx64 "\n", + cl_ntoh64(sm->p_subn->sm_port_guid)); + return -1; + } + + pi = &port->p_physp->port_info; + + /* don't bother with sending trap when SMA supports this */ + if (!local && + pi->capability_mask&(IB_PORT_CAP_HAS_TRAP|IB_PORT_CAP_HAS_CAP_NTC)) + return 0; + + madw = osm_mad_pool_get(sm->p_mad_pool, + osm_sm_mad_ctrl_get_bind_handle(&sm->mad_ctrl), + MAD_BLOCK_SIZE, NULL); + if (madw == NULL) { + OSM_LOG(sm->p_log, OSM_LOG_ERROR, + "ERR 1105: Unable to acquire MAD\n"); + return -1; + } + + madw->mad_addr.dest_lid = pi->master_sm_base_lid; + madw->mad_addr.addr_type.smi.source_lid = pi->base_lid; + madw->fail_msg = CL_DISP_MSGID_NONE; + + smp = osm_madw_get_smp_ptr(madw); + memset(smp, 0, sizeof(*smp)); + + smp->base_ver = 1; + smp->mgmt_class = IB_MCLASS_SUBN_LID; + smp->class_ver = 1; + smp->method = IB_MAD_METHOD_TRAP; + smp->trans_id = cl_hton64((uint64_t)cl_atomic_inc(&sm->sm_trans_id)); + smp->attr_id = IB_MAD_ATTR_NOTICE; + smp->m_key = sm->p_subn->opt.m_key; + + ntc = (ib_mad_notice_attr_t *)smp->data; + + ntc->generic_type = 0x80 | IB_NOTICE_TYPE_INFO; + ib_notice_set_prod_type_ho(ntc, 
IB_NODE_TYPE_CA); + ntc->g_or_v.generic.trap_num = cl_hton16(144); + ntc->issuer_lid = pi->base_lid; + ntc->data_details.ntc_144.lid = pi->base_lid; + ntc->data_details.ntc_144.local_changes = local ? + TRAP_144_MASK_OTHER_LOCAL_CHANGES : 0; + ntc->data_details.ntc_144.new_cap_mask = pi->capability_mask; + ntc->data_details.ntc_144.change_flgs = local; + + OSM_LOG(sm->p_log, OSM_LOG_DEBUG, + "Sending Trap 144, TID 0x%" PRIx64 " to SM lid %u\n", + cl_ntoh64(smp->trans_id), cl_ntoh16(pi->master_sm_base_lid)); + + osm_vl15_post(sm->p_vl15, madw); + + return 0; +} -- 1.6.0.3.517.g759a From sashak at voltaire.com Sat Oct 25 11:22:56 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 25 Oct 2008 20:22:56 +0200 Subject: [ofa-general] [PATCH] opensm: send trap144 to master SM when priority is raised In-Reply-To: <20081025175201.GP28713@sashak.voltaire.com> References: <20081025175201.GP28713@sashak.voltaire.com> Message-ID: <20081025182256.GQ28713@sashak.voltaire.com> When our SM is in Standby state and its priority is increased (via console command), notify master SM by sending Trap 144. This trap 144 extension is not in the IBA spec yet, so formally the feature is not IBA compliant yet. It is still pretty useful in some cases - for instance when Standby SM is started (with low priority) after Master and its port doesn't support traps (such as ConnectX), the master will never poll SMInfo there. In other cases this will speed up handover - Master SM may be unaware of Standby SM priority changes. 
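The rule described above (notify only when the priority goes up while we are in Standby) can be sketched in isolation. This is a minimal sketch, not the OpenSM code: struct toy_sm, the state enum and toy_send_trap144() are hypothetical stand-ins, and the "trap" is just a counter instead of a MAD posted on VL15.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the OpenSM types; not the real definitions. */
enum toy_sm_state { TOY_SM_STANDBY, TOY_SM_MASTER };

struct toy_sm {
	enum toy_sm_state state;
	uint8_t priority;
	unsigned traps_sent;	/* counts notifications instead of posting a MAD */
};

/* Stand-in for osm_send_trap144(); only records that a trap would be sent. */
static void toy_send_trap144(struct toy_sm *sm)
{
	sm->traps_sent++;
}

/* Store the new priority; notify the master only on a raise while in Standby. */
static void toy_set_sm_priority(struct toy_sm *sm, uint8_t priority)
{
	uint8_t old_pri = sm->priority;

	sm->priority = priority;

	if (old_pri < priority && sm->state == TOY_SM_STANDBY)
		toy_send_trap144(sm);
}
```

Lowering the priority, or changing it while Master, updates the stored value but sends nothing, which matches the intent stated in the description.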
Signed-off-by: Sasha Khapyorsky --- opensm/include/iba/ib_types.h | 1 + opensm/include/opensm/osm_sm.h | 2 ++ opensm/opensm/osm_console.c | 3 +-- opensm/opensm/osm_sm.c | 11 +++++++++++ 4 files changed, 15 insertions(+), 2 deletions(-) diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h index 257b19c..6412ea9 100644 --- a/opensm/include/iba/ib_types.h +++ b/opensm/include/iba/ib_types.h @@ -7215,6 +7215,7 @@ typedef struct _ib_mad_notice_attr // Total Size calc Accumulated * Trap 144 masks */ #define TRAP_144_MASK_OTHER_LOCAL_CHANGES 0x01 +#define TRAP_144_MASK_SM_PRIORITY_CHANGE (CL_HTON16(0x0008)) #define TRAP_144_MASK_LINK_SPEED_ENABLE_CHANGE (CL_HTON16(0x0004)) #define TRAP_144_MASK_LINK_WIDTH_ENABLE_CHANGE (CL_HTON16(0x0002)) #define TRAP_144_MASK_NODE_DESCRIPTION_CHANGE (CL_HTON16(0x0001)) diff --git a/opensm/include/opensm/osm_sm.h b/opensm/include/opensm/osm_sm.h index bc87ea6..ebe3dc3 100644 --- a/opensm/include/opensm/osm_sm.h +++ b/opensm/include/opensm/osm_sm.h @@ -792,5 +792,7 @@ int osm_send_trap144(osm_sm_t *sm, ib_net16_t local); * *********/ +void osm_set_sm_priority(osm_sm_t *sm, uint8_t priority); + END_C_DECLS #endif /* _OSM_SM_H_ */ diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c index e3a673a..18168ff 100644 --- a/opensm/opensm/osm_console.c +++ b/opensm/opensm/osm_console.c @@ -283,8 +283,7 @@ static void priority_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) priority); else { fprintf(out, "Setting sm-priority to %d\n", priority); - p_osm->subn.opt.sm_priority = (uint8_t) priority; - /* Does the SM state machine need a kick now ? 
*/ + osm_set_sm_priority(&p_osm->sm, (uint8_t)priority); } } } diff --git a/opensm/opensm/osm_sm.c b/opensm/opensm/osm_sm.c index dff5409..efebf4a 100644 --- a/opensm/opensm/osm_sm.c +++ b/opensm/opensm/osm_sm.c @@ -658,3 +658,14 @@ Exit: OSM_LOG_EXIT(p_sm->p_log); return (status); } + +void osm_set_sm_priority(osm_sm_t *sm, uint8_t priority) +{ + uint8_t old_pri = sm->p_subn->opt.sm_priority; + + sm->p_subn->opt.sm_priority = priority; + + if (old_pri < priority && + sm->p_subn->sm_state == IB_SMINFO_STATE_STANDBY) + osm_send_trap144(sm, TRAP_144_MASK_SM_PRIORITY_CHANGE); +} -- 1.6.0.3.517.g759a From sashak at voltaire.com Sat Oct 25 13:01:27 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 25 Oct 2008 22:01:27 +0200 Subject: [ofa-general] [PATCH] opensm: notify master SM with trap 144 (not finished) In-Reply-To: <20081025175201.GP28713@sashak.voltaire.com> References: <20081025175201.GP28713@sashak.voltaire.com> Message-ID: <20081025200127.GR28713@sashak.voltaire.com> When entering standby state (after discovery) notify master SM about us. When the SMA doesn't support trap sending (specifically trap 144 on PortInfo:CapabilityMask change - isSM bit, example is current ConnectX firmware - 2.5.0) this is the only way to notify the current master SM that another SM is running. See also bug#1183. 
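The "only send it ourselves when the SMA cannot" decision can be sketched as a small predicate. This is an illustration under assumptions: the TOY_CAP_* bit positions are made up for the example (the real IB_PORT_CAP_HAS_TRAP and IB_PORT_CAP_HAS_CAP_NTC values come from ib_types.h), and toy_should_send_trap144() is a hypothetical helper, not an OpenSM function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only; the real
 * IB_PORT_CAP_HAS_TRAP / IB_PORT_CAP_HAS_CAP_NTC values live in ib_types.h. */
#define TOY_CAP_HAS_TRAP	(1u << 0)
#define TOY_CAP_HAS_CAP_NTC	(1u << 1)

/* Returns 1 when the SM should build and post Trap 144 itself. */
static int toy_should_send_trap144(uint16_t local, uint32_t cap_mask)
{
	/* Nothing local to report and the SMA can send traps: leave it to the SMA. */
	if (!local && (cap_mask & (TOY_CAP_HAS_TRAP | TOY_CAP_HAS_CAP_NTC)))
		return 0;

	return 1;
}
```

With a zero "local" mask, as passed when entering standby, the trap is sent by the SM only if the port's capability mask advertises no trap support.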
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/osm_state_mgr.c | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index 174cee6..1576c42 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -1142,6 +1142,8 @@ _repeat_discovery: OSM_SM_SIGNAL_MASTER_OR_HIGHER_SM_DETECTED_DONE); osm_log_msg_box(sm->p_log, OSM_LOG_VERBOSE, __FUNCTION__, "ENTERING STANDBY STATE"); + /* notify master SM about us */ + osm_send_trap144(sm, 0); return; } -- 1.6.0.3.517.g759a From sashak at voltaire.com Sat Oct 25 13:03:14 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 25 Oct 2008 22:03:14 +0200 Subject: [ofa-general] Re: [PATCH] opensm: notify master SM with trap 144 In-Reply-To: <20081025200127.GR28713@sashak.voltaire.com> References: <20081025175201.GP28713@sashak.voltaire.com> <20081025200127.GR28713@sashak.voltaire.com> Message-ID: <20081025200314.GS28713@sashak.voltaire.com> On 22:01 Sat 25 Oct , Sasha Khapyorsky wrote: > > Subject: Re: [PATCH] opensm: notify master SM with trap 144 (not finished) Sorry, bad subject: "(not finished)" should not be here. Sasha From sashak at voltaire.com Sat Oct 25 13:04:53 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 25 Oct 2008 22:04:53 +0200 Subject: [ofa-general] [PATCH] opensm: hide function name with OSM_LOG_MSG_BOX() macro Message-ID: <20081025200453.GT28713@sashak.voltaire.com> The osm_log_msg_box() function gets the function name (automated with the __FUNCTION__ macro) as a parameter. Hide this with the OSM_LOG_MSG_BOX() macro - it is similar to OSM_LOG(). 
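The macro trick itself is generic C and can be shown without OpenSM. In this sketch toy_log_msg_box(), TOY_LOG_MSG_BOX() and toy_heavy_sweep() are hypothetical names, and the message is formatted into a static buffer instead of going through osm_log.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Last formatted message, kept in a buffer so the result can be inspected. */
static char toy_last_box[128];

/* The underlying function still takes the caller's name explicitly. */
static void toy_log_msg_box(const char *func_name, const char *msg)
{
	snprintf(toy_last_box, sizeof(toy_last_box), "%s: [ %s ]", func_name, msg);
}

/* The macro supplies __func__ so call sites pass only the message. */
#define TOY_LOG_MSG_BOX(msg) toy_log_msg_box(__func__, msg)

/* Example caller: __func__ expands to "toy_heavy_sweep" here. */
static const char *toy_heavy_sweep(void)
{
	TOY_LOG_MSG_BOX("HEAVY SWEEP COMPLETE");
	return toy_last_box;
}
```

Because __func__ is evaluated where the macro is expanded, each call site reports its own enclosing function without repeating the argument by hand.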
Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_log.h | 3 +++ opensm/opensm/osm_sm_state_mgr.c | 2 +- opensm/opensm/osm_state_mgr.c | 37 +++++++++++++++++-------------------- 3 files changed, 21 insertions(+), 21 deletions(-) diff --git a/opensm/include/opensm/osm_log.h b/opensm/include/opensm/osm_log.h index 741cef4..20999d9 100644 --- a/opensm/include/opensm/osm_log.h +++ b/opensm/include/opensm/osm_log.h @@ -395,6 +395,9 @@ extern void osm_log_raw(IN osm_log_t * const p_log, osm_log(log, level, "%s: " fmt, __func__, ##arg); \ } while (0) +#define OSM_LOG_MSG_BOX(log, level, msg) \ + osm_log_msg_box(log, level, __func__, msg) + #define DBG_CL_LOCK 0 #define CL_PLOCK_EXCL_ACQUIRE( __exp__ ) \ diff --git a/opensm/opensm/osm_sm_state_mgr.c b/opensm/opensm/osm_sm_state_mgr.c index 5736b8c..9f66cb4 100644 --- a/opensm/opensm/osm_sm_state_mgr.c +++ b/opensm/opensm/osm_sm_state_mgr.c @@ -70,7 +70,7 @@ void osm_report_sm_state(osm_sm_t * sm) osm_log(sm->p_log, OSM_LOG_SYS, "Entering %s state\n", state_str); snprintf(buf, sizeof(buf), "ENTERING SM %s STATE", state_str); - osm_log_msg_box(sm->p_log, OSM_LOG_VERBOSE, __FUNCTION__, buf); + OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE, buf); } /********************************************************************** diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index 1576c42..e548e5b 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -83,7 +83,7 @@ static void __osm_state_mgr_up_msg(IN const osm_sm_t * sm) osm_log(sm->p_log, sm->p_subn->first_time_master_sweep ? OSM_LOG_SYS : OSM_LOG_INFO, "SUBNET UP\n"); - osm_log_msg_box(sm->p_log, OSM_LOG_VERBOSE, __FUNCTION__, + OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE, sm->p_subn->opt.sweep_interval ? 
"SUBNET UP" : "SUBNET UP (sweep disabled)"); } @@ -214,7 +214,7 @@ static ib_api_status_t __osm_state_mgr_sweep_hop_0(IN osm_sm_t * sm) */ h_bind = osm_sm_mad_ctrl_get_bind_handle(&sm->mad_ctrl); if (h_bind != OSM_BIND_INVALID_HANDLE) { - osm_log_msg_box(sm->p_log, OSM_LOG_VERBOSE, __FUNCTION__, + OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE, "INITIATING HEAVY SWEEP"); /* * Start the sweep by clearing the port counts, then @@ -586,8 +586,7 @@ static ib_api_status_t __osm_state_mgr_light_sweep_start(IN osm_sm_t * sm) goto _exit; } - osm_log_msg_box(sm->p_log, OSM_LOG_VERBOSE, __FUNCTION__, - "INITIATING LIGHT SWEEP"); + OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE, "INITIATING LIGHT SWEEP"); CL_PLOCK_ACQUIRE(sm->p_lock); cl_qmap_apply_func(p_sw_tbl, __osm_state_mgr_get_sw_info, sm); CL_PLOCK_RELEASE(sm->p_lock); @@ -1052,8 +1051,8 @@ static void do_sweep(osm_sm_t * sm) if (wait_for_pending_transactions(&sm->p_subn->p_osm->stats)) return; if (!sm->p_subn->force_heavy_sweep) { - osm_log_msg_box(sm->p_log, OSM_LOG_VERBOSE, - __FUNCTION__, "LIGHT SWEEP COMPLETE"); + OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE, + "LIGHT SWEEP COMPLETE"); return; } } @@ -1087,8 +1086,8 @@ static void do_sweep(osm_sm_t * sm) return; if (!sm->p_subn->subnet_initialization_error) { - osm_log_msg_box(sm->p_log, OSM_LOG_VERBOSE, - __FUNCTION__, "REROUTE COMPLETE"); + OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE, + "REROUTE COMPLETE"); return; } } @@ -1116,8 +1115,7 @@ _repeat_discovery: if (__osm_state_mgr_is_sm_port_down(sm) == TRUE) { osm_log(sm->p_log, OSM_LOG_SYS, "SM port is down\n"); - osm_log_msg_box(sm->p_log, OSM_LOG_VERBOSE, __FUNCTION__, - "SM PORT DOWN"); + OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE, "SM PORT DOWN"); /* Run the drop manager - we want to clear all records */ osm_drop_mgr_process(sm); @@ -1140,7 +1138,7 @@ _repeat_discovery: */ osm_sm_state_mgr_process(sm, OSM_SM_SIGNAL_MASTER_OR_HIGHER_SM_DETECTED_DONE); - osm_log_msg_box(sm->p_log, OSM_LOG_VERBOSE, __FUNCTION__, + 
OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE, "ENTERING STANDBY STATE"); /* notify master SM about us */ osm_send_trap144(sm, 0); @@ -1151,8 +1149,7 @@ _repeat_discovery: if (sm->p_subn->force_heavy_sweep) goto _repeat_discovery; - osm_log_msg_box(sm->p_log, OSM_LOG_VERBOSE, __FUNCTION__, - "HEAVY SWEEP COMPLETE"); + OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE, "HEAVY SWEEP COMPLETE"); /* If we are MASTER - get the highest remote_sm, and * see if it is higher than our local sm. @@ -1214,7 +1211,7 @@ _repeat_discovery: if (wait_for_pending_transactions(&sm->p_subn->p_osm->stats)) return; - osm_log_msg_box(sm->p_log, OSM_LOG_VERBOSE, __FUNCTION__, + OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE, "SM LID ASSIGNMENT COMPLETE - STARTING SUBNET LID CONFIG"); __osm_state_mgr_notify_lid_change(sm); @@ -1228,7 +1225,7 @@ _repeat_discovery: * their destination. */ __osm_state_mgr_check_tbl_consistency(sm); - osm_log_msg_box(sm->p_log, OSM_LOG_VERBOSE, __FUNCTION__, + OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE, "LID ASSIGNMENT COMPLETE - STARTING SWITCH TABLE CONFIG"); /* @@ -1247,14 +1244,14 @@ _repeat_discovery: * take into account these lfts. 
*/ sm->p_subn->ignore_existing_lfts = FALSE; - osm_log_msg_box(sm->p_log, OSM_LOG_VERBOSE, __FUNCTION__, + OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE, "SWITCHES CONFIGURED FOR UNICAST"); if (!sm->p_subn->opt.disable_multicast) { osm_mcast_mgr_process(sm); if (wait_for_pending_transactions(&sm->p_subn->p_osm->stats)) return; - osm_log_msg_box(sm->p_log, OSM_LOG_VERBOSE, __FUNCTION__, + OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE, "SWITCHES CONFIGURED FOR MULTICAST"); } @@ -1270,14 +1267,14 @@ _repeat_discovery: if (wait_for_pending_transactions(&sm->p_subn->p_osm->stats)) return; - osm_log_msg_box(sm->p_log, OSM_LOG_VERBOSE, __FUNCTION__, + OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE, "LINKS PORTS CONFIGURED - SET LINKS TO ARMED STATE"); osm_link_mgr_process(sm, IB_LINK_ARMED); if (wait_for_pending_transactions(&sm->p_subn->p_osm->stats)) return; - osm_log_msg_box(sm->p_log, OSM_LOG_VERBOSE, __FUNCTION__, + OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE, "LINKS ARMED - SET LINKS TO ACTIVE STATE"); osm_link_mgr_process(sm, IB_LINK_ACTIVE); @@ -1300,7 +1297,7 @@ _repeat_discovery: if (sm->p_subn->subnet_initialization_error == TRUE) { osm_log(sm->p_log, OSM_LOG_SYS, "Errors during initialization\n"); - osm_log_msg_box(sm->p_log, OSM_LOG_ERROR, __FUNCTION__, + OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_ERROR, "ERRORS DURING INITIALIZATION"); } else { sm->p_subn->need_update = 0; -- 1.6.0.3.517.g759a From sashak at voltaire.com Sat Oct 25 13:05:22 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 25 Oct 2008 22:05:22 +0200 Subject: [ofa-general] [PATCH] opensm: rename sm signal Message-ID: <20081025200522.GU28713@sashak.voltaire.com> Rename sm signal OSM_SM_SIGNAL_MASTER_OR_HIGHER_SM_DETECTED_DONE to shorter OSM_SM_SIGNAL_MASTER_OR_HIGHER_SM_DETECTED to be consistent with other sm signal names. 
Signed-off-by: Sasha Khapyorsky --- opensm/include/opensm/osm_base.h | 2 +- opensm/opensm/osm_helper.c | 2 +- opensm/opensm/osm_sm_state_mgr.c | 4 ++-- opensm/opensm/osm_state_mgr.c | 2 +- 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h index 8e52ee8..54df41e 100644 --- a/opensm/include/opensm/osm_base.h +++ b/opensm/include/opensm/osm_base.h @@ -822,7 +822,7 @@ typedef enum _osm_sm_signal { OSM_SM_SIGNAL_HANDOVER_SENT, OSM_SM_SIGNAL_ACKNOWLEDGE, OSM_SM_SIGNAL_STANDBY, - OSM_SM_SIGNAL_MASTER_OR_HIGHER_SM_DETECTED_DONE, + OSM_SM_SIGNAL_MASTER_OR_HIGHER_SM_DETECTED, OSM_SM_SIGNAL_WAIT_FOR_HANDOVER, OSM_SM_SIGNAL_MAX } osm_sm_signal_t; diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c index 2ed0011..0443987 100644 --- a/opensm/opensm/osm_helper.c +++ b/opensm/opensm/osm_helper.c @@ -2303,7 +2303,7 @@ static const char *const __osm_sm_mgr_signal_str[] = { "OSM_SM_SIGNAL_HANDOVER_SENT", /* 7 */ "OSM_SM_SIGNAL_ACKNOWLEDGE", /* 8 */ "OSM_SM_SIGNAL_STANDBY", /* 9 */ - "OSM_SM_SIGNAL_MASTER_OR_HIGHER_SM_DETECTED_DONE", /* 10 */ + "OSM_SM_SIGNAL_MASTER_OR_HIGHER_SM_DETECTED", /* 10 */ "OSM_SM_SIGNAL_WAIT_FOR_HANDOVER", /* 11 */ "UNKNOWN STATE!!" 
/* 12 */ }; diff --git a/opensm/opensm/osm_sm_state_mgr.c b/opensm/opensm/osm_sm_state_mgr.c index 9f66cb4..343a9e3 100644 --- a/opensm/opensm/osm_sm_state_mgr.c +++ b/opensm/opensm/osm_sm_state_mgr.c @@ -280,7 +280,7 @@ ib_api_status_t osm_sm_state_mgr_process(osm_sm_t * sm, sm->p_subn->master_sm_base_lid = sm->p_subn->sm_base_lid; break; - case OSM_SM_SIGNAL_MASTER_OR_HIGHER_SM_DETECTED_DONE: + case OSM_SM_SIGNAL_MASTER_OR_HIGHER_SM_DETECTED: /* * Finished all discovery actions - move to STANDBY * start the polling @@ -484,7 +484,7 @@ ib_api_status_t osm_sm_state_mgr_check_legality(osm_sm_t * sm, case IB_SMINFO_STATE_DISCOVERING: switch (signal) { case OSM_SM_SIGNAL_DISCOVERY_COMPLETED: - case OSM_SM_SIGNAL_MASTER_OR_HIGHER_SM_DETECTED_DONE: + case OSM_SM_SIGNAL_MASTER_OR_HIGHER_SM_DETECTED: case OSM_SM_SIGNAL_HANDOVER: status = IB_SUCCESS; break; diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index e548e5b..ba3b6bf 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -1137,7 +1137,7 @@ _repeat_discovery: * MASTER_OR_HIGHER_SM_DETECTED_DONE */ osm_sm_state_mgr_process(sm, - OSM_SM_SIGNAL_MASTER_OR_HIGHER_SM_DETECTED_DONE); + OSM_SM_SIGNAL_MASTER_OR_HIGHER_SM_DETECTED); OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE, "ENTERING STANDBY STATE"); /* notify master SM about us */ -- 1.6.0.3.517.g759a From sashak at voltaire.com Sat Oct 25 14:04:04 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 25 Oct 2008 23:04:04 +0200 Subject: [ofa-general] [PATCH] opensm: sweep on SIGCONT Message-ID: <20081025210404.GV28713@sashak.voltaire.com> When OpenSM is suspended by SIGSTOP it can miss subnet changes. One known scenario is when another Standby SM becomes master because the suspended master SM is unresponsive; when the suspended SM continues to run (after SIGCONT) we have two master SMs in a subnet. To fix this we will schedule a heavy sweep when SIGCONT is received. 
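The mechanism can be sketched with plain POSIX signals. This is a minimal sketch, assuming a flag-based design like the one main.c uses for SIGHUP; toy_mark_resweep(), toy_install_resweep_handlers() and toy_take_resweep_request() are hypothetical names, and a real SM would check the flag from its main loop and schedule the sweep there.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set from the handler, consumed by the main loop. */
static volatile sig_atomic_t toy_resweep_requested;

/* Handler only marks the request; the actual work happens outside it. */
static void toy_mark_resweep(int signum)
{
	(void)signum;
	toy_resweep_requested = 1;
}

/* Route both SIGHUP and SIGCONT to the same "please resweep" handler. */
static int toy_install_resweep_handlers(void)
{
	struct sigaction act;

	memset(&act, 0, sizeof(act));
	sigemptyset(&act.sa_mask);
	act.sa_handler = toy_mark_resweep;

	if (sigaction(SIGHUP, &act, NULL) < 0)
		return -1;

	return sigaction(SIGCONT, &act, NULL);
}

/* Returns 1 once per pending request and clears it. */
static int toy_take_resweep_request(void)
{
	int pending = toy_resweep_requested;

	toy_resweep_requested = 0;
	return pending;
}
```

Reusing one handler for both signals keeps the resume path identical to an operator-requested resweep, which is the effect the one-line patch is after.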
Signed-off-by: Sasha Khapyorsky --- opensm/opensm/main.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c index fe4262b..53648d6 100644 --- a/opensm/opensm/main.c +++ b/opensm/opensm/main.c @@ -111,6 +111,7 @@ static void setup_signals() sigaction(SIGTERM, &act, NULL); act.sa_handler = mark_hup_flag; sigaction(SIGHUP, &act, NULL); + sigaction(SIGCONT, &act, NULL); #ifndef HAVE_OLD_LINUX_THREADS act.sa_handler = mark_usr1_flag; sigaction(SIGUSR1, &act, NULL); -- 1.6.0.3.517.g759a From chu11 at llnl.gov Sat Oct 25 13:37:50 2008 From: chu11 at llnl.gov (Al Chu) Date: Sat, 25 Oct 2008 16:37:50 -0400 Subject: [ofa-general] [opensm][trivial] fix documentation typos Message-ID: <1224967070.19083.8.camel@whatsup> Hey Sasha, Saw a few more of the invalid '=' in the examples. Al -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-fix-doc-typos.patch Type: application/mbox Size: 4378 bytes Desc: not available URL: From sashak at voltaire.com Sat Oct 25 17:13:30 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 26 Oct 2008 02:13:30 +0200 Subject: [ofa-general] Re: [opensm][trivial] fix documentation typos In-Reply-To: <1224967070.19083.8.camel@whatsup> References: <1224967070.19083.8.camel@whatsup> Message-ID: <20081026001330.GW28713@sashak.voltaire.com> On 16:37 Sat 25 Oct , Al Chu wrote: > Hey Sasha, > > Saw a few more of the invalid '=' in the examples. Applied. Thanks. 
Sasha From ogerlitz at voltaire.com Sun Oct 26 00:08:10 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 26 Oct 2008 09:08:10 +0200 Subject: [ofa-general] Re: [PATCH] maintainers: moderated mailing list In-Reply-To: References: <20081023183805.92bc39ab.randy.dunlap@oracle.com> <4901F249.1030206@nasa.gov> Message-ID: <4904175A.10705@voltaire.com> Roland Dreier wrote: > Would it make sense to move the hosting of the general list to vger.kernel.org, since they have a lot more resources/experience as mail admins and seem to be able to run open > but relatively spam-free lists? > Yes, sounds like a good idea, let's do that. Or. From ogerlitz at voltaire.com Sun Oct 26 00:19:59 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Sun, 26 Oct 2008 09:19:59 +0200 Subject: [ofa-general] Re: linux-next: Tree for October 23 In-Reply-To: <20081023180412.394d40c2.randy.dunlap@oracle.com> References: <20081023213637.eff9b414.sfr@canb.auug.org.au> <20081023180412.394d40c2.randy.dunlap@oracle.com> Message-ID: <49041A1F.3070806@voltaire.com> Randy Dunlap wrote: > Building with CONFIG_INFINIBAND=m, kconfig allows CONFIG_NET_9P_RDMA=m, > so one module wants symbols from the other (net/9p wants symbols from rdma_*). > > ERROR: "rdma_destroy_id" [net/9p/9pnet_rdma.ko] undefined! > ERROR: "rdma_connect" [net/9p/9pnet_rdma.ko] undefined! > ... > Is this supposed to be allowed/possible? Otherwise NET_9P_RDMA might have to depend on INFINIBAND=y... No, there's no need to configure INFINIBAND as built-in. What's the value of CONFIG_INFINIBAND_ADDR_TRANS ? Or. 
From amirv at mellanox.co.il Sun Oct 26 02:23:14 2008 From: amirv at mellanox.co.il (Amir Vadai) Date: Sun, 26 Oct 2008 11:23:14 +0200 Subject: [ofa-general] Bug with SDP on IA64 In-Reply-To: <48F891D2.4010502@ext.bull.net> References: <48F891D2.4010502@ext.bull.net> Message-ID: <49043702.4070806@mellanox.co.il> Hi, Please open a bug in https://bugs.openfabrics.org/ (make sure it is not a duplicate) I guess you have some endianness problem since ia64 is big endian and x86 is little endian. Try running the test on a stock Red Hat/SLES kernel. - Amir Nicolas Morey Chaisemartin wrote: > Hi, > > I am stuck with a bug from ofa-kernel 1.3.1 on an IA64 running a Bull > 2.6.18 kernel. > When doing SDP transfers from an IA64 to any other host (IA64, x86, > x86_64) through ttcp, I got this message: > > [root at h2 ~]# LD_PRELOAD=/usr/lib/libsdp.so.1 ~/ttcp/ttcp -t -s > 192.168.0.10 > ttcp-t: buflen=8192, nbuf=2048, align=16384/0, port=5001 tcp -> > 192.168.0.10 > ttcp-t: socket > ttcp-t: tcp_maxseg > ttcp-t: connect > ttcp-t: IO: Connection reset by peer > errno=104 > [root at h2 ~]# > > And the same error on the other side. > I activated the debug mode for the sdp module and found out that on the > receiver side a completion error 1 shows up: > Oct 16 12:40:43 s_kernel at yack0 kernel: sdp_sock(5001:36814): Recv > completion with error. Status 1 > Oct 16 12:40:43 s_kernel at yack0 kernel: sdp_sock(5001:36814): sdp_reset > state=1 > Oct 16 12:40:44 s_kernel at yack0 kernel: sdp_sock(5001:36814): > sdp_cma_handler event 10 id 0000010425120600 > Oct 16 12:40:44 s_kernel at yack0 kernel: sdp_sock(5001:36814): > RDMA_CM_EVENT_DISCONNECTED > > The error triggers a socket reset which terminates the connection. > According to the docs I could find, Status 1 is a local length error, > meaning the size written in the packet doesn't match the payload. > > I've noticed that with few packets (<= 100) or when ttcp is slowed > down (started through strace) transfers seem to work. 
> > I've tried to update to the latest ofa-kernel (1.4.1 from 10/16/2008) > and the bug is still there. > > Has anyone seen this problem before? What can I do to locate where > things go wrong? > > Regards > > Nicolas From eli at mellanox.co.il Sun Oct 26 02:32:53 2008 From: eli at mellanox.co.il (Eli Cohen) Date: Sun, 26 Oct 2008 11:32:53 +0200 Subject: [ofa-general] [PATCH] ib_core: Use weak ordering for data registered memory Message-ID: <20081026093253.GA11974@mtls03> Some architectures support weak ordering in which case better performance is possible. IB registered memory used for data can be weakly ordered because the completion queues' buffers are registered as strongly ordered. This will result in flushing all data related outstanding DMA requests by the HCA when a completion is DMAed to a completion queue buffer. This patch will allow weak ordering for data if ib_core is loaded with the module parameter, allow_weak_ordering, set to a non-zero value. Signed-off-by: Eli Cohen Signed-off-by: Arnd Bergmann --- Roland, this patch has a fix for a bug introduced while I recreated the patch from another one so please use this one. Also, are you going to push this to 2.6.28? 
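The attribute choice described above can be sketched outside the kernel. This is an illustration only: the TOY_ATTR_* bits and toy_umem_dma_attrs() are hypothetical (the kernel code uses struct dma_attrs and dma_set_attr()), and the sketch assumes, as the description suggests, that a dmasync request takes precedence over the weak-ordering module parameter.

```c
#include <assert.h>

/* Hypothetical attribute bits; the kernel uses struct dma_attrs instead. */
#define TOY_ATTR_WRITE_BARRIER	(1u << 0)
#define TOY_ATTR_WEAK_ORDERING	(1u << 1)

/*
 * Pick DMA attributes for a user memory region: buffers that need
 * synchronization (dmasync, e.g. completion queues) stay strongly ordered,
 * plain data buffers may be weakly ordered when the parameter allows it.
 */
static unsigned toy_umem_dma_attrs(int dmasync, int allow_weak_ordering)
{
	unsigned attrs = 0;

	if (dmasync)
		attrs |= TOY_ATTR_WRITE_BARRIER;
	else if (allow_weak_ordering)
		attrs |= TOY_ATTR_WEAK_ORDERING;

	return attrs;
}
```

With the parameter left at zero nothing changes for data buffers, so the default behaviour stays strongly ordered.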
drivers/infiniband/core/umem.c | 12 ++++++++++-- include/rdma/ib_umem.h | 2 ++ 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 6f7c096..da5e247 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -40,6 +40,10 @@ #include "uverbs.h" +static int allow_weak_ordering; +module_param(allow_weak_ordering, bool, 0444); +MODULE_PARM_DESC(allow_weak_ordering, "Allow weak ordering for data registered memory"); + #define IB_UMEM_MAX_PAGE_CHUNK \ ((PAGE_SIZE - offsetof(struct ib_umem_chunk, page_list)) / \ ((void *) &((struct ib_umem_chunk *) 0)->page_list[1] - \ @@ -51,8 +55,8 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d int i; list_for_each_entry_safe(chunk, tmp, &umem->chunk_list, list) { - ib_dma_unmap_sg(dev, chunk->page_list, - chunk->nents, DMA_BIDIRECTIONAL); + ib_dma_unmap_sg_attrs(dev, chunk->page_list, + chunk->nents, DMA_BIDIRECTIONAL, &chunk->attrs); for (i = 0; i < chunk->nents; ++i) { struct page *page = sg_page(&chunk->page_list[i]); @@ -91,6 +95,9 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, if (dmasync) dma_set_attr(DMA_ATTR_WRITE_BARRIER, &attrs); + else if (allow_weak_ordering) + dma_set_attr(DMA_ATTR_WEAK_ORDERING, &attrs); + if (!can_do_mlock()) return ERR_PTR(-EPERM); @@ -169,6 +176,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, goto out; } + chunk->attrs = attrs; chunk->nents = min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK); sg_init_table(chunk->page_list, chunk->nents); for (i = 0; i < chunk->nents; ++i) { diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h index 9ee0d2e..90f3712 100644 --- a/include/rdma/ib_umem.h +++ b/include/rdma/ib_umem.h @@ -36,6 +36,7 @@ #include #include #include +#include struct ib_ucontext; @@ -56,6 +57,7 @@ struct ib_umem_chunk { struct list_head list; int nents; int nmap; + struct dma_attrs attrs; 
struct scatterlist page_list[0]; }; -- 1.6.0.2 From vlad at mellanox.co.il Sun Oct 26 03:02:02 2008 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Sun, 26 Oct 2008 12:02:02 +0200 Subject: [ofa-general] [PATCH] IB/sysfs: Add port_xmit_wait counter. Message-ID: <20081026100202.GA15179@mellanox.co.il> Signed-off-by: Vladimir Sokolovsky --- drivers/infiniband/core/sysfs.c | 7 +++++++ 1 files changed, 7 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c index 4d10421..6b4c592 100644 --- a/drivers/infiniband/core/sysfs.c +++ b/drivers/infiniband/core/sysfs.c @@ -374,6 +374,12 @@ static PORT_PMA_ATTR(port_xmit_data , 12, 32, 192); static PORT_PMA_ATTR(port_rcv_data , 13, 32, 224); static PORT_PMA_ATTR(port_xmit_packets , 14, 32, 256); static PORT_PMA_ATTR(port_rcv_packets , 15, 32, 288); +/* + * There is no bit allocated for port_xmit_wait in the CounterSelect field + * (IB spec). However, since this bit is ignored when reading + * (show_pma_counter), the _counter field of port_xmit_wait can be set to zero. 
+ */ +static PORT_PMA_ATTR(port_xmit_wait , 0, 32, 320); static struct attribute *pma_attrs[] = { &port_pma_attr_symbol_error.attr.attr, @@ -392,6 +398,7 @@ static struct attribute *pma_attrs[] = { &port_pma_attr_port_rcv_data.attr.attr, &port_pma_attr_port_xmit_packets.attr.attr, &port_pma_attr_port_rcv_packets.attr.attr, + &port_pma_attr_port_xmit_wait.attr.attr, NULL }; -- 1.6.0.2.307.gc427 From vlad at lists.openfabrics.org Sun Oct 26 03:18:10 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 26 Oct 2008 03:18:10 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20081026-0200 daily build status Message-ID: <20081026101810.7F754E60C5E@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with 
linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From amirv at mellanox.co.il Sun Oct 26 03:27:23 2008 From: amirv at mellanox.co.il (Amir Vadai) Date: Sun, 26 Oct 2008 12:27:23 +0200 Subject: [ofa-general] [PATCH 1/1] sdp: Limit skb frag size to 64K-1 In-Reply-To: <> References: <> Message-ID: <1225016843-17971-1-git-send-email-amirv@mellanox.co.il> When 64K pages are in use, the skb_frag size can become larger than the skb_frag can address. An skb_frag's max size is 64K-1. This patch defines SDP_MAX_PAYLOAD as 64K - SDP_HEAD_SIZE. The patch changes sdp_post_recv() and sdp_sendmsg() to use the smaller of PAGE_SIZE or SDP_MAX_PAYLOAD as the segment size. This fixes the bug here: https://bugs.openfabrics.org/show_bug.cgi?id=1300 Signed-off-by: David Wilder Signed-off-by: Amir Vadai --- drivers/infiniband/ulp/sdp/sdp.h | 1 + drivers/infiniband/ulp/sdp/sdp_bcopy.c | 8 ++++---- drivers/infiniband/ulp/sdp/sdp_main.c | 4 ++++ 3 files changed, 9 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/ulp/sdp/sdp.h b/drivers/infiniband/ulp/sdp/sdp.h index 13cc42d..8638422 100644 --- a/drivers/infiniband/ulp/sdp/sdp.h +++ b/drivers/infiniband/ulp/sdp/sdp.h @@ -82,6 +82,7 @@ extern int sdp_data_debug_level; #define SDP_MAX_SEND_SKB_FRAGS (PAGE_SIZE > 0x8000 ? 
1 : 0x8000 / PAGE_SIZE) #define SDP_HEAD_SIZE (PAGE_SIZE / 2 + sizeof(struct sdp_bsdh)) #define SDP_NUM_WC 4 +#define SDP_MAX_PAYLOAD ((1 << 16) - SDP_HEAD_SIZE) #define SDP_MIN_ZCOPY_THRESH 1024 #define SDP_MAX_ZCOPY_THRESH 1048576 diff --git a/drivers/infiniband/ulp/sdp/sdp_bcopy.c b/drivers/infiniband/ulp/sdp/sdp_bcopy.c index 20f6a33..4677df0 100644 --- a/drivers/infiniband/ulp/sdp/sdp_bcopy.c +++ b/drivers/infiniband/ulp/sdp/sdp_bcopy.c @@ -322,11 +322,11 @@ static void sdp_post_recv(struct sdp_sock *ssk) frag = &skb_shinfo(skb)->frags[i]; frag->page = page; frag->page_offset = 0; - frag->size = PAGE_SIZE; + frag->size = min(PAGE_SIZE, SDP_MAX_PAYLOAD); ++skb_shinfo(skb)->nr_frags; - skb->len += PAGE_SIZE; - skb->data_len += PAGE_SIZE; - skb->truesize += PAGE_SIZE; + skb->len += frag->size; + skb->data_len += frag->size; + skb->truesize += frag->size; } rx_req = ssk->rx_ring + (id & (SDP_RX_SIZE - 1)); diff --git a/drivers/infiniband/ulp/sdp/sdp_main.c b/drivers/infiniband/ulp/sdp/sdp_main.c index dfbe724..32833cd 100644 --- a/drivers/infiniband/ulp/sdp/sdp_main.c +++ b/drivers/infiniband/ulp/sdp/sdp_main.c @@ -1637,6 +1637,10 @@ int sdp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, iov++; + /* Limmiting the size_goal is reqired when using 64K pages*/ + if (size_goal > SDP_MAX_PAYLOAD) + size_goal = SDP_MAX_PAYLOAD; + bz = sdp_bz_setup(ssk, from, seglen, size_goal); while (seglen > 0) { -- 1.5.5.GIT From tziporet at dev.mellanox.co.il Sun Oct 26 04:50:41 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Sun, 26 Oct 2008 13:50:41 +0200 Subject: [ofa-general] Where to get started? In-Reply-To: <20081023122319.yyhzkbtw0840g4sg@celticblues.com> References: <20081023122319.yyhzkbtw0840g4sg@celticblues.com> Message-ID: <49045991.4060404@mellanox.co.il> linux at celticblues.com wrote: > I am new, completely new, to the whole OpenFabrics thing. Can someone > point me to some reading material on all this... 
Something explaining > what is OpenSM, what is OpenIB, etc. How does all this stuff work > together... What is necessary to get a linux system up and running. > etc. I visited the OpenFabrics website, but did not find anything > like this. > > Ed > You can read some things on Mellanox web site of the OFED release: http://www.mellanox.com/products/ofed.php Tziporet From tziporet at mellanox.co.il Sun Oct 26 06:05:46 2008 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Sun, 26 Oct 2008 15:05:46 +0200 Subject: [ofa-general] OFED October 22 2008 meeting summary on OFED 1.4 status Message-ID: <5D49E7A8952DC44FB38C38FA0D758EADC72CAD@mtlexch01.mtl.com> OFED October 22 2008 meeting summary on OFED 1.4 status: Meeting minutes on the web: http://www.openfabrics.org/txt/documentation/linux/EWG_meeting_minutes/ Meeting Summary: ============== 1. All should review the normal severity bugs to make sure they are not critical for the GA 2. RC4 date decision will be done on Monday (Oct 27) 3. Reviewed Linux portion of the BOF at SC08 Details: ====== Agenda for OFED meeting today on OFED 1.4 status: 1. OFED 1.4 status: - RC3 was done on Monday Oct 20 - main changes: - Kernel base updated to 2.6.27 - NFS-RDMA is NOT enabled by default. To enable it one must choose it using custom installation, or add it to ofed.conf file. - Updated MPI packages: mvapich-1.1.0-3064, mvapich2-trunk-3073, openmpi-1.2.8-1 - Updated bonding package: ib-bonding-0.9.0-31 - Updated uDAPL: compat-dapl-1.2.11-1, dapl-2.0.14-1 - NFS-RDMA to work on RHEL 5.1 - OSM: Cached routing 2. Bugs review: 1283 blo jeremy.brown at qlogic.com Intel MPI fails on QLogic HCA - on work 1242 cri yannick.cote at qlogic.com kernel panic while running mpi2007 against ofed1.4 -- ib_... - should be fixed 1257 cri eli at mellanox.co.il Severe performance penalty for PCIe strict ordering - fix on work with Roland 1262 cri andy.grover at oracle.com congestion hang with RDS - ?? 
1282 cri amirv at mellanox.co.il Kernel panic during Netperf run - on work
1164 maj yosefe at voltaire.com iperf over IPoIB fails for 100 tcp connections - on work
1221 maj Jeffrey.C.Becker at nasa.gov SLES10 sp2: remote logins via ssh fail due to rpcbind and... - ??
1284 maj monis at voltaire.com Bonding - when eth bonding and IB bonding are configured,... - make sure it's fixed in RC3

3. Reviewed OFED BOF slides - Woody should send the new version for review in next meeting

Tziporet

From amirv at mellanox.co.il Sun Oct 26 06:37:23 2008
From: amirv at mellanox.co.il (Amir Vadai)
Date: Sun, 26 Oct 2008 15:37:23 +0200
Subject: [ofa-general] sdp: do nothing when getting FIN after IB teardown started
In-Reply-To: <>
References: <>
Message-ID: <1225028243-32490-1-git-send-email-amirv@mellanox.co.il>

Fixes bug at: https://bugs.openfabrics.org/show_bug.cgi?id=1302

Signed-off-by: Amir Vadai
---
 drivers/infiniband/ulp/sdp/sdp_bcopy.c | 5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/ulp/sdp/sdp_bcopy.c b/drivers/infiniband/ulp/sdp/sdp_bcopy.c
index 4677df0..d60b257 100644
--- a/drivers/infiniband/ulp/sdp/sdp_bcopy.c
+++ b/drivers/infiniband/ulp/sdp/sdp_bcopy.c
@@ -132,6 +132,11 @@ static void sdp_fin(struct sock *sk)
 		/* This is a mutual close situation and we've got the DREQ from
 		   the peer before the SDP_MID_DISCONNECT */
 		break;
+	case TCP_CLOSE:
+		/* FIN arrived after IB teardown started - do nothing */
+		sdp_dbg(sk, "%s: fin in state %s\n",
+				__func__, sdp_state_str(sk->sk_state));
+		return;
 	default:
 		sdp_warn(sk, "%s: FIN in unexpected state. sk->sk_state=%d\n",
 				__func__, sk->sk_state);
-- 
1.5.4.3

From weiyi.huang at gmail.com Sun Oct 26 08:05:42 2008
From: weiyi.huang at gmail.com (Huang Weiyi)
Date: Sun, 26 Oct 2008 23:05:42 +0800
Subject: [ofa-general] ***SPAM*** mlx4_en: remove duplicated #include
Message-ID: <2a61071b0810260805o17e78c73q38d133d83e2f5ea8@mail.gmail.com>

Removed duplicated #include in drivers/net/mlx4/en_main.c.
Signed-off-by: Huang Weiyi diff --git a/drivers/net/mlx4/en_main.c b/drivers/net/mlx4/en_main.c index 1b0eebf..4b9794e 100644 --- a/drivers/net/mlx4/en_main.c +++ b/drivers/net/mlx4/en_main.c @@ -35,7 +35,6 @@ #include #include #include -#include #include #include -------------- next part -------------- An HTML attachment was scrubbed... URL: From yevgenyp at mellanox.co.il Sun Oct 26 08:13:24 2008 From: yevgenyp at mellanox.co.il (Yevgeny Petrilin) Date: Sun, 26 Oct 2008 17:13:24 +0200 Subject: [ofa-general][PATCH] mlx4: Setting the correct offset for default mac address Message-ID: <49048914.1050000@mellanox.co.il> Signed-off-by: Yevgeny Petrilin --- drivers/net/mlx4/fw.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/net/mlx4/fw.c b/drivers/net/mlx4/fw.c index be09fdb..cee199c 100644 --- a/drivers/net/mlx4/fw.c +++ b/drivers/net/mlx4/fw.c @@ -360,9 +360,9 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap) #define QUERY_PORT_ETH_MTU_OFFSET 0x02 #define QUERY_PORT_WIDTH_OFFSET 0x06 #define QUERY_PORT_MAX_GID_PKEY_OFFSET 0x07 -#define QUERY_PORT_MAC_OFFSET 0x08 #define QUERY_PORT_MAX_MACVLAN_OFFSET 0x0a #define QUERY_PORT_MAX_VL_OFFSET 0x0b +#define QUERY_PORT_MAC_OFFSET 0x10 for (i = 1; i <= dev_cap->num_ports; ++i) { err = mlx4_cmd_box(dev, 0, mailbox->dma, i, 0, MLX4_CMD_QUERY_PORT, -- 1.5.4 From randy.dunlap at oracle.com Sun Oct 26 11:49:03 2008 From: randy.dunlap at oracle.com (Randy Dunlap) Date: Sun, 26 Oct 2008 11:49:03 -0700 Subject: [ofa-general] Re: linux-next: Tree for October 23 In-Reply-To: <49041A1F.3070806@voltaire.com> References: <20081023213637.eff9b414.sfr@canb.auug.org.au> <20081023180412.394d40c2.randy.dunlap@oracle.com> <49041A1F.3070806@voltaire.com> Message-ID: <20081026114903.05f88ca8.randy.dunlap@oracle.com> On Sun, 26 Oct 2008 09:19:59 +0200 Or Gerlitz wrote: > Randy Dunlap wrote: > > Building with CONFIG_INFINIBAND=m, kconfig allows CONFIG_NET_9P_RDMA=m, > > so one module wants 
symbols from the other (net/9p wants symbols from rdma_*).
> >
> > ERROR: "rdma_destroy_id" [net/9p/9pnet_rdma.ko] undefined!
> > ERROR: "rdma_connect" [net/9p/9pnet_rdma.ko] undefined!
> > ...
> > Is this supposed to be allowed/possible? Otherwise NET_9P_RDMA might have to depend on INFINIBAND=y...

> No, there's no need to config INFINIBAND as built-in. What's the value
> of CONFIG_INFINIBAND_ADDR_TRANS ?

That should teach me to include the .config file.
However, it's not difficult to recreate.

CONFIG_INFINIBAND_ADDR_TRANS=n because it depends on INFINIBAND && INET, (INET being TCP/IP) which =n.

NET_9P_RDMA depends on NET && NET_9P && INFINIBAND && EXPERIMENTAL, but this problem config has NET=y and INET=n. And NET != INET.

So INFINIBAND_ADDR_TRANS could depend on NET instead of INET (maybe; I don't know what interfaces it really needs) or NET_9P_RDMA could depend on INET && . But I don't know which change makes the most sense, or if some other change does.

HTH. config attached.

---
~Randy
-------------- next part --------------
A non-text attachment was scrubbed...
Name: config-rdma
Type: application/octet-stream
Size: 46227 bytes
Desc: not available
URL: 

From amirv at mellanox.co.il Sun Oct 26 23:55:40 2008
From: amirv at mellanox.co.il (Amir Vadai)
Date: Mon, 27 Oct 2008 08:55:40 +0200
Subject: [ofa-general] Bug with SDP on IA64
In-Reply-To: <49043702.4070806@mellanox.co.il>
References: <48F891D2.4010502@ext.bull.net> <49043702.4070806@mellanox.co.il>
Message-ID: <490565EC.7000507@mellanox.co.il>

I asked our IB expert Jack for hints and he told me this:

From Section 11.6.2 (COMPLETION RETURN STATUS) of the IB Spec volume 1, revision 1.2.1

* Local Length Error - ... Generated for a Work Request posted to the local Receive Queue when the sum of the Data Segment lengths is too small to receive a valid incoming message or the length of the incoming message is greater than the maximum message size supported by the HCA port that received the message.
There seem to be 2 possibilities: 1. The receiver did not post enough/large-enough scatter gather entries in the receive queue. or 2. The sender sent a 0-length packet, but did so incorrectly. (if any of the s/g entries (i.e., data segment entries) have a zero byte count, this results in 2 GigaBytes of data being sent over the wire). I note that SDP does not check for this (see sdp_post_send() in file sdp_bcopy.c: the sge->length field is not checked for zero length). Regarding how to debug this, you need to talk with an sdp expert to see if sdp may try to send 0-length packets under stress ([Amir]: I can help you with this). This is NOT an endianness problem -- it occurs also when he tries to send between ia64 hosts: "> When doing SDP transfers from an IA64 to any other host (IA64, x86, > x86_64) through ttcp, I got this message:" - Amir Amir Vadai wrote: > Hi, > > > Please open a bug in https://bugs.openfabrics.org/ (make sure it is not > a duplicate) > > I guess you have some endianess problem since ia64 is big endian and x86 is little endian. > > Try running the test on a stock Redhat/SLES kernel. > > - Amir > > > Nicolas Morey Chaisemartin wrote: > > >> Hi, >> >> I am stuck with a bug from ofa-kernel 1.3.1 on an IA64 running a Bull >> 2.6.18 kernel. >> When doing SDP transfers from an IA64 to any other host (IA64, x86, >> x86_64) through ttcp, I got this message: >> >> [root at h2 ~]# LD_PRELOAD=/usr/lib/libsdp.so.1 ~/ttcp/ttcp -t -s >> 192.168.0.10 >> ttcp-t: buflen=8192, nbuf=2048, align=16384/0, port=5001 tcp -> >> 192.168.0.10 >> ttcp-t: socket >> ttcp-t: tcp_maxseg >> ttcp-t: connect >> ttcp-t: IO: Connection reset by peer >> errno=104 >> [root at h2 ~]# >> >> And the same error on the other side. >> I activated the debug mode for sdp module and found out than on the >> receiver side a completion error 1 shows up: >> Oct 16 12:40:43 s_kernel at yack0 kernel: sdp_sock(5001:36814): Recv >> completion with error. 
Status 1 >> Oct 16 12:40:43 s_kernel at yack0 kernel: sdp_sock(5001:36814): sdp_reset >> state=1 >> Oct 16 12:40:44 s_kernel at yack0 kernel: sdp_sock(5001:36814): >> sdp_cma_handler event 10 id 0000010425120600 >> Oct 16 12:40:44 s_kernel at yack0 kernel: sdp_sock(5001:36814): >> RDMA_CM_EVENT_DISCONNECTED >> >> The error triggers a socket reset which terminates the connection. >> According to the docs I could find, Status 1 is a local length error, >> meaning the size written in the packet doesn't match the payload. >> >> I've noticed that with few packets (<= 100) or when ttcp is slowed >> down (started through strace) transfers seem to work. >> >> I've tried to update to the latest ofa-kernel (1.4.1 from 10/16/2008) >> and the bug is still there. >> >> Has anyone seen this problem before? What can I do to locate where >> things go wrong? >> >> Regards >> >> Nicolas >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general >> > > > From celine.bourde at ext.bull.net Mon Oct 27 01:07:24 2008 From: celine.bourde at ext.bull.net (Celine Bourde) Date: Mon, 27 Oct 2008 09:07:24 +0100 Subject: [ofa-general][perftest]service level implementation Message-ID: <490576BC.30909@ext.bull.net> Hi, We are testing QoS. We have defined service level rules in opensm and implemented qos-policy. Implementation wasn't fully done in perftest tools. So, I've implemented pertest tools by adding -L option to set service levels. I took OFED 1.3 git version. The lastest commit is : commit 6321b5468f7293088cc003809049c02b176130d8 Author: Oren Meron Date: Tue Apr 1 08:22:48 2008 +0000 The patch is below --- Celine Bourde. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: service_level.patch
Type: text/x-diff
Size: 23401 bytes
Desc: not available
URL: 

From tziporet at dev.mellanox.co.il Mon Oct 27 01:17:23 2008
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 27 Oct 2008 10:17:23 +0200
Subject: [ofa-general][perftest]service level implementation
In-Reply-To: <490576BC.30909@ext.bull.net>
References: <490576BC.30909@ext.bull.net>
Message-ID: <49057913.4060304@mellanox.co.il>

Oren is the perftest maintainer.

Tziporet

Celine Bourde wrote:
> Hi,
>
> We are testing QoS.
> We have defined service level rules in opensm and implemented qos-policy.
> Implementation wasn't fully done in perftest tools.
>
> So, I've implemented pertest tools by adding -L option to set service
> levels.
>
> I took OFED 1.3 git version. The lastest commit is :
> commit 6321b5468f7293088cc003809049c02b176130d8
> Author: Oren Meron
> Date: Tue Apr 1 08:22:48 2008 +0000
>
>
> The patch is below
>
> ---
>
> Celine Bourde.

From ogerlitz at voltaire.com Mon Oct 27 01:27:58 2008
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 27 Oct 2008 10:27:58 +0200
Subject: [ofa-general][perftest]service level implementation
In-Reply-To: <490576BC.30909@ext.bull.net>
References: <490576BC.30909@ext.bull.net>
Message-ID: <49057B8E.7000102@voltaire.com>

Celine Bourde wrote:
> We are testing QoS. We have defined service level rules in opensm and
> implemented qos-policy. Implementation wasn't fully done in perftest
> tools. So, I've implemented pertest tools by adding -L option to set
> service levels.

I found qperf to be very useful for QoS testing; it has an -sl option.

Or.
From nicolas.morey-chaisemartin at ext.bull.net Mon Oct 27 02:09:33 2008 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Mon, 27 Oct 2008 10:09:33 +0100 Subject: [ofa-general] Bug with SDP on IA64 In-Reply-To: <490565EC.7000507@mellanox.co.il> References: <48F891D2.4010502@ext.bull.net> <49043702.4070806@mellanox.co.il> <490565EC.7000507@mellanox.co.il> Message-ID: <4905854D.30607@ext.bull.net> Amir Vadai a écrit : > I asked our IB expert Jack for hints and he told me this: > > > >From Section 11.6.2 (COMPLETION RETURN STATUS0 of the IB Spec volume 1, revision 1.2.1 > * Local Length Error - ... Generated for a > Work Request posted to the local Receive Queue when the sum of > the Data Segment lengths is too small to receive a valid incoming > message or the length of the incoming message is greater than the > maximum message size supported by the HCA port that received the > message. > > > There seem to be 2 possibilities: > 1. The receiver did not post enough/large-enough scatter gather entries in > the receive queue. > > > or > 2. The sender sent a 0-length packet, but did so incorrectly. > (if any of the s/g entries (i.e., data segment entries) have a zero > byte count, this results in 2 GigaBytes of data being sent over the wire). > > > I note that SDP does not check for this (see sdp_post_send() in file sdp_bcopy.c: > the sge->length field is not checked for zero length). > > > Regarding how to debug this, you need to talk with an sdp expert to see if sdp may try > to send 0-length packets under stress ([Amir]: I can help you with this). > > I've just run a few more tests. I added a test in sdp_post_send to check to sge->length field: if(sge->length == 0){printk(KERN_ERR "SDP sending 0bytes packet\n");} In the case of IA64-> IA64 transfer (it is in fact on the same server), the message shows up in the syslog just before the connection crashes. 
However on IA64->x86_64 transfer, it doesn't show up, so I doubt it comes from here.

I also doubt it comes from the buffer on the receiving end, as sdp transfers fail from IA64 to x86 but they are successful on x86 to x86, and on RDMA transfer (using perftest tools), x86 to x86 transfers have shown higher performance due to a better PCI bus.

I tried to follow the packet/frag size in sdp_post_send (sdp_bcopy.c) and it appears there are packets over 4k going through:
Oct 27 09:05:03 s_kernel at h2 kernel: SDP sending 30720 bytes packet on frag 0

Isn't a packet size supposed to be <= the MTU at this point?
I added the same line on x86_64 and all fragments have size <= 4096, so my guess is there is a problem there on IA64.

Nicolas Morey-Chaisemartin

From philippe.gregoire at cea.fr Mon Oct 27 02:40:17 2008
From: philippe.gregoire at cea.fr (Philippe Gregoire)
Date: Mon, 27 Oct 2008 10:40:17 +0100
Subject: [ofa-general] opensm as service - cfg files
In-Reply-To: <1224786733.1197.398.camel@cardanus.llnl.gov>
References: <48FF22FC.6000606@dev.mellanox.co.il> <490073C0.70109@cea.fr> <1224786733.1197.398.camel@cardanus.llnl.gov>
Message-ID: <49058C81.6000007@cea.fr>

Al Chu wrote:
> On Thu, 2008-10-23 at 14:53 +0200, Philippe Gregoire wrote:
>
>> Hi Yevgeny,
>>
>> Is it possible to write this service so it will be able to manage multiple instances of opensm on the same node, I mean start and stop all instances at the same time or separately.
>> This will be very usefull when you have several Infiniband storage devices connected directly to one node,
>> so you have to run several opensm -g guid processes on this node.
>>
>> It is authorized to have a service that understand parameters like:
>> service start 0x8000010232
>> or
>> service start ddn12.conf
>>
>
> This doesn't sound like that bad of idea, although "what does the user
> expect" is a concern. My co-worker brought up the simple issue of the
> log files.
Do you automatically pick a different log file to store to, > or does it store to the same log, or is it the user's responsibility to > pick a reasonable different log file name in the .conf file? I have no > idea what other daemons/init scripts do. > > Al > > init scripts generally execute/source some configuration file located in /etc/sysconfig/ to set some variables used in the script. These variables can be used to distinguish pid filename and log filename for different opensm instances. If these variables are not defined in the conf file, they should be build from the parameter value e.g : opensm.log.ddn12 or opensm.pid.ddn12 >> Philippe Gregoire >> CEA/DAM. >> >> Yevgeny Kliteynik a écrit : >> >>> Hi Sasha, >>> >>> I was just trying to put some order in my head regarding >>> the use of opensm as service, and I have couple of questions. >>> Some of them might be dumb, so please bear with me... :) >>> >>> 1. OpenSM config file. >>> Do we still need opensm/scripts/opensm.conf? >>> I think it's not used any more. >>> >>> 2. From opensm/scripts/opensm.init.in: >>> @sbindir@/opensm -B $OPTIONS > /dev/null >>> Is someone setting the $OPTIONS variable? I think it was >>> set in the config file in the past, but not now. >>> >>> 3. From opensm/scripts/redhat-opensm.init.in: >>> CONFIG=@sysconfdir@/sysconfig/opensm.conf >>> if [ -f $CONFIG ]; then >>> . $CONFIG >>> fi >>> >>> From opensm/scripts/opensm.init.in: >>> if [[ -s /etc/sysconfig/opensm ]]; then >>> . /etc/sysconfig/opensm >>> fi >>> >>> If it's not some naming convention, perhaps we should use >>> opensm.conf in both cases? >>> >>> 4. Logrotate: >>> opensm/scripts/opensm.spec.in installs logrotate file as follows: >>> install -m 644 scripts/opensm.logrotate $etc/logrotate.d/opensm >>> I may be off here, but should the installed file name be opensmd >>> to match the service name? 
>>> >>> -- Yevgeny >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit >>> http:// openib.org/mailman/listinfo/openib-general >>> >>> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http:// openib.org/mailman/listinfo/openib-general >> >> From vlad at lists.openfabrics.org Mon Oct 27 03:29:34 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 27 Oct 2008 03:29:34 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20081027-0200 daily build status Message-ID: <20081027102934.72606E609D0@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 
with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From nicolas.morey-chaisemartin at ext.bull.net Mon Oct 27 03:43:40 2008 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Mon, 27 Oct 2008 11:43:40 +0100 Subject: [ofa-general] Bug with SDP on IA64 In-Reply-To: <490565EC.7000507@mellanox.co.il> References: <48F891D2.4010502@ext.bull.net> <49043702.4070806@mellanox.co.il> <490565EC.7000507@mellanox.co.il> Message-ID: <49059B5C.40605@ext.bull.net> Amir Vadai a écrit : > I asked our IB expert Jack for hints and he told me this: > > > >From Section 11.6.2 (COMPLETION RETURN STATUS0 of the IB Spec volume 1, revision 1.2.1 > * Local Length Error - ... Generated for a > Work Request posted to the local Receive Queue when the sum of > the Data Segment lengths is too small to receive a valid incoming > message or the length of the incoming message is greater than the > maximum message size supported by the HCA port that received the > message. > > > There seem to be 2 possibilities: > 1. The receiver did not post enough/large-enough scatter gather entries in > the receive queue. > > > or > 2. The sender sent a 0-length packet, but did so incorrectly. 
> (if any of the s/g entries (i.e., data segment entries) have a zero > byte count, this results in 2 GigaBytes of data being sent over the wire). > > > I note that SDP does not check for this (see sdp_post_send() in file sdp_bcopy.c: > the sge->length field is not checked for zero length). > > I think I got it. In sdp_cma.c/sdp_response_handler, the fragment size is retrieved through sdp_sk(sk)->xmit_size_goal = ntohl(h->actrcvsz) - sizeof(struct sdp_bsdh); The dmesg messages shows : sdp_sock(41820:0): sdp_response_handler bufs 64 xmit_size_goal 34816 send trigger 16 I forced this value to 2048 and then it works. On Xeon this size is 2048 by default. In my understanding the xmit_size_goal is the size of the receiving buffer for buffered copies, isn't it? So it shouldn't really matters as long as the packet is properly split at the MTu size to be sent over the network, right? Could it be only working from x86/x86_64 working because the buffer size is smaller than the MTU? Nicolas From dotanba at gmail.com Mon Oct 27 05:08:25 2008 From: dotanba at gmail.com (Dotan Barak) Date: Mon, 27 Oct 2008 14:08:25 +0200 Subject: [ofa-general] Bug with SDP on IA64 In-Reply-To: <4905854D.30607@ext.bull.net> References: <48F891D2.4010502@ext.bull.net> <49043702.4070806@mellanox.co.il> <490565EC.7000507@mellanox.co.il> <4905854D.30607@ext.bull.net> Message-ID: <2f3bf9a60810270508q6919e145r532030f7a226a194@mail.gmail.com> On Mon, Oct 27, 2008 at 11:09 AM, Nicolas Morey Chaisemartin wrote: > Amir Vadai a écrit : >> >> I asked our IB expert Jack for hints and he told me this: >> >> >> >From Section 11.6.2 (COMPLETION RETURN STATUS0 of the IB Spec volume 1, >> revision 1.2.1 >> * Local Length Error - ... 
Generated for a
>> Work Request posted to the local Receive Queue when the sum of
>> the Data Segment lengths is too small to receive a valid incoming
>> message or the length of the incoming message is greater than the
>> maximum message size supported by the HCA port that received the
>> message.
>>
>>
>> There seem to be 2 possibilities:
>> 1. The receiver did not post enough/large-enough scatter gather entries in
>> the receive queue.
>>
>>
>> or 2. The sender sent a 0-length packet, but did so incorrectly.
>> (if any of the s/g entries (i.e., data segment entries) have a zero
>> byte count, this results in 2 GigaBytes of data being sent over the
>> wire).
>>
>>
>> I note that SDP does not check for this (see sdp_post_send() in file
>> sdp_bcopy.c:
>> the sge->length field is not checked for zero length).
>>
>>
>> Regarding how to debug this, you need to talk with an sdp expert to see if
>> sdp may try
>> to send 0-length packets under stress ([Amir]: I can help you with this).
>>
>>
>
> I've just run a few more tests.
> I added a test in sdp_post_send to check to sge->length field:
> if(sge->length == 0){printk(KERN_ERR "SDP sending 0bytes packet\n");}

Please note: an sge->length of 0 means that you send 2GB, not 0 bytes.
If you want to send 0 bytes, the sg_list should be empty (0 entries).

This is why you have a length violation ...
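
As a rough illustration of the point above, here is a minimal standalone C sketch, not kernel code. The struct sge_sketch and the helpers sge_wire_bytes() and sg_list_is_sane() are hypothetical names introduced only for this example; they are not part of SDP or of the verbs API. It assumes, as stated in this thread, that a data segment length of 0 is treated as 2GB (2^31 bytes), so a 0-byte message has to be described by an empty list.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one scatter/gather entry (illustration only). */
struct sge_sketch {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/*
 * Bytes that would be transferred for one entry, assuming a length
 * field of 0 is interpreted as 2^31 bytes, as described in the thread.
 */
static uint64_t sge_wire_bytes(const struct sge_sketch *sge)
{
	return sge->length == 0 ? (UINT64_C(1) << 31) : (uint64_t)sge->length;
}

/*
 * Returns 1 when no entry has a zero length field, 0 otherwise.
 * An empty list (num_sge == 0) is accepted: that is how a 0-byte
 * message is described.
 */
static int sg_list_is_sane(const struct sge_sketch *list, size_t num_sge)
{
	size_t i;

	for (i = 0; i < num_sge; i++)
		if (list[i].length == 0)
			return 0;
	return 1;
}
```

A check along these lines, done before posting the work request, would report the problem at the sender rather than leaving it to show up as a length error at the receiver.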
Dotan From nicolas.morey-chaisemartin at ext.bull.net Mon Oct 27 05:32:04 2008 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Mon, 27 Oct 2008 13:32:04 +0100 Subject: [ofa-general] Bug with SDP on IA64 In-Reply-To: <2f3bf9a60810270508q6919e145r532030f7a226a194@mail.gmail.com> References: <48F891D2.4010502@ext.bull.net> <49043702.4070806@mellanox.co.il> <490565EC.7000507@mellanox.co.il> <4905854D.30607@ext.bull.net> <2f3bf9a60810270508q6919e145r532030f7a226a194@mail.gmail.com> Message-ID: <4905B4C4.6090606@ext.bull.net> Dotan Barak a écrit : > On Mon, Oct 27, 2008 at 11:09 AM, Nicolas Morey Chaisemartin > wrote: > >> Amir Vadai a écrit : >> >>> I asked our IB expert Jack for hints and he told me this: >>> >>> >>> >From Section 11.6.2 (COMPLETION RETURN STATUS0 of the IB Spec volume 1, >>> revision 1.2.1 >>> * Local Length Error - ... Generated for a >>> Work Request posted to the local Receive Queue when the sum of >>> the Data Segment lengths is too small to receive a valid incoming >>> message or the length of the incoming message is greater than the >>> maximum message size supported by the HCA port that received the >>> message. >>> >>> >>> There seem to be 2 possibilities: >>> 1. The receiver did not post enough/large-enough scatter gather entries in >>> the receive queue. >>> >>> >>> or 2. The sender sent a 0-length packet, but did so incorrectly. >>> (if any of the s/g entries (i.e., data segment entries) have a zero >>> byte count, this results in 2 GigaBytes of data being sent over the >>> wire). >>> >>> >>> I note that SDP does not check for this (see sdp_post_send() in file >>> sdp_bcopy.c: >>> the sge->length field is not checked for zero length). >>> >>> >>> Regarding how to debug this, you need to talk with an sdp expert to see if >>> sdp may try >>> to send 0-length packets under stress ([Amir]: I can help you with this). >>> >>> >>> >> I've just run a few more tests. 
>> I added a test in sdp_post_send to check to sge->length field: >> if(sge->length == 0){printk(KERN_ERR "SDP sending 0bytes packet\n");} >> > > Please pay attension: sge->length of 0 means that you send 2GB and not 0 bytes. > If you want to send 0 bytes, the sg_list should be empty (0 entries). > > This is why you have a length violation ... > > > Dotan > > This is just a debug message. And I only have a 0 sge->length in the case of IA64 to IA64 transfer. When transferring to IA64 to x86_64 I don't have any problem with this. As I said in my last message, the problem seems to be linked to the bcopy buffer size. Nicolas From kliteyn at dev.mellanox.co.il Mon Oct 27 05:46:09 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 27 Oct 2008 14:46:09 +0200 Subject: [ofa-general] opensm as service - cfg files In-Reply-To: <49058C81.6000007@cea.fr> References: <48FF22FC.6000606@dev.mellanox.co.il> <490073C0.70109@cea.fr> <1224786733.1197.398.camel@cardanus.llnl.gov> <49058C81.6000007@cea.fr> Message-ID: <4905B811.6020601@dev.mellanox.co.il> Philippe Gregoire wrote: > Al Chu a écrit : >> On Thu, 2008-10-23 at 14:53 +0200, Philippe Gregoire wrote: >> >>> Hi Yevgeny, >>> >>> Is it possible to write this service so it will be able to manage >>> multiple instances of opensm on the same node, I mean start and stop >>> all instances at the same time or separately. >>> This will be very usefull when you have several Infiniband storage >>> devices connected directly to one node, >>> so you have to run several opensm -g guid processes on this node. >>> >>> It is authorized to have a service that understand parameters like: >>> service start 0x8000010232 >>> or service start ddn12.conf >>> >> >> This doesn't sound like that bad of idea, although "what does the user >> expect" is a concern. My co-worker brought up the simple issue of the >> log files. 
Do you automatically pick a different log file to store to, >> or does it store to the same log, or is it the user's responsibility to >> pick a reasonable different log file name in the .conf file? I have no >> idea what other daemons/init scripts do. >> >> Al >> >> > > init scripts generally execute/source some configuration file located in > /etc/sysconfig/ to set some variables used in the script. These > variables can be used to distinguish pid filename and log filename for > different opensm instances. If these variables are not defined in the > conf file, they should be build from the parameter value e.g : > opensm.log.ddn12 or opensm.pid.ddn12 It is possible to make init script understand parameters, so that you will be able to run "opensmd start -guid " or "opensmd start -conf ", but I think that there will be some problems monitoring these opensm instances once they started. For instance, how would you run "opensmd status" on a specific opensm instance? Other approach is to have some variable in the conf file, e.g. GUIDS="guid1 guid2 ...", and then in the init script it will iterate through all the guids and run opensm instances for all of them, but then you'll be able to manage these processes together, not one by one. For instance, "opensmd stop" will kill all the opensm processes. -- Yevgeny >>> Philippe Gregoire >>> CEA/DAM. >>> >>> Yevgeny Kliteynik a écrit : >>> >>>> Hi Sasha, >>>> >>>> I was just trying to put some order in my head regarding >>>> the use of opensm as service, and I have couple of questions. >>>> Some of them might be dumb, so please bear with me... :) >>>> >>>> 1. OpenSM config file. >>>> Do we still need opensm/scripts/opensm.conf? >>>> I think it's not used any more. >>>> >>>> 2. From opensm/scripts/opensm.init.in: >>>> @sbindir@/opensm -B $OPTIONS > /dev/null >>>> Is someone setting the $OPTIONS variable? I think it was >>>> set in the config file in the past, but not now. >>>> >>>> 3. 
From opensm/scripts/redhat-opensm.init.in: >>>> CONFIG=@sysconfdir@/sysconfig/opensm.conf >>>> if [ -f $CONFIG ]; then >>>> . $CONFIG >>>> fi >>>> >>>> From opensm/scripts/opensm.init.in: >>>> if [[ -s /etc/sysconfig/opensm ]]; then >>>> . /etc/sysconfig/opensm >>>> fi >>>> >>>> If it's not some naming convention, perhaps we should use >>>> opensm.conf in both cases? >>>> >>>> 4. Logrotate: >>>> opensm/scripts/opensm.spec.in installs logrotate file as follows: >>>> install -m 644 scripts/opensm.logrotate $etc/logrotate.d/opensm >>>> I may be off here, but should the installed file name be opensmd >>>> to match the service name? >>>> >>>> -- Yevgeny >>>> _______________________________________________ >>>> general mailing list >>>> general at lists.openfabrics.org >>>> http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>> >>>> To unsubscribe, please visit http:// >>>> openib.org/mailman/listinfo/openib-general >>>> >>>> >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit http:// >>> openib.org/mailman/listinfo/openib-general >>> >>> > > From amirv at mellanox.co.il Mon Oct 27 06:32:55 2008 From: amirv at mellanox.co.il (Amir Vadai) Date: Mon, 27 Oct 2008 15:32:55 +0200 Subject: [ofa-general] Bug with SDP on IA64 In-Reply-To: <49059B5C.40605@ext.bull.net> References: <48F891D2.4010502@ext.bull.net> <49043702.4070806@mellanox.co.il> <490565EC.7000507@mellanox.co.il> <49059B5C.40605@ext.bull.net> Message-ID: <4905C307.3020801@mellanox.co.il> I opened a bug in bugzilla with your research: https://bugs.openfabrics.org/show_bug.cgi?id=1311 Nicolas Morey Chaisemartin wrote: > Amir Vadai a écrit : >> I asked our IB expert Jack for hints and he told me this: >> >> >> >From Section 11.6.2 (COMPLETION RETURN STATUS0 of the IB Spec volume >> 1, revision 1.2.1 >> * Local Length Error - ... 
Generated for a
>> Work Request posted to the local Receive Queue when the sum of
>> the Data Segment lengths is too small to receive a valid incoming
>> message or the length of the incoming message is greater than the
>> maximum message size supported by the HCA port that received the
>> message.
>>
>> There seem to be 2 possibilities:
>> 1. The receiver did not post enough/large-enough scatter gather
>> entries in the receive queue.
>>
>> or 2. The sender sent a 0-length packet, but did so incorrectly.
>> (If any of the s/g entries (i.e., data segment entries) have a zero
>> byte count, this results in 2 GigaBytes of data being sent over
>> the wire.)
>>
>> I note that SDP does not check for this (see sdp_post_send() in
>> file sdp_bcopy.c: the sge->length field is not checked for zero length).
>>
>
> I think I got it.
> In sdp_cma.c/sdp_response_handler, the fragment size is retrieved through
> sdp_sk(sk)->xmit_size_goal = ntohl(h->actrcvsz) -
> sizeof(struct sdp_bsdh);
> The dmesg output shows:
> sdp_sock(41820:0): sdp_response_handler bufs 64 xmit_size_goal 34816
> send trigger 16
>
> I forced this value to 2048 and then it works.
> On Xeon this size is 2048 by default.
>
> In my understanding the xmit_size_goal is the size of the receiving
> buffer for buffered copies, isn't it?
> So it shouldn't really matter as long as the packet is properly split
> at the MTU size to be sent over the network, right?
> Could it be that it only works on x86/x86_64 because the buffer
> size is smaller than the MTU?
> > Nicolas > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From philippe.gregoire at cea.fr Mon Oct 27 06:51:17 2008 From: philippe.gregoire at cea.fr (Philippe Gregoire) Date: Mon, 27 Oct 2008 14:51:17 +0100 Subject: [ofa-general] opensm as service - cfg files In-Reply-To: <4905B811.6020601@dev.mellanox.co.il> References: <48FF22FC.6000606@dev.mellanox.co.il> <490073C0.70109@cea.fr> <1224786733.1197.398.camel@cardanus.llnl.gov> <49058C81.6000007@cea.fr> <4905B811.6020601@dev.mellanox.co.il> Message-ID: <4905C755.6010901@cea.fr> Yevgeny Kliteynik a écrit : > Philippe Gregoire wrote: >> Al Chu a écrit : >>> On Thu, 2008-10-23 at 14:53 +0200, Philippe Gregoire wrote: >>> >>>> Hi Yevgeny, >>>> >>>> Is it possible to write this service so it will be able to manage >>>> multiple instances of opensm on the same node, I mean start and >>>> stop all instances at the same time or separately. >>>> This will be very usefull when you have several Infiniband storage >>>> devices connected directly to one node, >>>> so you have to run several opensm -g guid processes on this node. >>>> >>>> It is authorized to have a service that understand parameters like: >>>> service start 0x8000010232 >>>> or service start ddn12.conf >>>> >>> >>> This doesn't sound like that bad of idea, although "what does the user >>> expect" is a concern. My co-worker brought up the simple issue of the >>> log files. Do you automatically pick a different log file to store to, >>> or does it store to the same log, or is it the user's responsibility to >>> pick a reasonable different log file name in the .conf file? I have no >>> idea what other daemons/init scripts do. 
>>> >>> Al >>> >>> >> >> init scripts generally execute/source some configuration file located >> in /etc/sysconfig/ to set some variables used in the script. These >> variables can be used to distinguish pid filename and log filename >> for different opensm instances. If these variables are not defined in >> the conf file, they should be build from the parameter value e.g : >> opensm.log.ddn12 or opensm.pid.ddn12 > > It is possible to make init script understand parameters, so that > you will be able to run "opensmd start -guid " or > "opensmd start -conf ", but I think that there will be > some problems monitoring these opensm instances once they started. > For instance, how would you run "opensmd status" on a specific > opensm instance? No, I you do the job for the start part, it should be easy to do the same for the status part to allow service opensmd status -guid or -conf . On I/O node connected directly to an IB storage, you must be able to manage separately each IB port. > Other approach is to have some variable in the conf file, e.g. > GUIDS="guid1 guid2 ...", and then in the init script it will > iterate through all the guids and run opensm instances for all of > them, but then you'll be able to manage these processes together, > not one by one. For instance, "opensmd stop" will kill all the > opensm processes. > > -- Yevgeny > >>>> Philippe Gregoire >>>> CEA/DAM. >>>> >>>> Yevgeny Kliteynik a écrit : >>>> >>>>> Hi Sasha, >>>>> >>>>> I was just trying to put some order in my head regarding >>>>> the use of opensm as service, and I have couple of questions. >>>>> Some of them might be dumb, so please bear with me... :) >>>>> >>>>> 1. OpenSM config file. >>>>> Do we still need opensm/scripts/opensm.conf? >>>>> I think it's not used any more. >>>>> >>>>> 2. From opensm/scripts/opensm.init.in: >>>>> @sbindir@/opensm -B $OPTIONS > /dev/null >>>>> Is someone setting the $OPTIONS variable? I think it was >>>>> set in the config file in the past, but not now. 
>>>>> >>>>> 3. From opensm/scripts/redhat-opensm.init.in: >>>>> CONFIG=@sysconfdir@/sysconfig/opensm.conf >>>>> if [ -f $CONFIG ]; then >>>>> . $CONFIG >>>>> fi >>>>> >>>>> From opensm/scripts/opensm.init.in: >>>>> if [[ -s /etc/sysconfig/opensm ]]; then >>>>> . /etc/sysconfig/opensm >>>>> fi >>>>> >>>>> If it's not some naming convention, perhaps we should use >>>>> opensm.conf in both cases? >>>>> >>>>> 4. Logrotate: >>>>> opensm/scripts/opensm.spec.in installs logrotate file as follows: >>>>> install -m 644 scripts/opensm.logrotate $etc/logrotate.d/opensm >>>>> I may be off here, but should the installed file name be opensmd >>>>> to match the service name? >>>>> >>>>> -- Yevgeny >>>>> _______________________________________________ >>>>> general mailing list >>>>> general at lists.openfabrics.org >>>>> http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>>> >>>>> To unsubscribe, please visit http:// >>>>> openib.org/mailman/listinfo/openib-general >>>>> >>>>> >>>> _______________________________________________ >>>> general mailing list >>>> general at lists.openfabrics.org >>>> http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>> >>>> To unsubscribe, please visit http:// >>>> openib.org/mailman/listinfo/openib-general >>>> >>>> >> >> > > From amirv at mellanox.co.il Mon Oct 27 08:11:53 2008 From: amirv at mellanox.co.il (Amir Vadai) Date: Mon, 27 Oct 2008 17:11:53 +0200 Subject: [ofa-general] [PATCH] sdp: fixed sparse warning In-Reply-To: <> References: <> Message-ID: <1225120313-17067-1-git-send-email-amirv@mellanox.co.il> --- drivers/infiniband/ulp/sdp/sdp_bcopy.c | 4 ++-- drivers/infiniband/ulp/sdp/sdp_cma.c | 8 ++++---- drivers/infiniband/ulp/sdp/sdp_main.c | 11 +++++------ 3 files changed, 11 insertions(+), 12 deletions(-) diff --git a/drivers/infiniband/ulp/sdp/sdp_bcopy.c b/drivers/infiniband/ulp/sdp/sdp_bcopy.c index d60b257..a2472e9 100644 --- a/drivers/infiniband/ulp/sdp/sdp_bcopy.c +++ 
b/drivers/infiniband/ulp/sdp/sdp_bcopy.c @@ -303,7 +303,7 @@ static void sdp_post_recv(struct sdp_sock *ssk) skb_frag_t *frag; struct sdp_bsdh *h; int id = ssk->rx_head; - unsigned int gfp_page; + gfp_t gfp_page; /* Now, allocate and repost recv */ /* TODO: allocate from cache */ @@ -496,7 +496,7 @@ void sdp_post_sends(struct sdp_sock *ssk, int nonagle) /* TODO: nonagle? */ struct sk_buff *skb; int c; - int gfp_page; + gfp_t gfp_page; if (unlikely(!ssk->id)) { if (ssk->isk.sk.sk_send_head) { diff --git a/drivers/infiniband/ulp/sdp/sdp_cma.c b/drivers/infiniband/ulp/sdp/sdp_cma.c index 6659d28..6206835 100644 --- a/drivers/infiniband/ulp/sdp/sdp_cma.c +++ b/drivers/infiniband/ulp/sdp/sdp_cma.c @@ -93,7 +93,7 @@ static void sdp_qp_event_handler(struct ib_event *event, void *data) { } -int sdp_init_qp(struct sock *sk, struct rdma_cm_id *id) +static int sdp_init_qp(struct sock *sk, struct rdma_cm_id *id) { struct ib_qp_init_attr qp_init_attr = { .event_handler = sdp_qp_event_handler, @@ -193,7 +193,7 @@ err_tx: return rc; } -int sdp_connect_handler(struct sock *sk, struct rdma_cm_id *id, +static int sdp_connect_handler(struct sock *sk, struct rdma_cm_id *id, struct rdma_cm_event *event) { struct sockaddr_in *dst_addr; @@ -303,7 +303,7 @@ static int sdp_response_handler(struct sock *sk, struct rdma_cm_id *id, return 0; } -int sdp_connected_handler(struct sock *sk, struct rdma_cm_event *event) +static int sdp_connected_handler(struct sock *sk, struct rdma_cm_event *event) { struct sock *parent; sdp_dbg(sk, "%s\n", __func__); @@ -345,7 +345,7 @@ done: return 0; } -int sdp_disconnected_handler(struct sock *sk) +static int sdp_disconnected_handler(struct sock *sk) { struct sdp_sock *ssk = sdp_sk(sk); diff --git a/drivers/infiniband/ulp/sdp/sdp_main.c b/drivers/infiniband/ulp/sdp/sdp_main.c index 32833cd..17e98bb 100644 --- a/drivers/infiniband/ulp/sdp/sdp_main.c +++ b/drivers/infiniband/ulp/sdp/sdp_main.c @@ -141,7 +141,7 @@ struct workqueue_struct *sdp_workqueue; static 
struct list_head sock_list; static spinlock_t sock_list_lock; -DEFINE_RWLOCK(device_removal_lock); +static DEFINE_RWLOCK(device_removal_lock); static inline unsigned int sdp_keepalive_time_when(const struct sdp_sock *ssk) { @@ -1227,7 +1227,7 @@ static inline void skb_entail(struct sock *sk, struct sdp_sock *ssk, ssk->nonagle &= ~TCP_NAGLE_PUSH; } -void sdp_push_one(struct sock *sk, unsigned int mss_now) +static void sdp_push_one(struct sock *sk, unsigned int mss_now) { } @@ -1593,7 +1593,7 @@ void sdp_bzcopy_write_space(struct sdp_sock *ssk) /* Like tcp_sendmsg */ /* TODO: check locking */ -int sdp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, +static int sdp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, size_t size) { struct iovec *iov; @@ -1939,7 +1939,6 @@ static int sdp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, } } if (!(flags & MSG_TRUNC)) { - int err; err = skb_copy_datagram_iovec(skb, offset, /* TODO: skip header? */ msg->msg_iov, used); @@ -2018,7 +2017,7 @@ static int sdp_listen(struct sock *sk, int backlog) /* We almost could use inet_listen, but that calls inet_csk_listen_start. Longer term we'll want to add a listen callback to struct proto, similiar to bind. */ -int sdp_inet_listen(struct socket *sock, int backlog) +static int sdp_inet_listen(struct socket *sock, int backlog) { struct sock *sk = sock->sk; unsigned char old_state; @@ -2453,7 +2452,7 @@ static struct net_proto_family sdp_net_proto = { .owner = THIS_MODULE, }; -struct ib_client sdp_client = { +static struct ib_client sdp_client = { .name = "sdp", .add = sdp_add_device, .remove = sdp_remove_device -- 1.5.3 From tziporet at mellanox.co.il Mon Oct 27 08:21:05 2008 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 27 Oct 2008 17:21:05 +0200 Subject: [ofa-general] OFED meeting agenda for today (Oct 27) Message-ID: <5D49E7A8952DC44FB38C38FA0D758EADCC5D51@mtlexch01.mtl.com> This is the agenda for the OFED meeting today: 1. 
Review bugs status and decide on their priority

1262 cri andy.grover at oracle.com       congestion hang with RDS
1298 cri Jeffrey.C.Becker at nasa.gov    nfsrdma rh5.1 causes kernel panic
1299 cri Jeffrey.C.Becker at nasa.gov    nfs module is missing symbols in rh5.1
1287 cri vlad at mellanox.co.il          IPoIB datagram mode initial packet loss
1242 cri yannick.cote at qlogic.com      kernel panic while running mpi2007 against ofed1.4 -- ib_...
1301 maj andy.grover at oracle.com       Can not load rds module on RH4 up7
1221 maj Jeffrey.C.Becker at nasa.gov    SLES10 sp2: remote logins via ssh fail due to rpcbind and...
1284 maj monis at voltaire.com           Bonding - when eth bonding and IB bonding are configured,...
1286 maj monis at voltaire.com           bond does not failover correctly
1308 maj vlad at mellanox.co.il          path_rec_completion [ib_ipoib] kernel Oops (Unable to han...
1164 maj yosefe at voltaire.com          iperf over IPoIB UD fails for 100 tcp connections
1288 maj yosefe at voltaire.com          bug warning while disabling/enabling ports from the switch

2. Decide on RC4 target date
3. Review BOF presentation - Woody, if you wish to review the latest version you can send it

Tziporet Koren
Software Director
Mellanox Technologies
mailto: tziporet at mellanox.co.il
Tel +972-4-9097200, ext 380

From rdreier at cisco.com Mon Oct 27 09:12:20 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 27 Oct 2008 09:12:20 -0700
Subject: [ofa-general] Re: [PATCH] IB/sysfs: Add port_xmit_wait counter.
In-Reply-To: <20081026100202.GA15179@mellanox.co.il> (Vladimir Sokolovsky's message of "Sun, 26 Oct 2008 12:02:02 +0200")
References: <20081026100202.GA15179@mellanox.co.il>
Message-ID:

Looks OK... probably not worth checking
ClassPortInfo:CapabilityMask.PortCountersXmitWaitSupported to make sure
this field is defined, although it is unfortunate that the IB spec says
that PortXmitWait is undefined rather than 0 when it isn't supported.
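As a rough illustration of the capability check mentioned above, here is a userspace sketch in C. It assumes the PortCounters attribute data is available as a raw byte buffer and that PortXmitWait is a 32-bit big-endian field at bit offset 320, matching the width and offset used in the patch under review. The bit position chosen for PortCountersXmitWaitSupported (bit 12) and all function names are assumptions made for illustration; they should be checked against the IB spec and the kernel headers before being relied on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit for PortCountersXmitWaitSupported in the PM
 * ClassPortInfo CapabilityMask; verify against the IB spec. */
#define PM_CAP_XMIT_WAIT_SUP (1u << 12)

/* Read a 32-bit big-endian field located bit_offset bits into buf.
 * bit_offset must be byte aligned. */
static uint32_t read_be32_field(const uint8_t *buf, size_t bit_offset)
{
	const uint8_t *p;

	assert(buf != NULL);
	assert(bit_offset % 8 == 0);
	p = buf + bit_offset / 8;
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns 1 and stores the counter in *value if the capability mask
 * advertises PortXmitWait. Returns 0 and leaves *value untouched
 * otherwise, because the field is undefined (not zero) in that case. */
static int read_port_xmit_wait(uint16_t cap_mask, const uint8_t *port_counters,
			       uint32_t *value)
{
	if (!(cap_mask & PM_CAP_XMIT_WAIT_SUP))
		return 0;
	*value = read_be32_field(port_counters, 320);
	return 1;
}
```

A caller that gets 0 back would report the counter as unavailable instead of printing whatever happens to be in the buffer.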
Anyway, one question:

> static PORT_PMA_ATTR(port_xmit_packets , 14, 32, 256);
> static PORT_PMA_ATTR(port_rcv_packets , 15, 32, 288);
> +/*
> + * There is no bit allocated for port_xmit_wait in the CounterSelect field
> + * (IB spec). However, since this bit is ignored when reading
> + * (show_pma_counter), the _counter field of port_xmit_wait can be set to zero.
> + */
> +static PORT_PMA_ATTR(port_xmit_wait , 0, 32, 320);

I actually can't find any place where we look at the _counter field that
is passed into PORT_PMA_ATTR(), and this code was written so long ago
that I can't remember what reason (if any) we had for including it. Do
you know if there's any reason not to just delete the whole _counter
thing entirely?

From rdreier at cisco.com Mon Oct 27 09:19:20 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 27 Oct 2008 09:19:20 -0700
Subject: [ofa-general] [PATCH] ib_core: Use weak ordering for data registered memory
In-Reply-To: <20081026093253.GA11974@mtls03> (Eli Cohen's message of "Sun, 26 Oct 2008 11:32:53 +0200")
References: <20081026093253.GA11974@mtls03>
Message-ID:

> Some architectures support weak ordering in which case better
> performance is possible. IB registered memory used for data can be
> weakly ordered because the completion queues' buffers are
> registered as strongly ordered. This will result in flushing all data
> related outstanding DMA requests by the HCA when a completion is DMAed
> to a completion queue buffer.
> This patch will allow weak ordering for data if ib_core is loaded with
> the module parameter, allow_weak_ordering, set to a nonzero value.

Hmm, I guess this is OK, although I wish there were a good way for users
and applications to know whether the ordering of RDMA can be relied on
or not.

> Also, are you going to push this to 2.6.28?

No, since this appeared late in the 2.6.28 merge window, it's way too
late for 2.6.28 -- things need to be submitted before the merge window
to have a chance of going in.
We'll get some fix for Cell performance into 2.6.29.

 - R.

From rdreier at cisco.com Mon Oct 27 09:21:16 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 27 Oct 2008 09:21:16 -0700
Subject: [ofa-general] [PATCH 3/3] IB/ipath - improve UD loopback performance by allocating temp array once
In-Reply-To: <20081023195017.10020.33878.stgit@eng-46.mv.qlogic.com> (Ralph Campbell's message of "Thu, 23 Oct 2008 12:50:17 -0700")
References: <20081023195001.10020.96260.stgit@eng-46.mv.qlogic.com> <20081023195017.10020.33878.stgit@eng-46.mv.qlogic.com>
Message-ID:

The first two look OK for 2.6.28 I guess, although they don't seem to be
regression fixes and appeared at the tail end of the merge window. I'll
probably sneak them into -rc3.

But this patch is just an optimization, right? So I'll wait for 2.6.29
for this one.

From Thomas.Talpey at netapp.com Mon Oct 27 09:32:06 2008
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Mon, 27 Oct 2008 12:32:06 -0400
Subject: [ofa-general] [PATCH] ib_core: Use weak ordering for data registered memory
In-Reply-To:
References: <20081026093253.GA11974@mtls03>
Message-ID:

At 12:19 PM 10/27/2008, Roland Dreier wrote:
> > Some architectures support weak ordering in which case better
> > performance is possible. IB registered memory used for data can be
> > weakly ordered because the completion queues' buffers are
> > registered as strongly ordered. This will result in flushing all data
> > related outstanding DMA requests by the HCA when a completion is DMAed
> > to a completion queue buffer.
> > This patch will allow weak ordering for data if ib_core is loaded with
> > the module parameter, allow_weak_ordering, set to a nonzero value.
>
>Hmm, I guess this is OK, although I wish there were a good way for users
>and applications to know whether the ordering of RDMA can be relied on
>or not.

They can't, right? RDMA operations aren't ordered at all per spec, though
there are some architectures/implementations that do order them.
In fact, one might argue that weak ordering should be the _default_ setting here. It would certainly prevent surprise later. Eli - is there some reason you chose mode 0444 to protect against writing the setting after module loading? It looks like the value is inspected dynamically. Tom. > > > Also, are you going to push this to 2.6.28? > >No, since this appeared late in the 2.6.28 merge window, it's way too >late for 2.6.28 -- things need to be submitted before the merge window >to have a chance of going in. > >We'll get some fix for Cell performance into 2.6.29. > > - R. >_______________________________________________ >general mailing list >general at lists.openfabrics.org >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From weiny2 at llnl.gov Mon Oct 27 09:54:30 2008 From: weiny2 at llnl.gov (Ira Weiny) Date: Mon, 27 Oct 2008 09:54:30 -0700 Subject: [ofa-general] opensm as service - cfg files In-Reply-To: <49058C81.6000007@cea.fr> References: <48FF22FC.6000606@dev.mellanox.co.il> <490073C0.70109@cea.fr> <1224786733.1197.398.camel@cardanus.llnl.gov> <49058C81.6000007@cea.fr> Message-ID: <20081027095430.3655e863.weiny2@llnl.gov> On Mon, 27 Oct 2008 10:40:17 +0100 Philippe Gregoire wrote: > Al Chu a écrit : > > On Thu, 2008-10-23 at 14:53 +0200, Philippe Gregoire wrote: > > > >> Hi Yevgeny, > >> > >> Is it possible to write this service so it will be able to manage multiple instances of opensm on the same node, I mean start and stop all instances at the same time or separately. > >> This will be very usefull when you have several Infiniband storage devices connected directly to one node, > >> so you have to run several opensm -g guid processes on this node. 
> >> > >> It is authorized to have a service that understand parameters like: > >> service start 0x8000010232 > >> or > >> service start ddn12.conf > >> > > > > This doesn't sound like that bad of idea, although "what does the user > > expect" is a concern. My co-worker brought up the simple issue of the > > log files. Do you automatically pick a different log file to store to, > > or does it store to the same log, or is it the user's responsibility to > > pick a reasonable different log file name in the .conf file? I have no > > idea what other daemons/init scripts do. > > > > Al > > > > > > init scripts generally execute/source some configuration file located in > /etc/sysconfig/ to set some variables used in the script. These variables can > be used to distinguish pid filename and log filename for different opensm > instances. If these variables are not defined in the conf file, they should > be build from the parameter value e.g : opensm.log.ddn12 or opensm.pid.ddn12 I hate to throw fuel on the fire but the console port would have to be changed for each instance as well. If you want to use the console on each SM. I don't know if there are other things but I think it will take some work to get it sorted out. Ira > > > > >> Philippe Gregoire > >> CEA/DAM. > >> > >> Yevgeny Kliteynik a écrit : > >> > >>> Hi Sasha, > >>> > >>> I was just trying to put some order in my head regarding > >>> the use of opensm as service, and I have couple of questions. > >>> Some of them might be dumb, so please bear with me... :) > >>> > >>> 1. OpenSM config file. > >>> Do we still need opensm/scripts/opensm.conf? > >>> I think it's not used any more. > >>> > >>> 2. From opensm/scripts/opensm.init.in: > >>> @sbindir@/opensm -B $OPTIONS > /dev/null > >>> Is someone setting the $OPTIONS variable? I think it was > >>> set in the config file in the past, but not now. > >>> > >>> 3. 
From opensm/scripts/redhat-opensm.init.in: > >>> CONFIG=@sysconfdir@/sysconfig/opensm.conf > >>> if [ -f $CONFIG ]; then > >>> . $CONFIG > >>> fi > >>> > >>> From opensm/scripts/opensm.init.in: > >>> if [[ -s /etc/sysconfig/opensm ]]; then > >>> . /etc/sysconfig/opensm > >>> fi > >>> > >>> If it's not some naming convention, perhaps we should use > >>> opensm.conf in both cases? > >>> > >>> 4. Logrotate: > >>> opensm/scripts/opensm.spec.in installs logrotate file as follows: > >>> install -m 644 scripts/opensm.logrotate $etc/logrotate.d/opensm > >>> I may be off here, but should the installed file name be opensmd > >>> to match the service name? > >>> > >>> -- Yevgeny > >>> _______________________________________________ > >>> general mailing list > >>> general at lists.openfabrics.org > >>> http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >>> > >>> To unsubscribe, please visit > >>> http:// openib.org/mailman/listinfo/openib-general > >>> > >>> > >> _______________________________________________ > >> general mailing list > >> general at lists.openfabrics.org > >> http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> > >> To unsubscribe, please visit http:// openib.org/mailman/listinfo/openib-general > >> > >> > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http:// openib.org/mailman/listinfo/openib-general > From vlad at mellanox.co.il Mon Oct 27 09:55:52 2008 From: vlad at mellanox.co.il (Vladimir Sokolovsky) Date: Mon, 27 Oct 2008 18:55:52 +0200 Subject: [ofa-general] Re: [PATCH] IB/sysfs: Add port_xmit_wait counter. In-Reply-To: References: <20081026100202.GA15179@mellanox.co.il> Message-ID: <4905F298.8030102@mellanox.co.il> Roland Dreier wrote: > Looks OK... 
probably not worth checking > ClassPortInfo:CapabilityMask.PortCountersXmitWaitSupported to make sure > this field is defined, although it is unfortunate that the IB spec says > that PortXmitWait is undefined rather than 0 when it isn't supported. > > Anyway, one question: > > > static PORT_PMA_ATTR(port_xmit_packets , 14, 32, 256); > > static PORT_PMA_ATTR(port_rcv_packets , 15, 32, 288); > > +/* > > + * There is no bit allocated for port_xmit_wait in the CounterSelect field > > + * (IB spec). However, since this bit is ignored when reading > > + * (show_pma_counter), the _counter field of port_xmit_wait can be set to zero. > > + */ > > +static PORT_PMA_ATTR(port_xmit_wait , 0, 32, 320); > > I actually can't find any place where we look at the _counter field that > is passed into PORT_PMA_ATTR(), and this codfe was written so long ago > that I can't remember what reason (if any) we had for including it. Do > you know if there's any reason why not to just delete the whole _counter > thing entirely? The _counter field is ignored by show_pma_counter, it will be relevant for set_pma_counter, if we are going to add it. Regards, Vladimir From eli at dev.mellanox.co.il Mon Oct 27 10:03:14 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Mon, 27 Oct 2008 19:03:14 +0200 Subject: [ofa-general] [PATCH] ib_core: Use weak ordering for data registered memory In-Reply-To: References: <20081026093253.GA11974@mtls03> Message-ID: <20081027170314.GA6034@mtls03> On Mon, Oct 27, 2008 at 12:32:06PM -0400, Talpey, Thomas wrote: > > Eli - is there some reason you chose mode 0444 to protect against writing the > setting after module loading? It looks like the value is inspected dynamically. > I think the value of the parameter should be determined at driver load time, by the one who loads the module (e.g. the system administrator), based on the knowledge as for which applications will run on the system -- some of them do polling on data and others may not. 
So I don't see any point in changing this dynamically. From chu11 at llnl.gov Mon Oct 27 10:10:04 2008 From: chu11 at llnl.gov (Al Chu) Date: Mon, 27 Oct 2008 10:10:04 -0700 Subject: [ofa-general] opensm as service - cfg files In-Reply-To: <49058C81.6000007@cea.fr> References: <48FF22FC.6000606@dev.mellanox.co.il> <490073C0.70109@cea.fr> <1224786733.1197.398.camel@cardanus.llnl.gov> <49058C81.6000007@cea.fr> Message-ID: <1225127404.1197.458.camel@cardanus.llnl.gov> Hey Philippe, On Mon, 2008-10-27 at 10:40 +0100, Philippe Gregoire wrote: > Al Chu a écrit : > > On Thu, 2008-10-23 at 14:53 +0200, Philippe Gregoire wrote: > > > >> Hi Yevgeny, > >> > >> Is it possible to write this service so it will be able to manage multiple instances of opensm on the same node, I mean start and stop all instances at the same time or separately. > >> This will be very usefull when you have several Infiniband storage devices connected directly to one node, > >> so you have to run several opensm -g guid processes on this node. > >> > >> It is authorized to have a service that understand parameters like: > >> service start 0x8000010232 > >> or > >> service start ddn12.conf > >> > > > > This doesn't sound like that bad of idea, although "what does the user > > expect" is a concern. My co-worker brought up the simple issue of the > > log files. Do you automatically pick a different log file to store to, > > or does it store to the same log, or is it the user's responsibility to > > pick a reasonable different log file name in the .conf file? I have no > > idea what other daemons/init scripts do. > > > > Al > > > > > > init scripts generally execute/source some configuration file located > in /etc/sysconfig/ to set some variables used in the script. These > variables can be used to distinguish pid filename and log filename for > different opensm instances. 
If these variables are not defined in the > conf file, they should be build from the parameter value e.g : > opensm.log.ddn12 or opensm.pid.ddn12 My point was should the script automatically handle this, or is it the user's responsibility to set everything up? As Ira mentioned in a later post, the console port is supposed to be at a known port value so users know what port to connect to. So is it wise for the script to auto- magically select different different port values for different opensm instances? Personally I don't think so. I was initially thinking the init script could take command line arguments that could be passed directly to the init.d scripts. So for example, you can say: service opensmd start "--config ddn.conf" service opensmd start "--config lsi.conf" This puts alternate log file names and console port numbers into the responsibility of the user. Al > > >> Philippe Gregoire > >> CEA/DAM. > >> > >> Yevgeny Kliteynik a écrit : > >> > >>> Hi Sasha, > >>> > >>> I was just trying to put some order in my head regarding > >>> the use of opensm as service, and I have couple of questions. > >>> Some of them might be dumb, so please bear with me... :) > >>> > >>> 1. OpenSM config file. > >>> Do we still need opensm/scripts/opensm.conf? > >>> I think it's not used any more. > >>> > >>> 2. From opensm/scripts/opensm.init.in: > >>> @sbindir@/opensm -B $OPTIONS > /dev/null > >>> Is someone setting the $OPTIONS variable? I think it was > >>> set in the config file in the past, but not now. > >>> > >>> 3. From opensm/scripts/redhat-opensm.init.in: > >>> CONFIG=@sysconfdir@/sysconfig/opensm.conf > >>> if [ -f $CONFIG ]; then > >>> . $CONFIG > >>> fi > >>> > >>> From opensm/scripts/opensm.init.in: > >>> if [[ -s /etc/sysconfig/opensm ]]; then > >>> . /etc/sysconfig/opensm > >>> fi > >>> > >>> If it's not some naming convention, perhaps we should use > >>> opensm.conf in both cases? > >>> > >>> 4. 
Logrotate:
> >>> opensm/scripts/opensm.spec.in installs logrotate file as follows:
> >>> install -m 644 scripts/opensm.logrotate $etc/logrotate.d/opensm
> >>> I may be off here, but should the installed file name be opensmd
> >>> to match the service name?
> >>>
> >>> -- Yevgeny
> >>>
> >
--
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

From yosefe at Voltaire.COM Mon Oct 27 11:47:22 2008
From: yosefe at Voltaire.COM (Yossi Etigin)
Date: Mon, 27 Oct 2008 20:47:22 +0200
Subject: [ofa-general] false warnings of multicast join failures
Message-ID: <49060CBA.4040100@Voltaire.COM>

I'm referring to these:

ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11

The patch in http://lists.openfabrics.org/pipermail/general/2008-May/050551.html is causing them.
The patch creates a state when there is no sm_ah, so all alloc_mad() calls return -11 (-EAGAIN),
this goes back to ipoib multicast join: ipoib asks the sa to join, it queues work that
calls send_join(), this calls ib_sa_mcmember_rec_query(), this one calls alloc_mad() and gets -EAGAIN.

How about lowering the severity of this error in ipoib_mcast_join_complete() from warning to debug?
Index: b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2008-10-22 20:28:06.000000000 +0200 +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2008-10-27 20:13:59.000000000 +0200 @@ -443,7 +443,7 @@ static int ipoib_mcast_join_complete(int } if (mcast->logcount++ < 20) { - if (status == -ETIMEDOUT) { + if (status == -ETIMEDOUT || status == -EAGAIN) { ipoib_dbg_mcast(priv, "multicast join failed for " IPOIB_GID_FMT ", status %d\n", IPOIB_GID_ARG(mcast->mcmember.mgid), -- --Yossi From kliteyn at dev.mellanox.co.il Mon Oct 27 11:53:25 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 27 Oct 2008 20:53:25 +0200 Subject: [ofa-general] opensm as service - cfg files In-Reply-To: <1225127404.1197.458.camel@cardanus.llnl.gov> References: <48FF22FC.6000606@dev.mellanox.co.il> <490073C0.70109@cea.fr> <1224786733.1197.398.camel@cardanus.llnl.gov> <49058C81.6000007@cea.fr> <1225127404.1197.458.camel@cardanus.llnl.gov> Message-ID: <49060E25.3000204@dev.mellanox.co.il> Al Chu wrote: > Hey Philippe, > > On Mon, 2008-10-27 at 10:40 +0100, Philippe Gregoire wrote: >> Al Chu a écrit : >>> On Thu, 2008-10-23 at 14:53 +0200, Philippe Gregoire wrote: >>> >>>> Hi Yevgeny, >>>> >>>> Is it possible to write this service so it will be able to manage multiple instances of opensm on the same node, I mean start and stop all instances at the same time or separately. >>>> This will be very usefull when you have several Infiniband storage devices connected directly to one node, >>>> so you have to run several opensm -g guid processes on this node. >>>> >>>> It is authorized to have a service that understand parameters like: >>>> service start 0x8000010232 >>>> or >>>> service start ddn12.conf >>>> >>> This doesn't sound like that bad of idea, although "what does the user >>> expect" is a concern. 
My co-worker brought up the simple issue of the >>> log files. Do you automatically pick a different log file to store to, >>> or does it store to the same log, or is it the user's responsibility to >>> pick a reasonable different log file name in the .conf file? I have no >>> idea what other daemons/init scripts do. >>> >>> Al >>> >>> >> init scripts generally execute/source some configuration file located >> in /etc/sysconfig/ to set some variables used in the script. These >> variables can be used to distinguish pid filename and log filename for >> different opensm instances. If these variables are not defined in the >> conf file, they should be build from the parameter value e.g : >> opensm.log.ddn12 or opensm.pid.ddn12 > > My point was should the script automatically handle this, or is it the > user's responsibility to set everything up? As Ira mentioned in a later > post, the console port is supposed to be at a known port value so users > know what port to connect to. So is it wise for the script to auto- > magically select different different port values for different opensm > instances? Personally I don't think so. > > I was initially thinking the init script could take command line > arguments that could be passed directly to the init.d scripts. So for > example, you can say: > > service opensmd start "--config ddn.conf" > service opensmd start "--config lsi.conf" But then how would the user be able to check the specific service that was launched? I mean, you have "start" command, but what about "status" and "stop"? -- Yevgeny > This puts alternate log file names and console port numbers into the > responsibility of the user. > > Al > >>>> Philippe Gregoire >>>> CEA/DAM. >>>> >>>> Yevgeny Kliteynik a écrit : >>>> >>>>> Hi Sasha, >>>>> >>>>> I was just trying to put some order in my head regarding >>>>> the use of opensm as service, and I have couple of questions. >>>>> Some of them might be dumb, so please bear with me... :) >>>>> >>>>> 1. 
OpenSM config file. >>>>> Do we still need opensm/scripts/opensm.conf? >>>>> I think it's not used any more. >>>>> >>>>> 2. From opensm/scripts/opensm.init.in: >>>>> @sbindir@/opensm -B $OPTIONS > /dev/null >>>>> Is someone setting the $OPTIONS variable? I think it was >>>>> set in the config file in the past, but not now. >>>>> >>>>> 3. From opensm/scripts/redhat-opensm.init.in: >>>>> CONFIG=@sysconfdir@/sysconfig/opensm.conf >>>>> if [ -f $CONFIG ]; then >>>>> . $CONFIG >>>>> fi >>>>> >>>>> From opensm/scripts/opensm.init.in: >>>>> if [[ -s /etc/sysconfig/opensm ]]; then >>>>> . /etc/sysconfig/opensm >>>>> fi >>>>> >>>>> If it's not some naming convention, perhaps we should use >>>>> opensm.conf in both cases? >>>>> >>>>> 4. Logrotate: >>>>> opensm/scripts/opensm.spec.in installs logrotate file as follows: >>>>> install -m 644 scripts/opensm.logrotate $etc/logrotate.d/opensm >>>>> I may be off here, but should the installed file name be opensmd >>>>> to match the service name? >>>>> >>>>> -- Yevgeny >>>>> _______________________________________________ >>>>> general mailing list >>>>> general at lists.openfabrics.org >>>>> http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>>> >>>>> To unsubscribe, please visit >>>>> http:// openib.org/mailman/listinfo/openib-general >>>>> >>>>> >>>> _______________________________________________ >>>> general mailing list >>>> general at lists.openfabrics.org >>>> http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>> >>>> To unsubscribe, please visit http:// openib.org/mailman/listinfo/openib-general >>>> >>>> >> From jeff at garzik.org Mon Oct 27 11:52:57 2008 From: jeff at garzik.org (Jeff Garzik) Date: Mon, 27 Oct 2008 14:52:57 -0400 Subject: [ofa-general][PATCH] mlx4: Setting the correct offset for default mac address In-Reply-To: <49048914.1050000@mellanox.co.il> References: <49048914.1050000@mellanox.co.il> Message-ID: <49060E09.9060706@garzik.org> Yevgeny Petrilin wrote: > Signed-off-by: 
Yevgeny Petrilin > --- > drivers/net/mlx4/fw.c | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/drivers/net/mlx4/fw.c b/drivers/net/mlx4/fw.c > index be09fdb..cee199c 100644 > --- a/drivers/net/mlx4/fw.c > +++ b/drivers/net/mlx4/fw.c > @@ -360,9 +360,9 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap) > #define QUERY_PORT_ETH_MTU_OFFSET 0x02 > #define QUERY_PORT_WIDTH_OFFSET 0x06 > #define QUERY_PORT_MAX_GID_PKEY_OFFSET 0x07 > -#define QUERY_PORT_MAC_OFFSET 0x08 > #define QUERY_PORT_MAX_MACVLAN_OFFSET 0x0a > #define QUERY_PORT_MAX_VL_OFFSET 0x0b > +#define QUERY_PORT_MAC_OFFSET 0x10 applied From jgarzik at pobox.com Mon Oct 27 11:52:49 2008 From: jgarzik at pobox.com (Jeff Garzik) Date: Mon, 27 Oct 2008 14:52:49 -0400 Subject: [ofa-general] Re: mlx4_en: remove duplicated #include In-Reply-To: <2a61071b0810260805o17e78c73q38d133d83e2f5ea8@mail.gmail.com> References: <2a61071b0810260805o17e78c73q38d133d83e2f5ea8@mail.gmail.com> Message-ID: <49060E01.4020904@pobox.com> Huang Weiyi wrote: > Removed duplicated #include in > drivers/net/mlx4/en_main.c. 
> > Signed-off-by: Huang Weiyi > > > diff --git a/drivers/net/mlx4/en_main.c b/drivers/net/mlx4/en_main.c > index 1b0eebf..4b9794e 100644 > --- a/drivers/net/mlx4/en_main.c > +++ b/drivers/net/mlx4/en_main.c > @@ -35,7 +35,6 @@ > #include > #include > #include > -#include applied From chu11 at llnl.gov Mon Oct 27 13:36:02 2008 From: chu11 at llnl.gov (Al Chu) Date: Mon, 27 Oct 2008 13:36:02 -0700 Subject: [ofa-general] opensm as service - cfg files In-Reply-To: <49060E25.3000204@dev.mellanox.co.il> References: <48FF22FC.6000606@dev.mellanox.co.il> <490073C0.70109@cea.fr> <1224786733.1197.398.camel@cardanus.llnl.gov> <49058C81.6000007@cea.fr> <1225127404.1197.458.camel@cardanus.llnl.gov> <49060E25.3000204@dev.mellanox.co.il> Message-ID: <1225139762.1197.467.camel@cardanus.llnl.gov> On Mon, 2008-10-27 at 20:53 +0200, Yevgeny Kliteynik wrote: > Al Chu wrote: > > Hey Philippe, > > > > On Mon, 2008-10-27 at 10:40 +0100, Philippe Gregoire wrote: > >> Al Chu a écrit : > >>> On Thu, 2008-10-23 at 14:53 +0200, Philippe Gregoire wrote: > >>> > >>>> Hi Yevgeny, > >>>> > >>>> Is it possible to write this service so it will be able to manage multiple instances of opensm on the same node, I mean start and stop all instances at the same time or separately. > >>>> This will be very usefull when you have several Infiniband storage devices connected directly to one node, > >>>> so you have to run several opensm -g guid processes on this node. > >>>> > >>>> It is authorized to have a service that understand parameters like: > >>>> service start 0x8000010232 > >>>> or > >>>> service start ddn12.conf > >>>> > >>> This doesn't sound like that bad of idea, although "what does the user > >>> expect" is a concern. My co-worker brought up the simple issue of the > >>> log files. Do you automatically pick a different log file to store to, > >>> or does it store to the same log, or is it the user's responsibility to > >>> pick a reasonable different log file name in the .conf file? 
I have no > >>> idea what other daemons/init scripts do. > >>> > >>> Al > >>> > >>> > >> init scripts generally execute/source some configuration file located > >> in /etc/sysconfig/ to set some variables used in the script. These > >> variables can be used to distinguish pid filename and log filename for > >> different opensm instances. If these variables are not defined in the > >> conf file, they should be build from the parameter value e.g : > >> opensm.log.ddn12 or opensm.pid.ddn12 > > > > My point was should the script automatically handle this, or is it the > > user's responsibility to set everything up? As Ira mentioned in a later > > post, the console port is supposed to be at a known port value so users > > know what port to connect to. So is it wise for the script to auto- > > magically select different different port values for different opensm > > instances? Personally I don't think so. > > > > I was initially thinking the init script could take command line > > arguments that could be passed directly to the init.d scripts. So for > > example, you can say: > > > > service opensmd start "--config ddn.conf" > > service opensmd start "--config lsi.conf" > > But then how would the user be able to check the specific service > that was launched? I mean, you have "start" command, but what about > "status" and "stop"? I didn't think that far. So maybe it's not that good of an idea in the end. I'm just a bit concerned that a service can be launched and may elect to override our .conf file settings b/c it wants to launch multiple daemons. Al > -- Yevgeny > > > This puts alternate log file names and console port numbers into the > > responsibility of the user. > > > > Al > > > >>>> Philippe Gregoire > >>>> CEA/DAM. > >>>> > >>>> Yevgeny Kliteynik a écrit : > >>>> > >>>>> Hi Sasha, > >>>>> > >>>>> I was just trying to put some order in my head regarding > >>>>> the use of opensm as service, and I have couple of questions. 
> >>>>> Some of them might be dumb, so please bear with me... :) > >>>>> > >>>>> 1. OpenSM config file. > >>>>> Do we still need opensm/scripts/opensm.conf? > >>>>> I think it's not used any more. > >>>>> > >>>>> 2. From opensm/scripts/opensm.init.in: > >>>>> @sbindir@/opensm -B $OPTIONS > /dev/null > >>>>> Is someone setting the $OPTIONS variable? I think it was > >>>>> set in the config file in the past, but not now. > >>>>> > >>>>> 3. From opensm/scripts/redhat-opensm.init.in: > >>>>> CONFIG=@sysconfdir@/sysconfig/opensm.conf > >>>>> if [ -f $CONFIG ]; then > >>>>> . $CONFIG > >>>>> fi > >>>>> > >>>>> From opensm/scripts/opensm.init.in: > >>>>> if [[ -s /etc/sysconfig/opensm ]]; then > >>>>> . /etc/sysconfig/opensm > >>>>> fi > >>>>> > >>>>> If it's not some naming convention, perhaps we should use > >>>>> opensm.conf in both cases? > >>>>> > >>>>> 4. Logrotate: > >>>>> opensm/scripts/opensm.spec.in installs logrotate file as follows: > >>>>> install -m 644 scripts/opensm.logrotate $etc/logrotate.d/opensm > >>>>> I may be off here, but should the installed file name be opensmd > >>>>> to match the service name? 
> >>>>> > >>>>> -- Yevgeny > >>>>> _______________________________________________ > >>>>> general mailing list > >>>>> general at lists.openfabrics.org > >>>>> http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >>>>> > >>>>> To unsubscribe, please visit > >>>>> http:// openib.org/mailman/listinfo/openib-general > >>>>> > >>>>> > >>>> _______________________________________________ > >>>> general mailing list > >>>> general at lists.openfabrics.org > >>>> http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >>>> > >>>> To unsubscribe, please visit http:// openib.org/mailman/listinfo/openib-general > >>>> > >>>> > >> > > -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From rdreier at cisco.com Mon Oct 27 15:30:13 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 27 Oct 2008 15:30:13 -0700 Subject: [ofa-general] [PATCH 1/3] IB/ipath - fix the length returned in loopback UD completion queue entries In-Reply-To: <20081023195006.10020.16845.stgit@eng-46.mv.qlogic.com> (Ralph Campbell's message of "Thu, 23 Oct 2008 12:50:07 -0700") References: <20081023195001.10020.96260.stgit@eng-46.mv.qlogic.com> <20081023195006.10020.16845.stgit@eng-46.mv.qlogic.com> Message-ID: > UD packets sent to the local IB port (loopback) have a zero length > reported in the send work request completion entry. This fixes it > by using a copy of the WQE to copy the data. According to the IB spec (as I read it at least), the bytes transferred field of a completion entry is only defined for receive completions, and for RDMA read and atomic operation send completions. The value in a UD send completion is undefined anyway. So is this patch really worth it? 
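To make the point above concrete: a consumer that wants to stay within the spec can read the length only for the completion types where it is defined. Below is a minimal sketch in C, assuming the reading of the spec given above (length defined for receive completions and for RDMA read and atomic send completions). The sketch_* names are hypothetical stand-ins so that it builds without any IB headers; they are not libibverbs or kernel API. With real verbs the same check would be applied to the opcode and byte_len fields of the work completion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the work completion opcodes (not a real API). */
enum sketch_wc_opcode {
    SKETCH_WC_SEND,
    SKETCH_WC_RDMA_WRITE,
    SKETCH_WC_RDMA_READ,
    SKETCH_WC_ATOMIC,
    SKETCH_WC_RECV
};

/* Hypothetical stand-in for a work completion: only the two fields we need. */
struct sketch_wc {
    enum sketch_wc_opcode opcode;
    uint32_t byte_len;
};

/*
 * Returns true when the length field is meaningful for this completion type,
 * per the reading of the spec discussed above: receives, RDMA reads, atomics.
 */
static bool sketch_byte_len_is_defined(enum sketch_wc_opcode opcode)
{
    switch (opcode) {
    case SKETCH_WC_RECV:
    case SKETCH_WC_RDMA_READ:
    case SKETCH_WC_ATOMIC:
        return true;
    default:
        /* Plain sends (including UD sends) and RDMA writes: undefined. */
        return false;
    }
}

/*
 * Copies the length into *len and returns true only if it is defined;
 * otherwise leaves *len untouched and returns false.
 */
static bool sketch_get_byte_len(const struct sketch_wc *wc, uint32_t *len)
{
    assert(wc != NULL && len != NULL);
    if (!sketch_byte_len_is_defined(wc->opcode))
        return false;
    *len = wc->byte_len;
    return true;
}
```

An application written this way never looks at the length of a UD send completion, so it behaves the same whether or not the driver fills that field in on the loopback path.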
From rdreier at cisco.com Mon Oct 27 15:31:38 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 27 Oct 2008 15:31:38 -0700 Subject: [ofa-general] [PATCH 2/3] IB/ipath - fix RDMA write with immediate copy of last packet In-Reply-To: <20081023195012.10020.18967.stgit@eng-46.mv.qlogic.com> (Ralph Campbell's message of "Thu, 23 Oct 2008 12:50:12 -0700") References: <20081023195001.10020.96260.stgit@eng-46.mv.qlogic.com> <20081023195012.10020.18967.stgit@eng-46.mv.qlogic.com> Message-ID: thanks, applied. From ricklist at microway.com Mon Oct 27 15:38:48 2008 From: ricklist at microway.com (Rick Warner) Date: Mon, 27 Oct 2008 18:38:48 -0400 Subject: [ofa-general] poll CQ failed -2 with connectX Message-ID: <200810271838.48510.ricklist@microway.com> Hi all, I am configuring an opteron cluster with connectX Infiniband. I have a problem that if I run one of the NAS tests, it works the first, and maybe 2nd time, but after that the jobs instantly fail with messages like this- [Rank 44][cm.c: line 860]poll CQ failed -2 [Rank 51][cm.c: line 860]poll CQ failed -2 [Rank 119][cm.c: line 860]poll CQ failed -2 [Rank 85][cm.c: line 860]poll CQ failed -2 [Rank 0][cm.c: line 860]poll CQ failed -2 [Rank 9][cm.c: line 860]poll CQ failed -2 [Rank 26][cm.c: line 860]poll CQ failed -2[Rank 43][cm.c: line 860] poll CQ failed -2 [Rank 94][cm.c: line 860]poll CQ failed -2 [Rank 111][cm.c: line 860]poll CQ failed -2 I can easily reproduce this with only 2 systems using a 16 process LU job, class B. Here are the configs I've tried- Suse 11 with distro provided IB driver and libraries,etc, using mvapich as provided by ohio state Suse 11 with distro driver, using OFED 1.3.1 libraries and mvapich Suse 10.3 with OFED 1.3.1, OFED 1.2.5.4, and OFED 1.4rc3 They all have the same basic problem. I think one of them reported "Error polling CQ" instead of "poll CQ failed". If I replace the connectX cards with regular DDR cards the problem goes away. 
I'm getting quite stumped at this point and would appreciate any suggestions or patches. Thanks, Rick -- Richard Warner Lead Systems Integrator Microway, Inc (508)732-5517 From ralph.campbell at qlogic.com Mon Oct 27 15:44:18 2008 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Mon, 27 Oct 2008 15:44:18 -0700 Subject: [ofa-general] [PATCH 1/3] IB/ipath - fix the length returned in loopback UD completion queue entries In-Reply-To: References: <20081023195001.10020.96260.stgit@eng-46.mv.qlogic.com> <20081023195006.10020.16845.stgit@eng-46.mv.qlogic.com> Message-ID: <1225147458.12238.640.camel@chromite.mv.qlogic.com> On Mon, 2008-10-27 at 15:30 -0700, Roland Dreier wrote: > > UD packets sent to the local IB port (loopback) have a zero length > > reported in the send work request completion entry. This fixes it > > by using a copy of the WQE to copy the data. > > According to the IB spec (as I read it at least), the bytes transferred > field of a completion entry is only defined for receive completions, and > for RDMA read and atomic operation send completions. The value in a UD > send completion is undefined anyway. So is this patch really worth it? I guess not. I set the length in the non-loopback path and thought it should be consistent. I don't know what applications might check it but if the spec. says it isn't valid then I guess it makes more sense to make things consistent by not setting the length in the other cases. From rick at microway.com Mon Oct 27 15:44:02 2008 From: rick at microway.com (Rick Warner) Date: Mon, 27 Oct 2008 18:44:02 -0400 Subject: [ofa-general] poll CQ failed -2 with connectX In-Reply-To: <200810271838.48510.ricklist@microway.com> References: <200810271838.48510.ricklist@microway.com> Message-ID: <200810271844.02420.rick@microway.com> On Monday 27 October 2008, Rick Warner wrote: > Hi all, > > I am configuring an opteron cluster with connectX Infiniband. 
I have a > problem that if I run one of the NAS tests, it works the first, and maybe > 2nd time, but after that the jobs instantly fail with messages like this- > > [Rank 44][cm.c: line 860]poll CQ failed -2 > [Rank 51][cm.c: line 860]poll CQ failed -2 > [Rank 119][cm.c: line 860]poll CQ failed -2 > [Rank 85][cm.c: line 860]poll CQ failed -2 > [Rank 0][cm.c: line 860]poll CQ failed -2 > [Rank 9][cm.c: line 860]poll CQ failed -2 > [Rank 26][cm.c: line 860]poll CQ failed -2[Rank 43][cm.c: line 860] > poll CQ failed -2 > [Rank 94][cm.c: line 860]poll CQ failed -2 > [Rank 111][cm.c: line 860]poll CQ failed -2 > > I can easily reproduce this with only 2 systems using a 16 process LU job, > class B. > > Here are the configs I've tried- > Suse 11 with distro provided IB driver and libraries,etc, using mvapich as > provided by ohio state > Suse 11 with distro driver, using OFED 1.3.1 libraries and mvapich > Suse 10.3 with OFED 1.3.1, OFED 1.2.5.4, and OFED 1.4rc3 > > They all have the same basic problem. I think one of them reported "Error > polling CQ" instead of "poll CQ failed". > > If I replace the connectX cards with regular DDR cards the problem goes > away. > > I'm getting quite stumped at this point and would appreciate any > suggestions or patches. > > Thanks, > Rick I forgot to mention- on Suse 11 I also tried a manually compiled 2.6.26.4 and 2.6.27.2 kernel, using the in kernel drivers. Thanks, Rick -- Richard Warner Lead Systems Integrator Microway, Inc (508)732-5517 From friedman at ucla.edu Mon Oct 27 18:01:17 2008 From: friedman at ucla.edu (Scott A. Friedman) Date: Mon, 27 Oct 2008 18:01:17 -0700 Subject: [ofa-general] ib_mthca catastrophic error detected Message-ID: <4906645D.6010101@ucla.edu> Hello On a several hundred node cluster we run here we have experienced several large (512+ core) job die with the following left in several of the node's logs. Below is an example from two different nodes. 22 nodes had this error after the large run died. 
What is this error, and why would we be seeing it? I looked through this list and only came across a couple of mentions but no real explanation.

node example A:
ib_mthca 0000:02:00.0: Catastrophic error detected: internal error
ib_mthca 0000:02:00.0: buf[00]: 0012f6f8
ib_mthca 0000:02:00.0: buf[01]: 00000000
ib_mthca 0000:02:00.0: buf[02]: 00000000
ib_mthca 0000:02:00.0: buf[03]: 00000000
ib_mthca 0000:02:00.0: buf[04]: 00000000
ib_mthca 0000:02:00.0: buf[05]: 0012f6dc
ib_mthca 0000:02:00.0: buf[06]: 001b3714
ib_mthca 0000:02:00.0: buf[07]: 00000000
ib_mthca 0000:02:00.0: buf[08]: 00000000
ib_mthca 0000:02:00.0: buf[09]: 00000000
ib_mthca 0000:02:00.0: buf[0a]: 00000000
ib_mthca 0000:02:00.0: buf[0b]: 00000000
ib_mthca 0000:02:00.0: buf[0c]: 00000000
ib_mthca 0000:02:00.0: buf[0d]: 00000000
ib_mthca 0000:02:00.0: buf[0e]: 00000000
ib_mthca 0000:02:00.0: buf[0f]: 00000000

node example B:
ib_mthca 0000:02:00.0: Catastrophic error detected: internal error
ib_mthca 0000:02:00.0: buf[00]: 0012bb7c
ib_mthca 0000:02:00.0: buf[01]: 00000000
ib_mthca 0000:02:00.0: buf[02]: 00000000
ib_mthca 0000:02:00.0: buf[03]: 00000000
ib_mthca 0000:02:00.0: buf[04]: 00000000
ib_mthca 0000:02:00.0: buf[05]: 0012bb5c
ib_mthca 0000:02:00.0: buf[06]: 001905a0
ib_mthca 0000:02:00.0: buf[07]: 00000000
ib_mthca 0000:02:00.0: buf[08]: 00000000
ib_mthca 0000:02:00.0: buf[09]: 00000000
ib_mthca 0000:02:00.0: buf[0a]: 00000000
ib_mthca 0000:02:00.0: buf[0b]: 00000000
ib_mthca 0000:02:00.0: buf[0c]: 00000000
ib_mthca 0000:02:00.0: buf[0d]: 00000000
ib_mthca 0000:02:00.0: buf[0e]: 00000000
ib_mthca 0000:02:00.0: buf[0f]: 00000000

From rdreier at cisco.com Mon Oct 27 19:53:00 2008 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 27 Oct 2008 19:53:00 -0700 Subject: [ofa-general] ib_mthca catastrophic error detected In-Reply-To: <4906645D.6010101@ucla.edu> (Scott A.
Friedman's message of "Mon, 27 Oct 2008 18:01:17 -0700") References: <4906645D.6010101@ucla.edu> Message-ID: > ib_mthca 0000:02:00.0: Catastrophic error detected: internal error This means your HCA detected an internal error -- overheating, power glitch, cosmic ray, firmware bug, something like that. From eli at dev.mellanox.co.il Tue Oct 28 01:08:00 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Tue, 28 Oct 2008 10:08:00 +0200 Subject: [ofa-general] [PATCH] ib_core: Use weak ordering for data registered memory In-Reply-To: References: <20081026093253.GA11974@mtls03> Message-ID: <20081028080800.GA10885@mtls03> On Mon, Oct 27, 2008 at 09:19:20AM -0700, Roland Dreier wrote: > > No, since this appeared late in the 2.6.28 merge window, it's way too > late for 2.6.28 -- things need to be submitted before the merge window > to have a chance of going in. > > We'll get some fix for Cell performance into 2.6.29. > OK thanks. We'll push this as a fix to ofed-1.4 and it will be available for kernel 2.6.27 only.
From vlad at lists.openfabrics.org Tue Oct 28 03:17:42 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 28 Oct 2008 03:17:42 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20081028-0200 daily build status Message-ID: <20081028101742.1E2C6E60FB4@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with 
linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From arnd at arndb.de Tue Oct 28 05:18:12 2008 From: arnd at arndb.de (Arnd Bergmann) Date: Tue, 28 Oct 2008 13:18:12 +0100 Subject: [ofa-general] [PATCH] ib_core: Use weak ordering for data registered memory In-Reply-To: References: <20081026093253.GA11974@mtls03> Message-ID: <200810281318.13255.arnd@arndb.de> On Monday 27 October 2008, Talpey, Thomas wrote: > They can't, right? RDMA operations aren't ordered at all per spec, though > there are some architectures/implementations that do. > > In fact, one might argue that weak ordering should be the _default_ setting > here. It would certainly prevent surprise later. Well, the problem is that we have existing code out there that assumes strict ordering for RDMA, e.g. the eager RDMA option in openmpi. Simply changing the Linux implementation breaks that code, which is something we don't do if we can avoid. For the mthca device driver, we already have an interface (struct mthca_reg_mr) that allows selecting either strict or relaxed (aka strong or weak) ordering for a memory region, the default being relaxed ordering. In the long run, I'd like to see something like that for all device drivers, so that a user space library can tell the kernel about its requirements. Arnd <>< From tziporet at dev.mellanox.co.il Tue Oct 28 05:27:58 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 28 Oct 2008 14:27:58 +0200 Subject: [ofa-general] ib_mthca catastrophic error detected In-Reply-To: <4906645D.6010101@ucla.edu> References: <4906645D.6010101@ucla.edu> Message-ID: <4907054E.9080205@mellanox.co.il> Scott A. 
Friedman wrote: > Hello > > On a several hundred node cluster we run here we have experienced > several large (512+ core) job die with the following left in several > of the node's logs. Below is an example from two different nodes. 22 > nodes had this error after the large run died. > > What is this error and why would be seeing it. I looked through this > list and only came across a couple of mentions but no real explanation. > > node example A: > > ib_mthca 0000:02:00.0: Catastrophic error detected: internal error > Can you specify: Which OFED version you use? (or IB from kernel.org) Which HCA and FW version? Tziporet From gopalakk at cse.ohio-state.edu Tue Oct 28 06:07:59 2008 From: gopalakk at cse.ohio-state.edu (Karthik Gopalakrishnan) Date: Tue, 28 Oct 2008 09:07:59 -0400 Subject: [ofa-general] Question about ibv_asyncwatch Message-ID: <92eddfb50810280607t4135d701p9ed16b3cb23023d8@mail.gmail.com> Hi Folks. I have written a standalone program that calls 'ibv_get_async_event()'. I want to know if that program can get async events about errors on QPs (IBV_EVENT_PATH_MIG_ERR for example) that are created by a different process (say some MPI Program). I also see a utility called 'ibv_asyncwatch' that is shipped as part of OFED that seems to do something similar. I will be grateful if someone could throw more light about what it does and point me to its source. Thanks & Regards, Karthik From pasha at dev.mellanox.co.il Tue Oct 28 06:25:16 2008 From: pasha at dev.mellanox.co.il (Pavel Shamis (Pasha)) Date: Tue, 28 Oct 2008 15:25:16 +0200 Subject: [ofa-general] poll CQ failed -2 with connectX In-Reply-To: <200810271844.02420.rick@microway.com> References: <200810271838.48510.ricklist@microway.com> <200810271844.02420.rick@microway.com> Message-ID: <490712BC.20701@dev.mellanox.co.il> Which MPI implementation do you use ? Rick Warner wrote: > On Monday 27 October 2008, Rick Warner wrote: > >> Hi all, >> >> I am configuring an opteron cluster with connectX Infiniband. 
I have a >> problem that if I run one of the NAS tests, it works the first, and maybe >> 2nd time, but after that the jobs instantly fail with messages like this- >> >> [Rank 44][cm.c: line 860]poll CQ failed -2 >> [Rank 51][cm.c: line 860]poll CQ failed -2 >> [Rank 119][cm.c: line 860]poll CQ failed -2 >> [Rank 85][cm.c: line 860]poll CQ failed -2 >> [Rank 0][cm.c: line 860]poll CQ failed -2 >> [Rank 9][cm.c: line 860]poll CQ failed -2 >> [Rank 26][cm.c: line 860]poll CQ failed -2[Rank 43][cm.c: line 860] >> poll CQ failed -2 >> [Rank 94][cm.c: line 860]poll CQ failed -2 >> [Rank 111][cm.c: line 860]poll CQ failed -2 >> >> I can easily reproduce this with only 2 systems using a 16 process LU job, >> class B. >> >> Here are the configs I've tried- >> Suse 11 with distro provided IB driver and libraries,etc, using mvapich as >> provided by ohio state >> Suse 11 with distro driver, using OFED 1.3.1 libraries and mvapich >> Suse 10.3 with OFED 1.3.1, OFED 1.2.5.4, and OFED 1.4rc3 >> >> They all have the same basic problem. I think one of them reported "Error >> polling CQ" instead of "poll CQ failed". >> >> If I replace the connectX cards with regular DDR cards the problem goes >> away. >> >> I'm getting quite stumped at this point and would appreciate any >> suggestions or patches. >> >> Thanks, >> Rick >> > > I forgot to mention- on Suse 11 I also tried a manually compiled 2.6.26.4 and > 2.6.27.2 kernel, using the in kernel drivers. > > Thanks, > Rick > > From Thomas.Talpey at netapp.com Tue Oct 28 06:41:49 2008 From: Thomas.Talpey at netapp.com (Talpey, Thomas) Date: Tue, 28 Oct 2008 09:41:49 -0400 Subject: [ofa-general] [PATCH] ib_core: Use weak ordering for data registered memory In-Reply-To: <200810281318.13255.arnd@arndb.de> References: <20081026093253.GA11974@mtls03> <200810281318.13255.arnd@arndb.de> Message-ID: At 08:18 AM 10/28/2008, Arnd Bergmann wrote: >On Monday 27 October 2008, Talpey, Thomas wrote: >> They can't, right? 
RDMA operations aren't ordered at all per spec, though >> there are some architectures/implementations that do. >> >> In fact, one might argue that weak ordering should be the _default_ setting >> here. It would certainly prevent surprise later. > >Well, the problem is that we have existing code out there that assumes >strict ordering for RDMA, e.g. the eager RDMA option in openmpi. Simply >changing the Linux implementation breaks that code, which is something >we don't do if we can avoid. So, how does openmpi handle this on devices or architectures that don't provide the placement ordering guarantee? An "eager RDMA option" sounds suggestive, and unless chosen, wouldn't break anything. > >For the mthca device driver, we already have an interface >(struct mthca_reg_mr) that allows selecting either strict or relaxed >(aka strong or weak) ordering for a memory region, the default being >relaxed ordering. > >In the long run, I'd like to see something like that for all device drivers, >so that a user space library can tell the kernel about its requirements. And vice-versa. If a driver or device cannot provide the requirement, it needs to communicate that back to the requester. Tom. From rick at microway.com Tue Oct 28 07:02:38 2008 From: rick at microway.com (Rick Warner) Date: Tue, 28 Oct 2008 10:02:38 -0400 Subject: [ofa-general] poll CQ failed -2 with connectX In-Reply-To: <490712BC.20701@dev.mellanox.co.il> References: <200810271838.48510.ricklist@microway.com> <200810271844.02420.rick@microway.com> <490712BC.20701@dev.mellanox.co.il> Message-ID: <200810281002.39579.rick@microway.com> mvapich 1. (0.9.9, 1.0.1, 1.1.0, depending on the OFED version, etc) Thanks, Rick On Tuesday 28 October 2008, Pavel Shamis (Pasha) wrote: > Which MPI implementation do you use ? > > Rick Warner wrote: > > On Monday 27 October 2008, Rick Warner wrote: > >> Hi all, > >> > >> I am configuring an opteron cluster with connectX Infiniband. 
I have a > >> problem that if I run one of the NAS tests, it works the first, and > >> maybe 2nd time, but after that the jobs instantly fail with messages > >> like this- > >> > >> [Rank 44][cm.c: line 860]poll CQ failed -2 > >> [Rank 51][cm.c: line 860]poll CQ failed -2 > >> [Rank 119][cm.c: line 860]poll CQ failed -2 > >> [Rank 85][cm.c: line 860]poll CQ failed -2 > >> [Rank 0][cm.c: line 860]poll CQ failed -2 > >> [Rank 9][cm.c: line 860]poll CQ failed -2 > >> [Rank 26][cm.c: line 860]poll CQ failed -2[Rank 43][cm.c: line 860] > >> poll CQ failed -2 > >> [Rank 94][cm.c: line 860]poll CQ failed -2 > >> [Rank 111][cm.c: line 860]poll CQ failed -2 > >> > >> I can easily reproduce this with only 2 systems using a 16 process LU > >> job, class B. > >> > >> Here are the configs I've tried- > >> Suse 11 with distro provided IB driver and libraries,etc, using mvapich > >> as provided by ohio state > >> Suse 11 with distro driver, using OFED 1.3.1 libraries and mvapich > >> Suse 10.3 with OFED 1.3.1, OFED 1.2.5.4, and OFED 1.4rc3 > >> > >> They all have the same basic problem. I think one of them reported > >> "Error polling CQ" instead of "poll CQ failed". > >> > >> If I replace the connectX cards with regular DDR cards the problem goes > >> away. > >> > >> I'm getting quite stumped at this point and would appreciate any > >> suggestions or patches. > >> > >> Thanks, > >> Rick > > > > I forgot to mention- on Suse 11 I also tried a manually compiled 2.6.26.4 > > and 2.6.27.2 kernel, using the in kernel drivers. 
> > > > Thanks, > > Rick > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general -- Richard Warner Lead Systems Integrator Microway, Inc (508)732-5517 From arnd at arndb.de Tue Oct 28 07:52:24 2008 From: arnd at arndb.de (Arnd Bergmann) Date: Tue, 28 Oct 2008 15:52:24 +0100 Subject: [ofa-general] [PATCH] ib_core: Use weak ordering for data registered memory In-Reply-To: References: <20081026093253.GA11974@mtls03> <200810281318.13255.arnd@arndb.de> Message-ID: <200810281552.24886.arnd@arndb.de> On Tuesday 28 October 2008, Talpey, Thomas wrote: > > > >In the long run, I'd like to see something like that for all device drivers, > >so that a user space library can tell the kernel about its requirements. > > And vice-versa. If a driver or device cannot provide the requirement, it > needs to communicate that back to the requester. It all depends on how you define it. If the eager RDMA option asks specifically for strict ordering, you need to handle errors. If you define it so that not using eager RDMA passes a flag to allow relaxed ordering, you don't need to communicate back. Arnd <>< From tziporet at mellanox.co.il Tue Oct 28 07:36:44 2008 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Tue, 28 Oct 2008 16:36:44 +0200 Subject: [ofa-general] OFED October 27 2008 meeting summary on OFED 1.4 status Message-ID: <5D49E7A8952DC44FB38C38FA0D758EADD002D1@mtlexch01.mtl.com> OFED October 27 2008 meeting summary on OFED 1.4 status Meeting minutes on the web: http://www.openfabrics.org/txt/documentation/linux/EWG_meeting_minutes/ Meeting Summary: ============== 1. RC4 is planned for next week - Nov 4 2. NFS_RDMA will be enabled on kernel 2.6.27 only since it's not actually working on the distros Details: ====== 1.
Bugs status and decide on their priority
1262 cri andy.grover at oracle.com congestion hang with RDS - not critical
1298 cri Jeffrey.C.Becker at nasa.gov nfsrdma rh5.1 causes kernel panic - not critical - will disable it on RHEL5
1299 cri Jeffrey.C.Becker at nasa.gov nfs module is missing symbols in rh5.1 - not critical - same as above
1287 cri vlad at mellanox.co.il IPoIB datagram mode initial packet loss - not critical, need to check if it happened on OFED 1.3 too
1242 cri yannick.cote at qlogic.com kernel panic while running mpi2007 against ofed1.4 -- ib_... - on work
1301 maj andy.grover at oracle.com Can not load rds module on RH4 up7 - Olga will look at this
1221 maj Jeffrey.C.Becker at nasa.gov SLES10 sp2: remote logins via ssh fail due to rpcbind and... - will disable on this OS
1284 maj monis at voltaire.com Bonding - when eth bonding and IB bonding are configured,... - on work
1286 maj monis at voltaire.com bond does not failover correctly - on work
1308 maj vlad at mellanox.co.il path_rec_completion [ib_ipoib] kernel Oops (Unable to han...- on work
1164 maj yosefe at voltaire.com iperf over IPoIB UD fails for 100 tcp connections - reduce to normal
1288 maj yosefe at voltaire.com bug warning while disabling/enabling ports from he switch - check with Yossi
2. We had a discussion on NFS-RDMA since both RHEL 5.1 and SLES10 SP2 backports are not working well
We had a debate - do we take it out of OFED since it is not working on the distros
Leave it in: We can have bug fixes for 1.4.1, and give customers a platform to play with
Take it out: If someone tries it on a distro, the experience can be problematic
Decision: We will leave it for 2.6.27 kernel only. All testing should be done on this kernel, mainly to see that basic functionality is working
3. Decide on RC4 target date - Nov 4
Betsy will check the compilation warnings status - and see if we can have more improvements for RC4
4. Reviewed BOF presentation from Woody. General content is good.
For OFED 1.5 - we will add a list of OSes whose support will be dropped From yevgenyp at mellanox.co.il Tue Oct 28 08:49:34 2008 From: yevgenyp at mellanox.co.il (Yevgeny Petrilin) Date: Tue, 28 Oct 2008 17:49:34 +0200 Subject: [ofa-general][PATCH 1/3]mlx4: Multiple completion vectors support Message-ID: <4907348E.7060508@mellanox.co.il> The driver now creates a completion EQ for every CPU. When allocating a CQ, a ULP specifies the completion vector number it wants the CQ to be attached to. The number of completion vectors is advertised via ib_device.num_comp_vectors. Signed-off-by: Yevgeny Petrilin --- drivers/infiniband/hw/mlx4/cq.c | 2 +- drivers/infiniband/hw/mlx4/main.c | 2 +- drivers/net/mlx4/cq.c | 14 ++++++++-- drivers/net/mlx4/en_cq.c | 9 ++++-- drivers/net/mlx4/en_main.c | 4 +- drivers/net/mlx4/eq.c | 47 ++++++++++++++++++++++++------------ drivers/net/mlx4/main.c | 14 ++++++---- drivers/net/mlx4/mlx4.h | 4 +- include/linux/mlx4/device.h | 4 ++- 9 files changed, 65 insertions(+), 35 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index d0866a3..5de41bd 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -222,7 +222,7 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector } err = mlx4_cq_alloc(dev->dev, entries, &cq->buf.mtt, uar, - cq->db.dma, &cq->mcq, 0); + cq->db.dma, &cq->mcq, vector, 0); if (err) goto err_dbmap; diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index 2e80f8f..dcefe1f 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -578,7 +578,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev) mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_IB) ibdev->num_ports++; ibdev->ib_dev.phys_port_cnt = ibdev->num_ports; - ibdev->ib_dev.num_comp_vectors = 1; + ibdev->ib_dev.num_comp_vectors = dev->caps.num_comp_vectors; ibdev->ib_dev.dma_device = &dev->pdev->dev; ibdev->ib_dev.uverbs_abi_ver =
MLX4_IB_UVERBS_ABI_VERSION; diff --git a/drivers/net/mlx4/cq.c b/drivers/net/mlx4/cq.c index b7ad282..a675e85 100644 --- a/drivers/net/mlx4/cq.c +++ b/drivers/net/mlx4/cq.c @@ -189,7 +189,7 @@ EXPORT_SYMBOL_GPL(mlx4_cq_resize); int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, struct mlx4_uar *uar, u64 db_rec, struct mlx4_cq *cq, - int collapsed) + unsigned vector, int collapsed) { struct mlx4_priv *priv = mlx4_priv(dev); struct mlx4_cq_table *cq_table = &priv->cq_table; @@ -227,7 +227,15 @@ int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, cq_context->flags = cpu_to_be32(!!collapsed << 18); cq_context->logsize_usrpage = cpu_to_be32((ilog2(nent) << 24) | uar->index); - cq_context->comp_eqn = priv->eq_table.eq[MLX4_EQ_COMP].eqn; + + if (vector >= dev->caps.num_comp_vectors) { + err = -EINVAL; + goto err_radix; + } + + cq->comp_eq_idx = MLX4_EQ_COMP_CPU0 + vector; + cq_context->comp_eqn = priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + + vector].eqn; cq_context->log_page_size = mtt->page_shift - MLX4_ICM_PAGE_SHIFT; mtt_addr = mlx4_mtt_addr(dev, mtt); @@ -276,7 +284,7 @@ void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq) if (err) mlx4_warn(dev, "HW2SW_CQ failed (%d) for CQN %06x\n", err, cq->cqn); - synchronize_irq(priv->eq_table.eq[MLX4_EQ_COMP].irq); + synchronize_irq(priv->eq_table.eq[cq->comp_eq_idx].irq); spin_lock_irq(&cq_table->lock); radix_tree_delete(&cq_table->tree, cq->cqn); diff --git a/drivers/net/mlx4/en_cq.c b/drivers/net/mlx4/en_cq.c index 1368a80..8f388e8 100644 --- a/drivers/net/mlx4/en_cq.c +++ b/drivers/net/mlx4/en_cq.c @@ -51,10 +51,13 @@ int mlx4_en_create_cq(struct mlx4_en_priv *priv, int err; cq->size = entries; - if (mode == RX) + if (mode == RX) { cq->buf_size = cq->size * sizeof(struct mlx4_cqe); - else + cq->vector = ring % mdev->dev->caps.num_comp_vectors; + } else { cq->buf_size = sizeof(struct mlx4_cqe); + cq->vector = 0; + } cq->ring = ring; cq->is_tx = mode; @@ -86,7 +89,7 @@ int 
mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq) memset(cq->buf, 0, cq->buf_size); err = mlx4_cq_alloc(mdev->dev, cq->size, &cq->wqres.mtt, &mdev->priv_uar, - cq->wqres.db.dma, &cq->mcq, cq->is_tx); + cq->wqres.db.dma, &cq->mcq, cq->vector, cq->is_tx); if (err) return err; diff --git a/drivers/net/mlx4/en_main.c b/drivers/net/mlx4/en_main.c index 1b0eebf..7423bf9 100644 --- a/drivers/net/mlx4/en_main.c +++ b/drivers/net/mlx4/en_main.c @@ -171,9 +171,9 @@ static void *mlx4_en_add(struct mlx4_dev *dev) mlx4_info(mdev, "Using %d tx rings for port:%d\n", mdev->profile.prof[i].tx_ring_num, i); if (!mdev->profile.prof[i].rx_ring_num) { - mdev->profile.prof[i].rx_ring_num = 1; + mdev->profile.prof[i].rx_ring_num = dev->caps.num_comp_vectors; mlx4_info(mdev, "Defaulting to %d rx rings for port:%d\n", - 1, i); + mdev->profile.prof[i].rx_ring_num, i); } else mlx4_info(mdev, "Using %d rx rings for port:%d\n", mdev->profile.prof[i].rx_ring_num, i); diff --git a/drivers/net/mlx4/eq.c b/drivers/net/mlx4/eq.c index de16933..b436234 100644 --- a/drivers/net/mlx4/eq.c +++ b/drivers/net/mlx4/eq.c @@ -266,7 +266,7 @@ static irqreturn_t mlx4_interrupt(int irq, void *dev_ptr) writel(priv->eq_table.clr_mask, priv->eq_table.clr_int); - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors; ++i) work |= mlx4_eq_int(dev, &priv->eq_table.eq[i]); return IRQ_RETVAL(work); @@ -483,7 +483,7 @@ static void mlx4_free_irqs(struct mlx4_dev *dev) if (eq_table->have_irq) free_irq(dev->pdev->irq, dev); - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors; ++i) if (eq_table->eq[i].have_irq) free_irq(eq_table->eq[i].irq, eq_table->eq + i); } @@ -554,6 +554,7 @@ void mlx4_unmap_eq_icm(struct mlx4_dev *dev) int mlx4_init_eq_table(struct mlx4_dev *dev) { struct mlx4_priv *priv = mlx4_priv(dev); + int req_eqs; int err; int i; @@ -574,11 +575,21 @@ int mlx4_init_eq_table(struct mlx4_dev *dev) 
priv->eq_table.clr_int = priv->clr_base + (priv->eq_table.inta_pin < 32 ? 4 : 0); - err = mlx4_create_eq(dev, dev->caps.num_cqs + MLX4_NUM_SPARE_EQE, - (dev->flags & MLX4_FLAG_MSI_X) ? MLX4_EQ_COMP : 0, - &priv->eq_table.eq[MLX4_EQ_COMP]); - if (err) - goto err_out_unmap; + dev->caps.num_comp_vectors = 0; + req_eqs = (dev->flags & MLX4_FLAG_MSI_X) ? num_online_cpus() : 1; + while (req_eqs) { + err = mlx4_create_eq( + dev, dev->caps.num_cqs + MLX4_NUM_SPARE_EQE, + (dev->flags & MLX4_FLAG_MSI_X) ? + (MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors) : 0, + &priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + + dev->caps.num_comp_vectors]); + if (err) + goto err_out_comp; + + dev->caps.num_comp_vectors++; + req_eqs--; + } err = mlx4_create_eq(dev, MLX4_NUM_ASYNC_EQE + MLX4_NUM_SPARE_EQE, (dev->flags & MLX4_FLAG_MSI_X) ? MLX4_EQ_ASYNC : 0, @@ -587,12 +598,16 @@ int mlx4_init_eq_table(struct mlx4_dev *dev) goto err_out_comp; if (dev->flags & MLX4_FLAG_MSI_X) { - static const char *eq_name[] = { - [MLX4_EQ_COMP] = DRV_NAME " (comp)", - [MLX4_EQ_ASYNC] = DRV_NAME " (async)" - }; + static char eq_name[MLX4_NUM_EQ][20]; + + for (i = 0; i < MLX4_EQ_COMP_CPU0 + + dev->caps.num_comp_vectors; ++i) { + if (i == 0) + snprintf(eq_name[0], 20, DRV_NAME "(async)"); + else + snprintf(eq_name[i], 20, "comp_" DRV_NAME "%d", + i - 1); - for (i = 0; i < MLX4_NUM_EQ; ++i) { err = request_irq(priv->eq_table.eq[i].irq, mlx4_msi_x_interrupt, 0, eq_name[i], priv->eq_table.eq + i); @@ -617,7 +632,7 @@ int mlx4_init_eq_table(struct mlx4_dev *dev) mlx4_warn(dev, "MAP_EQ for async EQ %d failed (%d)\n", priv->eq_table.eq[MLX4_EQ_ASYNC].eqn, err); - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors; ++i) eq_set_ci(&priv->eq_table.eq[i], 1); return 0; @@ -626,9 +641,9 @@ err_out_async: mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_ASYNC]); err_out_comp: - mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_COMP]); + for (i = 0; i < dev->caps.num_comp_vectors; ++i) + 
mlx4_free_eq(dev, &priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + i]); -err_out_unmap: mlx4_unmap_clr_int(dev); mlx4_free_irqs(dev); @@ -647,7 +662,7 @@ void mlx4_cleanup_eq_table(struct mlx4_dev *dev) mlx4_free_irqs(dev); - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < MLX4_EQ_COMP_CPU0 + dev->caps.num_comp_vectors; ++i) mlx4_free_eq(dev, &priv->eq_table.eq[i]); mlx4_unmap_clr_int(dev); diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index 468921b..aaf3eec 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -901,22 +901,24 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev) { struct mlx4_priv *priv = mlx4_priv(dev); struct msix_entry entries[MLX4_NUM_EQ]; + int needed_vectors = MLX4_EQ_COMP_CPU0 + num_online_cpus(); int err; int i; if (msi_x) { - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < needed_vectors; ++i) entries[i].entry = i; - err = pci_enable_msix(dev->pdev, entries, ARRAY_SIZE(entries)); + err = pci_enable_msix(dev->pdev, entries, needed_vectors); if (err) { if (err > 0) - mlx4_info(dev, "Only %d MSI-X vectors available, " - "not using MSI-X\n", err); + mlx4_info(dev, "Only %d MSI-X vectors " + "available, need %d. 
Not using MSI-X\n", + err, needed_vectors); goto no_msi; } - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < needed_vectors; ++i) priv->eq_table.eq[i].irq = entries[i].vector; dev->flags |= MLX4_FLAG_MSI_X; @@ -924,7 +926,7 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev) } no_msi: - for (i = 0; i < MLX4_NUM_EQ; ++i) + for (i = 0; i < needed_vectors; ++i) priv->eq_table.eq[i].irq = dev->pdev->irq; } diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h index fa431fa..612abe6 100644 --- a/drivers/net/mlx4/mlx4.h +++ b/drivers/net/mlx4/mlx4.h @@ -64,8 +64,8 @@ enum { enum { MLX4_EQ_ASYNC, - MLX4_EQ_COMP, - MLX4_NUM_EQ + MLX4_EQ_COMP_CPU0, + MLX4_NUM_EQ = MLX4_EQ_COMP_CPU0 + NR_CPUS }; enum { diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index bd9977b..6228b97 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -205,6 +205,7 @@ struct mlx4_caps { int reserved_cqs; int num_eqs; int reserved_eqs; + int num_comp_vectors; int num_mpts; int num_mtt_segs; int fmr_reserved_mtts; @@ -327,6 +328,7 @@ struct mlx4_cq { int arm_sn; int cqn; + int comp_eq_idx; atomic_t refcount; struct completion free; @@ -436,7 +438,7 @@ void mlx4_free_hwq_res(struct mlx4_dev *mdev, struct mlx4_hwq_resources *wqres, int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, struct mlx4_uar *uar, u64 db_rec, struct mlx4_cq *cq, - int collapsed); + unsigned vector, int collapsed); void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq); int mlx4_qp_reserve_range(struct mlx4_dev *dev, int cnt, int align, int *base); -- 1.5.4 From yevgenyp at mellanox.co.il Tue Oct 28 08:50:26 2008 From: yevgenyp at mellanox.co.il (Yevgeny Petrilin) Date: Tue, 28 Oct 2008 17:50:26 +0200 Subject: [ofa-general][PATCH 2/3] mlx4: Default value for automatic completion vector selection Message-ID: <490734C2.1070700@mellanox.co.il> When the vector number passed to mlx4_cq_alloc is MLX4_LEAST_ATTACHED_VECTOR (0xffffffff), the driver 
selects the completion vector that has the fewest CQs attached to it and attaches the CQ to the chosen vector. IB_CQ_VECTOR_LEAST_ATTACHED is defined in rdma/ib_verbs.h; when the mlx4_ib driver receives this CQ vector number, it passes MLX4_LEAST_ATTACHED_VECTOR on CQ creation. Signed-off-by: Yevgeny Petrilin --- drivers/infiniband/hw/mlx4/cq.c | 4 +++- drivers/net/mlx4/cq.c | 22 +++++++++++++++++++++- drivers/net/mlx4/en_cq.c | 2 +- drivers/net/mlx4/mlx4.h | 1 + include/linux/mlx4/device.h | 2 ++ include/rdma/ib_verbs.h | 10 +++++++++- 6 files changed, 37 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 5de41bd..384e616 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -222,7 +222,9 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector } err = mlx4_cq_alloc(dev->dev, entries, &cq->buf.mtt, uar, - cq->db.dma, &cq->mcq, vector, 0); + cq->db.dma, &cq->mcq, + vector == IB_CQ_VECTOR_LEAST_ATTACHED ?
+ MLX4_LEAST_ATTACHED_VECTOR : vector, 0); if (err) goto err_dbmap; diff --git a/drivers/net/mlx4/cq.c b/drivers/net/mlx4/cq.c index a675e85..31a5190 100644 --- a/drivers/net/mlx4/cq.c +++ b/drivers/net/mlx4/cq.c @@ -187,6 +187,22 @@ int mlx4_cq_resize(struct mlx4_dev *dev, struct mlx4_cq *cq, } EXPORT_SYMBOL_GPL(mlx4_cq_resize); +static int mlx4_find_least_loaded_vector(struct mlx4_priv *priv) +{ + int i; + int index = 0; + int min = priv->eq_table.eq[MLX4_EQ_COMP_CPU0].load; + + for (i = 1; i < priv->dev.caps.num_comp_vectors; i++) { + if (priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + i].load < min) { + index = i; + min = priv->eq_table.eq[MLX4_EQ_COMP_CPU0 + i].load; + } + } + + return index; +} + int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, struct mlx4_uar *uar, u64 db_rec, struct mlx4_cq *cq, unsigned vector, int collapsed) @@ -228,7 +244,9 @@ int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, cq_context->flags = cpu_to_be32(!!collapsed << 18); cq_context->logsize_usrpage = cpu_to_be32((ilog2(nent) << 24) | uar->index); - if (vector >= dev->caps.num_comp_vectors) { + if (vector == MLX4_LEAST_ATTACHED_VECTOR) + vector = mlx4_find_least_loaded_vector(priv); + else if (vector >= dev->caps.num_comp_vectors) { err = -EINVAL; goto err_radix; } @@ -248,6 +266,7 @@ int mlx4_cq_alloc(struct mlx4_dev *dev, int nent, struct mlx4_mtt *mtt, if (err) goto err_radix; + priv->eq_table.eq[cq->comp_eq_idx].load++; cq->cons_index = 0; cq->arm_sn = 1; cq->uar = uar; @@ -285,6 +304,7 @@ void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq) mlx4_warn(dev, "HW2SW_CQ failed (%d) for CQN %06x\n", err, cq->cqn); synchronize_irq(priv->eq_table.eq[cq->comp_eq_idx].irq); + priv->eq_table.eq[cq->comp_eq_idx].load--; spin_lock_irq(&cq_table->lock); radix_tree_delete(&cq_table->tree, cq->cqn); diff --git a/drivers/net/mlx4/en_cq.c b/drivers/net/mlx4/en_cq.c index 8f388e8..9fd9eab 100644 --- a/drivers/net/mlx4/en_cq.c +++ 
b/drivers/net/mlx4/en_cq.c @@ -56,7 +56,7 @@ int mlx4_en_create_cq(struct mlx4_en_priv *priv, cq->vector = ring % mdev->dev->caps.num_comp_vectors; } else { cq->buf_size = sizeof(struct mlx4_cqe); - cq->vector = 0; + cq->vector = MLX4_LEAST_ATTACHED_VECTOR; } cq->ring = ring; diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h index 612abe6..1b953ab 100644 --- a/drivers/net/mlx4/mlx4.h +++ b/drivers/net/mlx4/mlx4.h @@ -145,6 +145,7 @@ struct mlx4_eq { u16 irq; u16 have_irq; int nent; + int load; struct mlx4_buf_list *page_list; struct mlx4_mtt mtt; }; diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index 6228b97..f9638e5 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -169,6 +169,8 @@ enum { MLX4_NUM_FEXCH = 64 * 1024, }; +#define MLX4_LEAST_ATTACHED_VECTOR 0xffffffff + static inline u64 mlx4_fw_ver(u64 major, u64 minor, u64 subminor) { return (major << 32) | (minor << 16) | subminor; diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 936e333..e76e028 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -1448,6 +1448,13 @@ static inline int ib_post_recv(struct ib_qp *qp, return qp->device->post_recv(qp, recv_wr, bad_recv_wr); } +/* + * IB_CQ_VECTOR_LEAST_ATTACHED: The constant specifies that + * the CQ will be attached to the completion vector that has + * the least number of CQs already attached to it. + */ +#define IB_CQ_VECTOR_LEAST_ATTACHED 0xffffffff + /** * ib_create_cq - Creates a CQ on the specified device. * @device: The device on which to create the CQ. @@ -1459,7 +1466,8 @@ static inline int ib_post_recv(struct ib_qp *qp, * the associated completion and event handlers. * @cqe: The minimum size of the CQ. * @comp_vector - Completion vector used to signal completion events. - * Must be >= 0 and < context->num_comp_vectors. + * Must be >= 0 and < context->num_comp_vectors + * or IB_CQ_VECTOR_LEAST_ATTACHED. 
* * Users can examine the cq structure to determine the actual CQ size. */ -- 1.5.4 From yevgenyp at mellanox.co.il Tue Oct 28 08:51:16 2008 From: yevgenyp at mellanox.co.il (Yevgeny Petrilin) Date: Tue, 28 Oct 2008 17:51:16 +0200 Subject: [ofa-general][PATCH 3/3] mlx4_core: Auto negotiation support Message-ID: <490734F4.6060403@mellanox.co.il> Whenever the port link is down (except during driver restart) and the port is configured for auto sensing, we try to sense the port configuration in order to determine how to initialize the port. If the port type needs to be changed, all interfaces are unregistered and then registered again with the new port types. Sensing is done at intervals of 3 seconds. The initial port configuration is set to sense the link type. Signed-off-by: Yevgeny Petrilin --- drivers/net/mlx4/Makefile | 2 +- drivers/net/mlx4/eq.c | 16 +++-- drivers/net/mlx4/intf.c | 4 + drivers/net/mlx4/main.c | 97 +++++++++++++++++++++++------ drivers/net/mlx4/mlx4.h | 24 +++++++ drivers/net/mlx4/sense.c | 144 +++++++++++++++++++++++++++++++++++++++++++ include/linux/mlx4/cmd.h | 1 + include/linux/mlx4/device.h | 6 +- 8 files changed, 265 insertions(+), 29 deletions(-) create mode 100644 drivers/net/mlx4/sense.c diff --git a/drivers/net/mlx4/Makefile b/drivers/net/mlx4/Makefile index a7a97bf..21040a0 100644 --- a/drivers/net/mlx4/Makefile +++ b/drivers/net/mlx4/Makefile @@ -1,7 +1,7 @@ obj-$(CONFIG_MLX4_CORE) += mlx4_core.o mlx4_core-y := alloc.o catas.o cmd.o cq.o eq.o fw.o icm.o intf.o main.o mcg.o \ - mr.o pd.o port.o profile.o qp.o reset.o srq.o + mr.o pd.o port.o profile.o qp.o reset.o sense.o srq.o obj-$(CONFIG_MLX4_EN) += mlx4_en.o diff --git a/drivers/net/mlx4/eq.c b/drivers/net/mlx4/eq.c index b436234..bd3ce60 100644 --- a/drivers/net/mlx4/eq.c +++ b/drivers/net/mlx4/eq.c @@ -163,6 +163,7 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq) int cqn; int eqes_found = 0; int set_ci = 0; + int port; while ((eqe = next_eqe_sw(eq))) { /* @@ -203,11 +204,16 @@
static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq) break; case MLX4_EVENT_TYPE_PORT_CHANGE: - mlx4_dispatch_event(dev, - eqe->subtype == MLX4_PORT_CHANGE_SUBTYPE_ACTIVE ? - MLX4_DEV_EVENT_PORT_UP : - MLX4_DEV_EVENT_PORT_DOWN, - be32_to_cpu(eqe->event.port_change.port) >> 28); + port = be32_to_cpu(eqe->event.port_change.port) >> 28; + if (eqe->subtype == MLX4_PORT_CHANGE_SUBTYPE_DOWN) { + mlx4_dispatch_event(dev, MLX4_DEV_EVENT_PORT_DOWN, + port); + mlx4_priv(dev)->sense.do_sense_port[port] = 1; + } else { + mlx4_dispatch_event(dev, MLX4_DEV_EVENT_PORT_UP, + port); + mlx4_priv(dev)->sense.do_sense_port[port] = 0; + } break; case MLX4_EVENT_TYPE_CQ_ERROR: diff --git a/drivers/net/mlx4/intf.c b/drivers/net/mlx4/intf.c index 0e7eb10..30ef000 100644 --- a/drivers/net/mlx4/intf.c +++ b/drivers/net/mlx4/intf.c @@ -141,6 +141,8 @@ int mlx4_register_device(struct mlx4_dev *dev) mutex_unlock(&intf_mutex); mlx4_start_catas_poll(dev); + mlx4_start_sense(dev); + return 0; } @@ -149,6 +151,8 @@ void mlx4_unregister_device(struct mlx4_dev *dev) struct mlx4_priv *priv = mlx4_priv(dev); struct mlx4_interface *intf; + mlx4_stop_sense(dev); + mlx4_stop_catas_poll(dev); mutex_lock(&intf_mutex); diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index aaf3eec..4ff4789 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -98,24 +98,26 @@ module_param_named(use_prio, use_prio, bool, 0444); MODULE_PARM_DESC(use_prio, "Enable steering by VLAN priority on ETH ports " "(0/1, default 0)"); -static int mlx4_check_port_params(struct mlx4_dev *dev, +int mlx4_check_port_params(struct mlx4_dev *dev, enum mlx4_port_type *port_type) { int i; for (i = 0; i < dev->caps.num_ports - 1; i++) { - if (port_type[i] != port_type[i+1] && - !(dev->caps.flags & MLX4_DEV_CAP_FLAG_DPDP)) { - mlx4_err(dev, "Only same port types supported " - "on this HCA, aborting.\n"); - return -EINVAL; + if (port_type[i] != port_type[i+1]) { + if (!(dev->caps.flags & 
MLX4_DEV_CAP_FLAG_DPDP)) { + mlx4_err(dev, "Only same port types supported " + "on this HCA, aborting.\n"); + return -EINVAL; + } + if ((port_type[i] == MLX4_PORT_TYPE_ETH) || + (port_type[i+1] == MLX4_PORT_TYPE_IB)) { + mlx4_err(dev, "Given ports configuration is " + "not supported\n"); + return -EINVAL; + } } } - if ((port_type[0] == MLX4_PORT_TYPE_ETH) && - (port_type[1] == MLX4_PORT_TYPE_IB)) { - mlx4_err(dev, "eth-ib configuration is not supported.\n"); - return -EINVAL; - } for (i = 0; i < dev->caps.num_ports; i++) { if (!(port_type[i] & dev->caps.supported_type[i+1])) { @@ -225,6 +227,9 @@ static int mlx4_dev_cap(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap) dev->caps.port_type[i] = MLX4_PORT_TYPE_IB; else dev->caps.port_type[i] = MLX4_PORT_TYPE_ETH; + dev->caps.possible_type[i] = dev->caps.supported_type[i]; + mlx4_priv(dev)->sense.sense_allowed[i] = + dev->caps.supported_type[i] == MLX4_PORT_TYPE_AUTO ? 1 : 0; if (dev->caps.log_num_macs > dev_cap->log_max_macs[i]) { dev->caps.log_num_macs = dev_cap->log_max_macs[i]; @@ -263,7 +268,7 @@ static int mlx4_dev_cap(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap) * Change the port configuration of the device. * Every user of this function must hold the port mutex. 
*/ -static int mlx4_change_port_types(struct mlx4_dev *dev, +int mlx4_change_port_types(struct mlx4_dev *dev, enum mlx4_port_type *port_types) { int err = 0; @@ -274,6 +279,8 @@ static int mlx4_change_port_types(struct mlx4_dev *dev, if (port_types[port] != dev->caps.port_type[port + 1]) { change = 1; dev->caps.port_type[port + 1] = port_types[port]; + if (dev->caps.possible_type[port + 1] != MLX4_PORT_TYPE_AUTO) + dev->caps.possible_type[port + 1] = port_types[port]; } } if (change) { @@ -302,10 +309,17 @@ static ssize_t show_port_type(struct device *dev, struct mlx4_port_info *info = container_of(attr, struct mlx4_port_info, port_attr); struct mlx4_dev *mdev = info->dev; + char type[8]; + + sprintf(type, "%s", + (mdev->caps.port_type[info->port] == MLX4_PORT_TYPE_IB) ? + "ib" : "eth"); + if (mdev->caps.possible_type[info->port] == MLX4_PORT_TYPE_AUTO) + sprintf(buf, "auto (%s)\n", type); + else + sprintf(buf, "%s\n", type); - return sprintf(buf, "%s\n", - mdev->caps.port_type[info->port] == MLX4_PORT_TYPE_IB ? - "ib" : "eth"); + return strlen(buf); } static ssize_t set_port_type(struct device *dev, @@ -324,6 +338,8 @@ static ssize_t set_port_type(struct device *dev, info->tmp_type = MLX4_PORT_TYPE_IB; else if (!strcmp(buf, "eth\n")) info->tmp_type = MLX4_PORT_TYPE_ETH; + else if (!strcmp(buf, "auto\n")) + info->tmp_type = MLX4_PORT_TYPE_AUTO; else { mlx4_err(mdev, "%s is not supported port type\n", buf); return -EINVAL; @@ -332,14 +348,19 @@ static ssize_t set_port_type(struct device *dev, mutex_lock(&priv->port_mutex); for (i = 0; i < mdev->caps.num_ports; i++) types[i] = priv->port[i+1].tmp_type ? 
priv->port[i+1].tmp_type : - mdev->caps.port_type[i+1]; + mdev->caps.possible_type[i+1]; err = mlx4_check_port_params(mdev, types); if (err) goto out; - for (i = 1; i <= mdev->caps.num_ports; i++) - priv->port[i].tmp_type = 0; + for (i = 0; i < mdev->caps.num_ports; i++) { + mdev->caps.possible_type[i + 1] = types[i]; + if (types[i] == MLX4_PORT_TYPE_AUTO) + types[i] = mdev->caps.port_type[i + 1]; + + priv->port[i + 1].tmp_type = 0; + } err = mlx4_change_port_types(mdev, types); @@ -963,6 +984,32 @@ static void mlx4_cleanup_port_info(struct mlx4_port_info *info) device_remove_file(&info->dev->pdev->dev, &info->port_attr); } +static void mlx4_set_actual_type(struct mlx4_dev *dev) +{ + enum mlx4_port_type stype[dev->caps.num_ports]; + int i; + + if (!(dev->caps.flags & MLX4_DEV_CAP_FLAG_DPDP)) + return; + + for (i = 1; i <= dev->caps.num_ports; i++) { + stype[i-1] = 0; + if (mlx4_priv(dev)->sense.sense_allowed[i] && + dev->caps.possible_type[i] == MLX4_PORT_TYPE_AUTO) { + if (mlx4_SENSE_PORT(dev, i, &stype[i-1])) + return; + } + if (!stype[i-1]) + stype[i-1] = dev->caps.port_type[i]; + } + + if (!mlx4_check_port_params(dev, stype)) { + for (i = 1; i <= dev->caps.num_ports; i++) + dev->caps.port_type[i] = stype[i-1]; + } + mlx4_set_port_mask(dev); +} + static int __mlx4_init_one(struct pci_dev *pdev, const struct pci_device_id *id) { struct mlx4_priv *priv; @@ -1086,14 +1133,23 @@ static int __mlx4_init_one(struct pci_dev *pdev, const struct pci_device_id *id) goto err_port; } - err = mlx4_register_device(dev); + mlx4_set_actual_type(dev); + + err = mlx4_sense_init(dev); if (err) goto err_port; + err = mlx4_register_device(dev); + if (err) + goto err_sense; + pci_set_drvdata(pdev, dev); return 0; +err_sense: + mlx4_sense_cleanup(dev); + err_port: for (port = 1; port <= dev->caps.num_ports; port++) mlx4_cleanup_port_info(&priv->port[port]); @@ -1153,12 +1209,11 @@ static void mlx4_remove_one(struct pci_dev *pdev) if (dev) { mlx4_unregister_device(dev); - + 
mlx4_sense_cleanup(dev); for (p = 1; p <= dev->caps.num_ports; p++) { mlx4_cleanup_port_info(&priv->port[p]); mlx4_CLOSE_PORT(dev, p); } - mlx4_cleanup_mcg_table(dev); mlx4_cleanup_qp_table(dev); mlx4_cleanup_srq_table(dev); diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h index 1b953ab..e44454f 100644 --- a/drivers/net/mlx4/mlx4.h +++ b/drivers/net/mlx4/mlx4.h @@ -40,6 +40,7 @@ #include #include #include +#include #include #include @@ -285,6 +287,15 @@ struct mlx4_port_info { struct mlx4_vlan_table vlan_table; }; +struct mlx4_sense { + struct mlx4_dev *dev; + u8 do_sense_port[MLX4_MAX_PORTS + 1]; + u8 sense_allowed[MLX4_MAX_PORTS + 1]; + struct delayed_work sense_poll; + struct workqueue_struct *sense_wq; + u32 resched; +}; + struct mlx4_priv { struct mlx4_dev dev; @@ -314,6 +325,7 @@ struct mlx4_priv { struct mlx4_uar driver_uar; void __iomem *kar; struct mlx4_port_info port[MLX4_MAX_PORTS + 1]; + struct mlx4_sense sense; struct mutex port_mutex; }; @@ -322,6 +334,8 @@ static inline struct mlx4_priv *mlx4_priv(struct mlx4_dev *dev) return container_of(dev, struct mlx4_priv, dev); } +#define MLX4_SENSE_RANGE (HZ * 3) + u32 mlx4_bitmap_alloc(struct mlx4_bitmap *bitmap); void mlx4_bitmap_free(struct mlx4_bitmap *bitmap, u32 obj); u32 mlx4_bitmap_alloc_range(struct mlx4_bitmap *bitmap, int cnt, int align); @@ -385,6 +399,16 @@ void mlx4_srq_event(struct mlx4_dev *dev, u32 srqn, int event_type); void mlx4_handle_catas_err(struct mlx4_dev *dev); +void mlx4_start_sense(struct mlx4_dev *dev); +void mlx4_stop_sense(struct mlx4_dev *dev); +int mlx4_sense_init(struct mlx4_dev *dev); +void mlx4_sense_cleanup(struct mlx4_dev *dev); +int mlx4_SENSE_PORT(struct mlx4_dev *dev, int port, enum mlx4_port_type *type); +int mlx4_check_port_params(struct mlx4_dev *dev, + enum mlx4_port_type *port_type); +int mlx4_change_port_types(struct mlx4_dev *dev, + enum mlx4_port_type *port_types); + void mlx4_init_mac_table(struct mlx4_dev *dev, struct mlx4_mac_table *table); void 
mlx4_init_vlan_table(struct mlx4_dev *dev, struct mlx4_vlan_table *table); diff --git a/drivers/net/mlx4/sense.c b/drivers/net/mlx4/sense.c new file mode 100644 index 0000000..f4b8d73 --- /dev/null +++ b/drivers/net/mlx4/sense.c @@ -0,0 +1,144 @@ +/* + * Copyright (c) 2007 Mellanox Technologies. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +#include +#include + +#include + +#include "mlx4.h" + + +int mlx4_SENSE_PORT(struct mlx4_dev *dev, int port, enum mlx4_port_type *type) +{ + u64 out_param; + int err = 0; + + err = mlx4_cmd_imm(dev, 0, &out_param, port, 0, + MLX4_CMD_SENSE_PORT, MLX4_CMD_TIME_CLASS_B); + if (err) { + mlx4_err(dev, "Sense command failed for port: %d\n", port); + return err; + } + + if (out_param > 2) { + mlx4_err(dev, "Sense returned illegal value: 0x%llx\n", out_param); + return EINVAL; + } + + *type = out_param; + return 0; +} + +static void mlx4_sense_port(struct work_struct *work) +{ + struct delayed_work *delay = container_of(work, struct delayed_work, work); + struct mlx4_sense *sense = container_of(delay, struct mlx4_sense, + sense_poll); + struct mlx4_dev *dev = sense->dev; + struct mlx4_priv *priv = mlx4_priv(dev); + enum mlx4_port_type stype[MLX4_MAX_PORTS]; + int err = 0; + int i; + + mutex_lock(&priv->port_mutex); + for (i = 1; i <= dev->caps.num_ports; i++) { + stype[i-1] = 0; + if (sense->do_sense_port[i] && sense->sense_allowed[i] && + dev->caps.possible_type[i] == MLX4_PORT_TYPE_AUTO) { + err = mlx4_SENSE_PORT(dev, i, &stype[i-1]); + if (err) + goto sense_again; + } + if (!stype[i-1]) + stype[i-1] = dev->caps.port_type[i]; + } + + if (mlx4_check_port_params(dev, stype)) + goto sense_again; + + if (mlx4_change_port_types(dev, stype)) + mlx4_err(dev, "Failed to change port_types\n"); + +sense_again: + mutex_unlock(&priv->port_mutex); + if (sense->resched) + queue_delayed_work(sense->sense_wq , &sense->sense_poll, + round_jiffies(MLX4_SENSE_RANGE)); +} + + +void mlx4_start_sense(struct mlx4_dev *dev) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + struct mlx4_sense *sense = &priv->sense; + + if (!(dev->caps.flags & MLX4_DEV_CAP_FLAG_DPDP)) + return; + + sense->resched = 1; + queue_delayed_work(sense->sense_wq , &sense->sense_poll, + round_jiffies(MLX4_SENSE_RANGE)); +} + + +void mlx4_stop_sense(struct mlx4_dev *dev) +{ + mlx4_priv(dev)->sense.resched = 0; +} 
+ +int mlx4_sense_init(struct mlx4_dev *dev) +{ + struct mlx4_priv *priv = mlx4_priv(dev); + struct mlx4_sense *sense = &priv->sense; + int port; + + sense->dev = dev; + sense->sense_wq = create_singlethread_workqueue("mlx4_sense"); + if (!sense->sense_wq) + return -ENOMEM; + + for (port = 1; port <= dev->caps.num_ports; port++) + sense->do_sense_port[port] = 1; + + INIT_DELAYED_WORK_DEFERRABLE(&sense->sense_poll, mlx4_sense_port); + + return 0; +} + +void mlx4_sense_cleanup(struct mlx4_dev *dev) +{ + cancel_delayed_work_sync(&mlx4_priv(dev)->sense.sense_poll); + destroy_workqueue(mlx4_priv(dev)->sense.sense_wq); +} + diff --git a/include/linux/mlx4/cmd.h b/include/linux/mlx4/cmd.h index cf9c679..0f82293 100644 --- a/include/linux/mlx4/cmd.h +++ b/include/linux/mlx4/cmd.h @@ -55,6 +55,7 @@ enum { MLX4_CMD_CLOSE_PORT = 0xa, MLX4_CMD_QUERY_HCA = 0xb, MLX4_CMD_QUERY_PORT = 0x43, + MLX4_CMD_SENSE_PORT = 0x4d, MLX4_CMD_SET_PORT = 0xc, MLX4_CMD_ACCESS_DDR = 0x2e, MLX4_CMD_MAP_ICM = 0xffa, diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index f9638e5..ed1371b 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -155,8 +155,9 @@ enum mlx4_qp_region { }; enum mlx4_port_type { - MLX4_PORT_TYPE_IB = 1 << 0, - MLX4_PORT_TYPE_ETH = 1 << 1, + MLX4_PORT_TYPE_IB = 1, + MLX4_PORT_TYPE_ETH = 2, + MLX4_PORT_TYPE_AUTO = 3 }; enum mlx4_special_vlan_idx { @@ -238,6 +239,7 @@ struct mlx4_caps { enum mlx4_port_type port_type[MLX4_MAX_PORTS + 1]; u8 supported_type[MLX4_MAX_PORTS + 1]; u32 port_mask; + enum mlx4_port_type possible_type[MLX4_MAX_PORTS + 1]; }; struct mlx4_buf_list { -- 1.5.4 From Jeffrey.C.Becker at nasa.gov Tue Oct 28 09:13:09 2008 From: Jeffrey.C.Becker at nasa.gov (Jeff Becker) Date: Tue, 28 Oct 2008 09:13:09 -0700 Subject: [ofa-general] OFED October 27 2008 meeting summary on OFED 1.4 status In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EADD002D1@mtlexch01.mtl.com> References: 
<5D49E7A8952DC44FB38C38FA0D758EADD002D1@mtlexch01.mtl.com> Message-ID: <49073A15.7050309@nasa.gov> Hi Tziporet. I'm sorry I was unable to attend the meeting. Tziporet Koren wrote: > OFED October 27 2008 meeting summary on OFED 1.4 status > > Meeting minutes on the web: > http://www.openfabrics.org/txt/documentation/linux/EWG_meeting_minutes/ > > Meeting Summary: > ============== > 1. RC4 is planed for next week - Nov 4 > 2. NFS_RDMA will be enabled on kernel 2.6.27 only since its not actually > working on the distros > > I think I can get NFS_RDMA ready on 2.6.27 (it's already close) and 2.6.26 for rc4. > Details: > ====== > 1. Bugs status and decide on their priority > > 1262 cri andy.grover at oracle.com congestion hang > with RDS - not critical > 1298 cri Jeffrey.C.Becker at nasa.gov nfsrdma rh5.1 causes > kernel panic - not critical - will disable it on RHEL5 > 1299 cri Jeffrey.C.Becker at nasa.gov nfs module is missing > symbols in rh5.1 - not critical - same as above > 1287 cri vlad at mellanox.co.il IPoIB datagram mode > initial packet loss - not critical need to check if it happened on OFED > 1.3 too > 1242 cri yannick.cote at qlogic.com kernel panic while > running mpi2007 against ofed1.4 -- ib_... - on work > 1301 maj andy.grover at oracle.com Can not load rds module on RH4 > up7 - Olga will look at this > 1221 maj Jeffrey.C.Becker at nasa.gov SLES10 sp2: remote > logins via ssh fail due to rpcbind and... - will disable on this OS > 1284 maj monis at voltaire.com Bonding - when eth > bonding and IB bonding are configured,... on work > 1286 maj monis at voltaire.com bond does not failover > correctly - on work > 1308 maj vlad at mellanox.co.il path_rec_completion > [ib_ipoib] kernel Oops (Unable to han...- on work > 1164 maj yosefe at voltaire.com iperf over IPoIB UD > fails for 100 tcp connections - reduce to normal > 1288 maj yosefe at voltaire.com bug warning while > disabling/enabling ports from he switch - check with Yossi > > 2. 
We had a discussion on NFS-RDMA since both RHEL 5.1 and SLES10 SP2 > backports are not working well > We had a debate - do we take it out of OFED since it is not working on > the distros > Leave it in: We can have bug fixes for 1.4.1, and give customers a > platform to play with > Take it out: If someone will try it on the distro experience can be > problematic > Decision: We will leave it for 2.6.27 kernel only. > All testing should be done on this kernel mainly to see that basic > functionality is working > Since I wasn't at the meeting, I'd like to clarify. I am currently working on the distro backports, and plan to have these ready for 1.4.1. Assuming I get this done, can we include these in 1.4.1? Thanks. -jeff > 3. Decide on RC4 target date - Nov 4 > Betsy will check the compilation warnings status - and see if we can > have more improvements for RC4 > > 4. Reviewed BOF presentation from Woody. > General content is good. > For OFED 1.5 - we will add list of OSes that their support will be > dropped > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From tziporet at dev.mellanox.co.il Tue Oct 28 09:13:57 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Tue, 28 Oct 2008 18:13:57 +0200 Subject: [ofa-general] Question about ibv_asyncwatch In-Reply-To: <92eddfb50810280607t4135d701p9ed16b3cb23023d8@mail.gmail.com> References: <92eddfb50810280607t4135d701p9ed16b3cb23023d8@mail.gmail.com> Message-ID: <49073A45.9040702@mellanox.co.il> Karthik Gopalakrishnan wrote: > Hi Folks. > > I have written a standalone program that calls > 'ibv_get_async_event()'. I want to know if that program can get async > events about errors on QPs (IBV_EVENT_PATH_MIG_ERR for example) that > are created by a different process (say some MPI Program). 
> I am not sure about this but I think this is impossible. Only the process who opened the QP can get the QP events > I also see a utility called 'ibv_asyncwatch' that is shipped as part > of OFED that seems to do something similar. I will be grateful if > someone could throw more light about what it does and point me to its > source. > Sources are under examples of libibverbs Tziporet From mashirle at us.ibm.com Tue Oct 28 09:17:35 2008 From: mashirle at us.ibm.com (Shirley Ma) Date: Tue, 28 Oct 2008 09:17:35 -0700 Subject: [ofa-general] mlx4: EEH test Message-ID: <1225210655.24021.9.camel@IBM-29AB850785D.beaverton.ibm.com> Anyone has done any EEH test for ConnectX? We hit some issue on PPC. I wonder whether it's being fully tested. Thanks Shirley From eli at dev.mellanox.co.il Tue Oct 28 11:49:24 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Tue, 28 Oct 2008 20:49:24 +0200 Subject: [ofa-general] poll CQ failed -2 with connectX In-Reply-To: <200810271838.48510.ricklist@microway.com> References: <200810271838.48510.ricklist@microway.com> Message-ID: <20081028184924.GA13206@mtls03> On Mon, Oct 27, 2008 at 06:38:48PM -0400, Rick Warner wrote: > Hi all, > > I am configuring an opteron cluster with connectX Infiniband. I have a > problem that if I run one of the NAS tests, it works the first, and maybe 2nd > time, but after that the jobs instantly fail with messages like this- > > [Rank 44][cm.c: line 860]poll CQ failed -2 > [Rank 51][cm.c: line 860]poll CQ failed -2 > [Rank 119][cm.c: line 860]poll CQ failed -2 > [Rank 85][cm.c: line 860]poll CQ failed -2 > [Rank 0][cm.c: line 860]poll CQ failed -2 > [Rank 9][cm.c: line 860]poll CQ failed -2 > [Rank 26][cm.c: line 860]poll CQ failed -2[Rank 43][cm.c: line 860] > poll CQ failed -2 > [Rank 94][cm.c: line 860]poll CQ failed -2 > [Rank 111][cm.c: line 860]poll CQ failed -2 This error means that a CQE was polled which belongs to a none existent QP. 
But, I do remember a case with an Opteron which experienced the same problem and eventually it appeared that it was a system problem that was resolved after a BIOS update. Can you check if there is an update to your system's BIOS? > > I can easily reproduce this with only 2 systems using a 16 process LU job, > class B. > > Here are the configs I've tried- > Suse 11 with distro provided IB driver and libraries,etc, using mvapich as > provided by ohio state > Suse 11 with distro driver, using OFED 1.3.1 libraries and mvapich > Suse 10.3 with OFED 1.3.1, OFED 1.2.5.4, and OFED 1.4rc3 > > They all have the same basic problem. I think one of them reported "Error > polling CQ" instead of "poll CQ failed". > > If I replace the connectX cards with regular DDR cards the problem goes away. > > I'm getting quite stumped at this point and would appreciate any suggestions > or patches. > > Thanks, > Rick > -- > Richard Warner > Lead Systems Integrator > Microway, Inc > (508)732-5517 > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From friedman at ucla.edu Tue Oct 28 12:11:12 2008 From: friedman at ucla.edu (Scott A. Friedman) Date: Tue, 28 Oct 2008 12:11:12 -0700 Subject: [ofa-general] ib_mthca catastrophic error detected In-Reply-To: <4907054E.9080205@mellanox.co.il> References: <4906645D.6010101@ucla.edu> <4907054E.9080205@mellanox.co.il> Message-ID: <490763D0.5020002@ucla.edu> Hi This cluster has OFED 1.2.5.4 running on it. 
The ib_mthca kernel module reports the following on startup:

ib_mthca: Mellanox InfiniBand HCA driver v1.0 (February 28, 2008)

The cards in all (22) of the nodes we have seen this error on are as follows:

hca_id: mthca0
        fw_ver: 1.2.0
        vendor_id: 0x02c9
        vendor_part_id: 25204
        hw_ver: 0xA0
        board_id: MT_03B0140001
        phys_port_cnt: 1

It appears that when this happens the driver restarts (loads?) itself; however, the job running at the time of the error is, of course, killed.

Scott

Tziporet Koren wrote:
>>
>> ib_mthca 0000:02:00.0: Catastrophic error detected: internal error
>>
> Can you specify:
> Which OFED version you use? (or IB from kernel.org)
> Which HCA and FW version?
>
> Tziporet
>
>
>

From rdreier at cisco.com Tue Oct 28 12:41:32 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 28 Oct 2008 12:41:32 -0700
Subject: [ofa-general] mlx4: EEH test
In-Reply-To: <1225210655.24021.9.camel@IBM-29AB850785D.beaverton.ibm.com> (Shirley Ma's message of "Tue, 28 Oct 2008 09:17:35 -0700")
References: <1225210655.24021.9.camel@IBM-29AB850785D.beaverton.ibm.com>
Message-ID:

> Anyone has done any EEH test for ConnectX? We hit some issue on PPC. I
> wonder whether it's being fully tested.

Not sure what you mean. The driver doesn't have PCI error handling support, if that's what you're asking.

From hal.rosenstock at gmail.com Tue Oct 28 13:26:02 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Tue, 28 Oct 2008 16:26:02 -0400
Subject: [ofa-general] Re: [PATCH] opensm: osm_send_trap144() function
In-Reply-To: <20081025175201.GP28713@sashak.voltaire.com>
References: <20081025175201.GP28713@sashak.voltaire.com>
Message-ID:

Sasha,

On Sat, Oct 25, 2008 at 1:52 PM, Sasha Khapyorsky wrote:
>
> Add ability to send trap 144 - osm_send_trap144() function. This can be
> useful when SMA doesn't support trap sending on some events, such as
> CapabilityMask change (ConnectX), OtherLocalChanges (no one supports
> this AFAIK).

What component beside the SMA would send the ones mentioned above ? Also, how would it know whether or not to do this ? -- Hal > Signed-off-by: Sasha Khapyorsky > --- > opensm/include/opensm/osm_sm.h | 23 +++++++++++++ > opensm/opensm/osm_req.c | 68 ++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 91 insertions(+), 0 deletions(-) > > diff --git a/opensm/include/opensm/osm_sm.h b/opensm/include/opensm/osm_sm.h > index 5d46246..bc87ea6 100644 > --- a/opensm/include/opensm/osm_sm.h > +++ b/opensm/include/opensm/osm_sm.h > @@ -769,5 +769,28 @@ ib_api_status_t osm_sm_state_mgr_check_legality(IN osm_sm_t *sm, > > void osm_report_sm_state(osm_sm_t *sm); > > +/****f* OpenSM: SM State Manager/osm_send_trap144 > +* NAME > +* osm_send_trap144 > +* > +* DESCRIPTION > +* Send trap 144 to the master SM. > +* > +* SYNOPSIS > +*/ > +int osm_send_trap144(osm_sm_t *sm, ib_net16_t local); > +/* > +* PARAMETERS > +* sm > +* [in] Pointer to an osm_sm_t object. > +* > +* local > +* [in] OtherLocalChanges mask in network byte order. > +* > +* RETURN VALUES > +* 0 on success, non-zero value otherwise. 
> +* > +*********/ > + > END_C_DECLS > #endif /* _OSM_SM_H_ */ > diff --git a/opensm/opensm/osm_req.c b/opensm/opensm/osm_req.c > index 5f93551..0865ce5 100644 > --- a/opensm/opensm/osm_req.c > +++ b/opensm/opensm/osm_req.c > @@ -210,3 +210,71 @@ Exit: > OSM_LOG_EXIT(sm->p_log); > return (status); > } > + > +int osm_send_trap144(osm_sm_t *sm, ib_net16_t local) > +{ > + osm_madw_t *madw; > + ib_smp_t *smp; > + ib_mad_notice_attr_t *ntc; > + osm_port_t *port; > + ib_port_info_t *pi; > + > + port = osm_get_port_by_guid(sm->p_subn, sm->p_subn->sm_port_guid); > + if (!port) { > + OSM_LOG(sm->p_log, OSM_LOG_ERROR, > + "ERR 1104: cannot find SM port by guid 0x%" PRIx64 "\n", > + cl_ntoh64(sm->p_subn->sm_port_guid)); > + return -1; > + } > + > + pi = &port->p_physp->port_info; > + > + /* don't bother with sending trap when SMA supports this */ > + if (!local && > + pi->capability_mask&(IB_PORT_CAP_HAS_TRAP|IB_PORT_CAP_HAS_CAP_NTC)) > + return 0; > + > + madw = osm_mad_pool_get(sm->p_mad_pool, > + osm_sm_mad_ctrl_get_bind_handle(&sm->mad_ctrl), > + MAD_BLOCK_SIZE, NULL); > + if (madw == NULL) { > + OSM_LOG(sm->p_log, OSM_LOG_ERROR, > + "ERR 1105: Unable to acquire MAD\n"); > + return -1; > + } > + > + madw->mad_addr.dest_lid = pi->master_sm_base_lid; > + madw->mad_addr.addr_type.smi.source_lid = pi->base_lid; > + madw->fail_msg = CL_DISP_MSGID_NONE; > + > + smp = osm_madw_get_smp_ptr(madw); > + memset(smp, 0, sizeof(*smp)); > + > + smp->base_ver = 1; > + smp->mgmt_class = IB_MCLASS_SUBN_LID; > + smp->class_ver = 1; > + smp->method = IB_MAD_METHOD_TRAP; > + smp->trans_id = cl_hton64((uint64_t)cl_atomic_inc(&sm->sm_trans_id)); > + smp->attr_id = IB_MAD_ATTR_NOTICE; > + smp->m_key = sm->p_subn->opt.m_key; > + > + ntc = (ib_mad_notice_attr_t *)smp->data; > + > + ntc->generic_type = 0x80 | IB_NOTICE_TYPE_INFO; > + ib_notice_set_prod_type_ho(ntc, IB_NODE_TYPE_CA); > + ntc->g_or_v.generic.trap_num = cl_hton16(144); > + ntc->issuer_lid = pi->base_lid; > + 
ntc->data_details.ntc_144.lid = pi->base_lid; > + ntc->data_details.ntc_144.local_changes = local ? > + TRAP_144_MASK_OTHER_LOCAL_CHANGES : 0; > + ntc->data_details.ntc_144.new_cap_mask = pi->capability_mask; > + ntc->data_details.ntc_144.change_flgs = local; > + > + OSM_LOG(sm->p_log, OSM_LOG_DEBUG, > + "Sending Trap 144, TID 0x%" PRIx64 " to SM lid %u\n", > + cl_ntoh64(smp->trans_id), cl_ntoh16(pi->master_sm_base_lid)); > + > + osm_vl15_post(sm->p_vl15, madw); > + > + return 0; > +} > -- > 1.6.0.3.517.g759a > > From hal.rosenstock at gmail.com Tue Oct 28 13:27:48 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 28 Oct 2008 16:27:48 -0400 Subject: [ofa-general] Re: [PATCH] opensm: notify master SM with trap 144 (not finished) In-Reply-To: <20081025200127.GR28713@sashak.voltaire.com> References: <20081025175201.GP28713@sashak.voltaire.com> <20081025200127.GR28713@sashak.voltaire.com> Message-ID: Sasha, On Sat, Oct 25, 2008 at 4:01 PM, Sasha Khapyorsky wrote: > > When entering standby state (after discovery) notify master SM about us. > In case when SMA doesn't support trap sending (specifically trap 144 on > PortInfo:CapabilityMask change - isSM bit, example is current ConnectX > firmware - 2.5.0) this is only way to notify the current master SM that > another SM is running. So is the trap sent unconditionally (since there's no way of knowing whether the SMA supports this or not) ? Is the only downside the extra Trap/TrapRepress when the SMA does support this ? Seems to me that the right fix is to the Connect-X SMA. Also, what happens once the Connect-X SMA is fixed ? Does this code persist ? -- Hal > See also bug#1183. 
> > Signed-off-by: Sasha Khapyorsky > --- > opensm/opensm/osm_state_mgr.c | 2 ++ > 1 files changed, 2 insertions(+), 0 deletions(-) > > diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c > index 174cee6..1576c42 100644 > --- a/opensm/opensm/osm_state_mgr.c > +++ b/opensm/opensm/osm_state_mgr.c > @@ -1142,6 +1142,8 @@ _repeat_discovery: > OSM_SM_SIGNAL_MASTER_OR_HIGHER_SM_DETECTED_DONE); > osm_log_msg_box(sm->p_log, OSM_LOG_VERBOSE, __FUNCTION__, > "ENTERING STANDBY STATE"); > + /* notify master SM about us */ > + osm_send_trap144(sm, 0); > return; > } > > -- > 1.6.0.3.517.g759a > > From ricklist at microway.com Tue Oct 28 13:39:02 2008 From: ricklist at microway.com (Rick Warner) Date: Tue, 28 Oct 2008 16:39:02 -0400 Subject: [ofa-general] poll CQ failed -2 with connectX In-Reply-To: <20081028184924.GA13206@mtls03> References: <200810271838.48510.ricklist@microway.com> <20081028184924.GA13206@mtls03> Message-ID: <200810281639.03422.ricklist@microway.com> Hi Eli, Thanks for the suggestion. Unfortunately, I have now reproduced this same problem on a group of 8 Xeon based systems as well, so the problem is not specific to the Opterons. Thanks, Rick On Tuesday 28 October 2008, Eli Cohen wrote: > On Mon, Oct 27, 2008 at 06:38:48PM -0400, Rick Warner wrote: > > Hi all, > > > > I am configuring an opteron cluster with connectX Infiniband. 
I have a > > problem that if I run one of the NAS tests, it works the first, and maybe > > 2nd time, but after that the jobs instantly fail with messages like this- > > > > [Rank 44][cm.c: line 860]poll CQ failed -2 > > [Rank 51][cm.c: line 860]poll CQ failed -2 > > [Rank 119][cm.c: line 860]poll CQ failed -2 > > [Rank 85][cm.c: line 860]poll CQ failed -2 > > [Rank 0][cm.c: line 860]poll CQ failed -2 > > [Rank 9][cm.c: line 860]poll CQ failed -2 > > [Rank 26][cm.c: line 860]poll CQ failed -2[Rank 43][cm.c: line 860] > > poll CQ failed -2 > > [Rank 94][cm.c: line 860]poll CQ failed -2 > > [Rank 111][cm.c: line 860]poll CQ failed -2 > > This error means that a CQE was polled which belongs to a none > existent QP. But, I do remember a case with an Opteron which > experienced the same problem and eventually it appeared that it was a > system problem that was resolved after a BIOS update. Can you check if > there is an update to your system's BIOS? > > > I can easily reproduce this with only 2 systems using a 16 process LU > > job, class B. > > > > Here are the configs I've tried- > > Suse 11 with distro provided IB driver and libraries,etc, using mvapich > > as provided by ohio state > > Suse 11 with distro driver, using OFED 1.3.1 libraries and mvapich > > Suse 10.3 with OFED 1.3.1, OFED 1.2.5.4, and OFED 1.4rc3 > > > > They all have the same basic problem. I think one of them reported > > "Error polling CQ" instead of "poll CQ failed". > > > > If I replace the connectX cards with regular DDR cards the problem goes > > away. > > > > I'm getting quite stumped at this point and would appreciate any > > suggestions or patches. 
> > > > Thanks, > > Rick > > -- > > Richard Warner > > Lead Systems Integrator > > Microway, Inc > > (508)732-5517 > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > > To unsubscribe, please visit > > http://openib.org/mailman/listinfo/openib-general -- Richard Warner Lead Systems Integrator Microway, Inc (508)732-5517 From rdreier at cisco.com Tue Oct 28 14:10:26 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 28 Oct 2008 14:10:26 -0700 Subject: [ofa-general] Re: [PATCH] IB/sysfs: Add port_xmit_wait counter. In-Reply-To: <4905F298.8030102@mellanox.co.il> (Vladimir Sokolovsky's message of "Mon, 27 Oct 2008 18:55:52 +0200") References: <20081026100202.GA15179@mellanox.co.il> <4905F298.8030102@mellanox.co.il> Message-ID: > The _counter field is ignored by show_pma_counter, it will be relevant for > set_pma_counter, if we are going to add it. I don't think we plan to have a set method for PMA counters... and if we did, setting the _counter value to 0 for port_xmit_wait would lead to incorrectly clearing symbol_error when someone tried to clear port_xmit_wait, right? - R. From dotanba at gmail.com Tue Oct 28 14:16:58 2008 From: dotanba at gmail.com (Dotan Barak) Date: Tue, 28 Oct 2008 23:16:58 +0200 Subject: [ofa-general] Question about ibv_asyncwatch In-Reply-To: <92eddfb50810280607t4135d701p9ed16b3cb23023d8@mail.gmail.com> References: <92eddfb50810280607t4135d701p9ed16b3cb23023d8@mail.gmail.com> Message-ID: <4907814A.1090100@gmail.com> Karthik Gopalakrishnan wrote: > Hi Folks. > > I have written a standalone program that calls > 'ibv_get_async_event()'. I want to know if that program can get async > events about errors on QPs (IBV_EVENT_PATH_MIG_ERR for example) that > are created by a different process (say some MPI Program). 
> > I also see a utility called 'ibv_asyncwatch' that is shipped as part > of OFED that seems to do something similar. I will be grateful if > someone could throw more light about what it does and point me to its > source. > Hi. ibv_get_async_event can get: * unaffiliated events: events which are not related to a specific object (for example: port/HCA events) * affiliated events: events which are related to a specific object (QP/CQ/SRQ) of the same process. You cannot get affiliated events that were created in other processes. Dotan From DavidRobb at comsci.co.uk Tue Oct 28 14:18:40 2008 From: DavidRobb at comsci.co.uk (David Robb) Date: Tue, 28 Oct 2008 21:18:40 +0000 Subject: [ofa-general] Poor Performance of OpenIB with small packets c.f. Gigabit Ethernet Message-ID: <490781B0.9040105@comsci.co.uk> We have a data logging application that exhibits poor performance when operated using TCP/IP sockets and IPoIB. With small message sizes ~ 64 bytes, the performance values for our application are OFED 1.2 IPoIB: 2.81MB/s OFED 1.3 IPoIB: 1.37MB/s GB Ethernet: 5.38MB/s It is not until the message sizes reach 16K or so that the Infiniband starts to overtake the Ethernet. Are these values as expected? What further tests could I run to investigate the problem? Are there any settings and or device configuration that we can tweak to improve the small message performance? We are running RH-EL Linux and using Mellanox HCAs and switches. We recently upgrade to OFED 1.3 and have upgraded the HCA firmware to the latest 1.2 version. 
Many thanks for any help

Regards
David Robb

From rdreier at cisco.com Tue Oct 28 14:26:05 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 28 Oct 2008 14:26:05 -0700
Subject: [ofa-general] [PATCH] libmlx4: Re-calculate number of inline segments
In-Reply-To: <48FC0628.3010801@sgi.com> (Vincent Rizza's message of "Mon, 20 Oct 2008 15:16:40 +1100")
References: <48FC0628.3010801@sgi.com>
Message-ID:

> Supplying an ibv_qp_cap.max_inline_data value of 460 for mlx4_create_qp
> was getting back ENOMEM when the max should have been 928. Tracked the bug
> to the inline segment calculation. Here's the fix.

Any more information about what the bug really is here, or a test case? As it stands I don't see anything wrong in theory or practice -- i.e. all my tests work and I don't see why your patch makes any difference in the value that ends up being calculated.

 - R.

From rdreier at cisco.com Tue Oct 28 14:31:33 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 28 Oct 2008 14:31:33 -0700
Subject: [ofa-general] OOM problem with ib_ipoib?
In-Reply-To: <490083D0.5000807@ec.gc.ca> (John Marshall's message of "Thu, 23 Oct 2008 14:01:52 +0000")
References: <48FF6DFA.9080409@ec.gc.ca> <48FFA62D.3030305@ec.gc.ca> <490083D0.5000807@ec.gc.ca>
Message-ID:

> MemTotal: 33274492 kB
...
> LowTotal: 638684 kB

It looks as if you have a box with 32G of RAM running a 32-bit kernel, which means low (direct kernel-mapped) memory is extremely tight. IPoIB connected mode ties up a significant amount of memory in the receive queue -- perhaps around 64M, which is 10% of low memory for you. So loading IPoIB may push you past the tipping point where things really break easily.

I'm not surprised that you run into memory management problems with such a system -- 32-bit kernels really have a hard time coping with such an imbalance between total memory and low memory.

The simplest solution would probably be to switch to a 64-bit kernel -- note that you don't have to change any userspace, just use a 64-bit kernel. - R. From chien.tin.tung at intel.com Tue Oct 28 14:35:04 2008 From: chien.tin.tung at intel.com (Chien Tung) Date: Tue, 28 Oct 2008 16:35:04 -0500 Subject: [ofa-general] [PATCH 1/2] RDMA/nes: Correct handling of PBL resources Message-ID: <20081028213504.GA6296@ctung-MOBL> From: Chien Tung RDMA/nes: Correct handling of PBL resources. * Roll back allocated structures on failures. * Use GFP_ATOMIC instead of GFP_KERNEL since we are holding a lock. * Acquire nesadapter->pbl_lock when modifying PBL counters. * Decrement PBL counters on deallocation. Signed-off-by: Chien Tung -- drivers/infiniband/hw/nes/nes_verbs.c | 44 ++++++++++++++++++++++++-------- 1 files changed, 33 insertions(+), 11 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c index 932e56f..f9b37b3 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.c +++ b/drivers/infiniband/hw/nes/nes_verbs.c @@ -349,7 +349,7 @@ static struct ib_fmr *nes_alloc_fmr(struct ib_pd *ibpd, if (nesfmr->nesmr.pbls_used > nesadapter->free_4kpbl) { spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); ret = -ENOMEM; - goto failed_vpbl_alloc; + goto failed_vpbl_avail; } else { nesadapter->free_4kpbl -= nesfmr->nesmr.pbls_used; } @@ -357,7 +357,7 @@ static struct ib_fmr *nes_alloc_fmr(struct ib_pd *ibpd, if (nesfmr->nesmr.pbls_used > nesadapter->free_256pbl) { spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); ret = -ENOMEM; - goto failed_vpbl_alloc; + goto failed_vpbl_avail; } else { nesadapter->free_256pbl -= nesfmr->nesmr.pbls_used; } @@ -391,14 +391,14 @@ static struct ib_fmr *nes_alloc_fmr(struct ib_pd *ibpd, goto failed_vpbl_alloc; } - nesfmr->root_vpbl.leaf_vpbl = kzalloc(sizeof(*nesfmr->root_vpbl.leaf_vpbl)*1024, GFP_KERNEL); + nesfmr->leaf_pbl_cnt = nesfmr->nesmr.pbls_used-1; + nesfmr->root_vpbl.leaf_vpbl = 
kzalloc(sizeof(*nesfmr->root_vpbl.leaf_vpbl)*1024, GFP_ATOMIC); if (!nesfmr->root_vpbl.leaf_vpbl) { spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); ret = -ENOMEM; goto failed_leaf_vpbl_alloc; } - nesfmr->leaf_pbl_cnt = nesfmr->nesmr.pbls_used-1; nes_debug(NES_DBG_MR, "two level pbl, root_vpbl.pbl_vbase=%p" " leaf_pbl_cnt=%d root_vpbl.leaf_vpbl=%p\n", nesfmr->root_vpbl.pbl_vbase, nesfmr->leaf_pbl_cnt, nesfmr->root_vpbl.leaf_vpbl); @@ -519,6 +519,16 @@ static struct ib_fmr *nes_alloc_fmr(struct ib_pd *ibpd, nesfmr->root_vpbl.pbl_pbase); failed_vpbl_alloc: + if (nesfmr->nesmr.pbls_used != 0) { + spin_lock_irqsave(&nesadapter->pbl_lock, flags); + if (nesfmr->nesmr.pbl_4k) + nesadapter->free_4kpbl += nesfmr->nesmr.pbls_used; + else + nesadapter->free_256pbl += nesfmr->nesmr.pbls_used; + spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); + } + + failed_vpbl_avail: kfree(nesfmr); failed_fmr_alloc: @@ -534,18 +544,14 @@ static struct ib_fmr *nes_alloc_fmr(struct ib_pd *ibpd, */ static int nes_dealloc_fmr(struct ib_fmr *ibfmr) { + unsigned long flags; struct nes_mr *nesmr = to_nesmr_from_ibfmr(ibfmr); struct nes_fmr *nesfmr = to_nesfmr(nesmr); struct nes_vnic *nesvnic = to_nesvnic(ibfmr->device); struct nes_device *nesdev = nesvnic->nesdev; - struct nes_mr temp_nesmr = *nesmr; + struct nes_adapter *nesadapter = nesdev->nesadapter; int i = 0; - temp_nesmr.ibmw.device = ibfmr->device; - temp_nesmr.ibmw.pd = ibfmr->pd; - temp_nesmr.ibmw.rkey = ibfmr->rkey; - temp_nesmr.ibmw.uobject = NULL; - /* free the resources */ if (nesfmr->leaf_pbl_cnt == 0) { /* single PBL case */ @@ -561,8 +567,24 @@ static int nes_dealloc_fmr(struct ib_fmr *ibfmr) pci_free_consistent(nesdev->pcidev, 8192, nesfmr->root_vpbl.pbl_vbase, nesfmr->root_vpbl.pbl_pbase); } + nesmr->ibmw.device = ibfmr->device; + nesmr->ibmw.pd = ibfmr->pd; + nesmr->ibmw.rkey = ibfmr->rkey; + nesmr->ibmw.uobject = NULL; + + if (nesfmr->nesmr.pbls_used != 0) { + spin_lock_irqsave(&nesadapter->pbl_lock, flags); + if 
(nesfmr->nesmr.pbl_4k) { + nesadapter->free_4kpbl += nesfmr->nesmr.pbls_used; + BUG_ON(nesadapter->free_4kpbl > nesadapter->max_4kpbl); + } else { + nesadapter->free_256pbl += nesfmr->nesmr.pbls_used; + BUG_ON(nesadapter->free_256pbl > nesadapter->max_256pbl); + } + spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); + } - return nes_dealloc_mw(&temp_nesmr.ibmw); + return nes_dealloc_mw(&nesmr->ibmw); } From chien.tin.tung at intel.com Tue Oct 28 14:35:07 2008 From: chien.tin.tung at intel.com (Chien Tung) Date: Tue, 28 Oct 2008 16:35:07 -0500 Subject: [ofa-general] [PATCH 2/2] RDMA/nes: Change CQ allocation scheme for performance applications Message-ID: <20081028213507.GA5680@ctung-MOBL> From: Vadim Makhervaks RDMA/nes: New CQ allocation scheme for performance applications. Change CQ allocation scheme for MCRQ applications. Signed-off-by: Vadim Makhervaks Signed-off-by: Chien Tung -- drivers/infiniband/hw/nes/nes_verbs.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c index f9b37b3..51cb1b5 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.c +++ b/drivers/infiniband/hw/nes/nes_verbs.c @@ -1617,7 +1617,7 @@ static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries, nes_ucontext->mcrqf = req.mcrqf; if (nes_ucontext->mcrqf) { if (nes_ucontext->mcrqf & 0x80000000) - nescq->hw_cq.cq_number = nesvnic->nic.qp_id + 12 + (nes_ucontext->mcrqf & 0xf) - 1; + nescq->hw_cq.cq_number = nesvnic->nic.qp_id + 28 + 2*((nes_ucontext->mcrqf & 0xf) - 1); else if (nes_ucontext->mcrqf & 0x40000000) nescq->hw_cq.cq_number = nes_ucontext->mcrqf & 0xffff; else From chien.tin.tung at intel.com Tue Oct 28 14:35:10 2008 From: chien.tin.tung at intel.com (Chien Tung) Date: Tue, 28 Oct 2008 16:35:10 -0500 Subject: [ofa-general] [PATCH] RDMA/nes: Mitigate compatibility issue regarding PCI write credits Message-ID: <20081028213510.GA6532@ctung-MOBL> From: Vadim Makhervaks RDMA/nes: 
New CQ allocation scheme for performance applications. Change CQ allocation scheme for MCRQ applications. Signed-off-by: Vadim Makhervaks Signed-off-by: Chien Tung -- drivers/infiniband/hw/nes/nes_verbs.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c index f9b37b3..51cb1b5 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.c +++ b/drivers/infiniband/hw/nes/nes_verbs.c @@ -1617,7 +1617,7 @@ static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries, nes_ucontext->mcrqf = req.mcrqf; if (nes_ucontext->mcrqf) { if (nes_ucontext->mcrqf & 0x80000000) - nescq->hw_cq.cq_number = nesvnic->nic.qp_id + 12 + (nes_ucontext->mcrqf & 0xf) - 1; + nescq->hw_cq.cq_number = nesvnic->nic.qp_id + 28 + 2*((nes_ucontext->mcrqf & 0xf) - 1); else if (nes_ucontext->mcrqf & 0x40000000) nescq->hw_cq.cq_number = nes_ucontext->mcrqf & 0xffff; else From rdreier at cisco.com Tue Oct 28 14:37:23 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 28 Oct 2008 14:37:23 -0700 Subject: [ofa-general] Re: [PATCH] RDMA/nes: Mitigate compatibility issue regarding PCI write credits In-Reply-To: <20081028213510.GA6532@ctung-MOBL> (Chien Tung's message of "Tue, 28 Oct 2008 16:35:10 -0500") References: <20081028213510.GA6532@ctung-MOBL> Message-ID: This seems to be a duplicate of this patch: > RDMA/nes: New CQ allocation scheme for performance applications. From rdreier at cisco.com Tue Oct 28 14:39:07 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 28 Oct 2008 14:39:07 -0700 Subject: [ofa-general] Re: [PATCH 2/2] RDMA/nes: Change CQ allocation scheme for performance applications In-Reply-To: <20081028213507.GA5680@ctung-MOBL> (Chien Tung's message of "Tue, 28 Oct 2008 16:35:07 -0500") References: <20081028213507.GA5680@ctung-MOBL> Message-ID: So this is an enhancement, or a fix? Seems like something that can wait for 2.6.29 to me. 
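For reference, the arithmetic changed by the [PATCH 2/2] hunk above can be written out in standalone C. This is only a sketch: the helper names are hypothetical, the unsigned 32-bit types are an assumption, and only the two expressions (the branch taken when bit 0x80000000 of mcrqf is set) come from the diff. It says nothing about what the resulting CQ numbers mean to the hardware.

```c
#include <stdint.h>

/* Hypothetical helper: CQ number as computed before the patch. */
static inline uint32_t mcrq_cq_number_old(uint32_t nic_qp_id, uint32_t mcrqf)
{
	return nic_qp_id + 12 + (mcrqf & 0xf) - 1;
}

/* Hypothetical helper: CQ number as computed after the patch. */
static inline uint32_t mcrq_cq_number_new(uint32_t nic_qp_id, uint32_t mcrqf)
{
	return nic_qp_id + 28 + 2 * ((mcrqf & 0xf) - 1);
}
```

With the low four bits of mcrqf equal to 1, the old expression gives qp_id + 12 and the new one gives qp_id + 28. Each further increment of those bits moves the old result by one CQ and the new result by two.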
[A better changelog to make this clearer wouldn't be a problem for me either ;) What is an MCRQ application?] - R. From rdreier at cisco.com Tue Oct 28 14:41:18 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 28 Oct 2008 14:41:18 -0700 Subject: [ofa-general] Re: [PATCH 1/2] RDMA/nes: Correct handling of PBL resources In-Reply-To: <20081028213504.GA6296@ctung-MOBL> (Chien Tung's message of "Tue, 28 Oct 2008 16:35:04 -0500") References: <20081028213504.GA6296@ctung-MOBL> Message-ID: > + if (nesfmr->nesmr.pbls_used != 0) { > + spin_lock_irqsave(&nesadapter->pbl_lock, flags); > + if (nesfmr->nesmr.pbl_4k) { > + nesadapter->free_4kpbl += nesfmr->nesmr.pbls_used; > + BUG_ON(nesadapter->free_4kpbl > nesadapter->max_4kpbl); > + } else { > + nesadapter->free_256pbl += nesfmr->nesmr.pbls_used; > + BUG_ON(nesadapter->free_256pbl > nesadapter->max_256pbl); > + } > + spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); Can we make these WARN_ON instead of BUG_ON? Killing the machine just because of a nes driver bug is kind of rude, and it reduces the chance of actually getting the debug output. - R. From gopalakk at cse.ohio-state.edu Tue Oct 28 17:16:07 2008 From: gopalakk at cse.ohio-state.edu (Karthik Gopalakrishnan) Date: Tue, 28 Oct 2008 20:16:07 -0400 Subject: [ofa-general] Question about ibv_asyncwatch In-Reply-To: <4907814A.1090100@gmail.com> References: <92eddfb50810280607t4135d701p9ed16b3cb23023d8@mail.gmail.com> <4907814A.1090100@gmail.com> Message-ID: <92eddfb50810281716y1ce4ff53u1e2192e16d40e687@mail.gmail.com> Hmmm. That makes sense. Thank You very much. Regards, Karthik On 10/28/08, Dotan Barak wrote: > Karthik Gopalakrishnan wrote: > > Hi Folks. > > > > I have written a standalone program that calls > > 'ibv_get_async_event()'. I want to know if that program can get async > > events about errors on QPs (IBV_EVENT_PATH_MIG_ERR for example) that > > are created by a different process (say some MPI Program). 
> > > > I also see a utility called 'ibv_asyncwatch' that is shipped as part > > of OFED that seems to do something similar. I will be grateful if > > someone could throw more light about what it does and point me to its > > source. > > > > > Hi. > > ibv_get_async_event can get: > * unaffiliated events: events which are not related to a specific object > (for example: port/HCA events) > * affiliated events: events which are related to a specific object > (QP/CQ/SRQ) of the same process. > > You cannot get affiliated events that were created in other processes. > > > Dotan > From joe at perches.com Tue Oct 28 17:16:14 2008 From: joe at perches.com (Joe Perches) Date: Tue, 28 Oct 2008 17:16:14 -0700 Subject: [ofa-general] ***SPAM*** [PATCH] infiniband: Add struct in6_addr addr to union ib_gib In-Reply-To: <1225236128.5269.240.camel@localhost> References: <1225229901.11483.58.camel@brick> <1225234963.5269.228.camel@localhost> <1225236128.5269.240.camel@localhost> Message-ID: <1225239374.5269.251.camel@localhost> ib_gid's can be print'd using the new %p6 facility Signed-off-by: Joe Perches diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 936e333..464ed9d 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -47,6 +47,7 @@ #include #include #include +#include #include #include @@ -57,6 +58,7 @@ union ib_gid { __be64 subnet_prefix; __be64 interface_id; } global; + struct in6_addr addr; }; enum rdma_node_type { From chien.tin.tung at intel.com Tue Oct 28 17:28:53 2008 From: chien.tin.tung at intel.com (Chien Tung) Date: Tue, 28 Oct 2008 19:28:53 -0500 Subject: [ofa-general] [PATCH v2] RDMA/nes: Mitigate compatibility issue regarding PCI write credits Message-ID: <20081029002853.GA3212@ctung-MOBL> From: Chien Tung RDMA/nes: Mitigate compatibility issue regarding PCI write credits. Under heavy load, there is an compatibility issue regarding PCI write credits with certain chipsets. 
It can be mitigated by limiting read requests to 256 Bytes. This workaround is always enabled for Tbird2 on Gladius. Add a driver parameter to enable workaround for non-Gladius cards. Signed-off-by: Chien Tung -- Roland, Sorry, redirected the wrong patch in my mail script. :wq drivers/infiniband/hw/nes/nes.c | 17 +++++++++++++++++ drivers/infiniband/hw/nes/nes_hw.h | 1 + 2 files changed, 18 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes.c b/drivers/infiniband/hw/nes/nes.c index a2b04d6..a2eb84b 100644 --- a/drivers/infiniband/hw/nes/nes.c +++ b/drivers/infiniband/hw/nes/nes.c @@ -95,6 +95,10 @@ unsigned int wqm_quanta = 0x10000; module_param(wqm_quanta, int, 0644); MODULE_PARM_DESC(wqm_quanta, "WQM quanta"); +static unsigned int limit_maxrdreqsz; +module_param(limit_maxrdreqsz, int, 0644); +MODULE_PARM_DESC(limit_maxrdreqsz, "Limit max read request size to 256B"); + LIST_HEAD(nes_adapter_list); static LIST_HEAD(nes_dev_list); @@ -445,6 +449,7 @@ static int __devinit nes_probe(struct pci_dev *pcidev, const struct pci_device_i struct nes_vnic *nesvnic = NULL; void __iomem *mmio_regs = NULL; u8 hw_rev; + u16 maxrdreqword; assert(pcidev != NULL); assert(ent != NULL); @@ -588,6 +593,18 @@ static int __devinit nes_probe(struct pci_dev *pcidev, const struct pci_device_i nesdev->nesadapter->port_count; } + if ((limit_maxrdreqsz) || + ((nesdev->nesadapter->phy_type[0] == NES_PHY_TYPE_GLADIUS) && + (hw_rev == NE020_REV1))) { + nes_debug(NES_DBG_INIT, + "Set max Read Request Size to 256 bytes\n"); + pci_read_config_word(pcidev, 0x68, &maxrdreqword); + /* set bits 12-14 to 001b = 256 bytes */ + maxrdreqword &= 0x8fff; + maxrdreqword |= 0x1000; + pci_write_config_word(pcidev, 0x68, maxrdreqword); + } + tasklet_init(&nesdev->dpc_tasklet, nes_dpc, (unsigned long)nesdev); /* bring up the Control QP */ diff --git a/drivers/infiniband/hw/nes/nes_hw.h b/drivers/infiniband/hw/nes/nes_hw.h index 610b9d8..bc0b4de 100644 --- a/drivers/infiniband/hw/nes/nes_hw.h 
+++ b/drivers/infiniband/hw/nes/nes_hw.h @@ -40,6 +40,7 @@ #define NES_PHY_TYPE_ARGUS 4 #define NES_PHY_TYPE_PUMA_1G 5 #define NES_PHY_TYPE_PUMA_10G 6 +#define NES_PHY_TYPE_GLADIUS 7 #define NES_MULTICAST_PF_MAX 8 From rdreier at cisco.com Tue Oct 28 18:13:48 2008 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 28 Oct 2008 18:13:48 -0700 Subject: [ofa-general] Re: [PATCH v2] RDMA/nes: Mitigate compatibility issue regarding PCI write credits In-Reply-To: <20081029002853.GA3212@ctung-MOBL> (Chien Tung's message of "Tue, 28 Oct 2008 19:28:53 -0500") References: <20081029002853.GA3212@ctung-MOBL> Message-ID: > Under heavy load, there is an compatibility issue regarding PCI write > credits with certain chipsets. It can be mitigated by limiting read > requests to 256 Bytes. OK. > +module_param(limit_maxrdreqsz, int, 0644); type can be bool instead of int here? > + if ((limit_maxrdreqsz) || > + ((nesdev->nesadapter->phy_type[0] == NES_PHY_TYPE_GLADIUS) && > + (hw_rev == NE020_REV1))) { > + nes_debug(NES_DBG_INIT, This indentation is hard to read, because the then clause visually runs into the condition being tested. I generally align the follow-on lines to be just inside the opening ( of "if (". And there's no reason to put parentheses around limit_maxrdreqsz... > + pci_read_config_word(pcidev, 0x68, &maxrdreqword); > + /* set bits 12-14 to 001b = 256 bytes */ > + maxrdreqword &= 0x8fff; > + maxrdreqword |= 0x1000; > + pci_write_config_word(pcidev, 0x68, maxrdreqword); I would write this as below, using the standard pcie interfaces and also being defensive so as not to set the max read req to 256 if the BIOS/kernel had limited it to 128 already: if (pcie_get_readrq(pcidev) > 256) if (pcie_set_readrq(pcidev, 256)) { /* report error */ } - R. From chu11 at llnl.gov Tue Oct 28 16:52:45 2008 From: chu11 at llnl.gov (Al Chu) Date: Tue, 28 Oct 2008 19:52:45 -0400 Subject: [ofa-general] [opensm] remove qos_max_vls config?? 
Message-ID: <1225237965.3358.9.camel@whatsup> Hey Sasha, I was working on a different bug fix on the qos config parsing, when I noticed the qos_*max_vls fields aren't used anywhere. They seem to be parsed from the config, stored, and never used. Maybe it used to be what 'max_op_vls' is now used for? If there's still a purpose for it in the future, obviously no issue on leaving in there. Patch is attached to remove it everywhere I found it. Al -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-remove-max_vls-config.patch Type: application/mbox Size: 8788 bytes Desc: not available URL: From davem at davemloft.net Tue Oct 28 22:39:41 2008 From: davem at davemloft.net (David Miller) Date: Tue, 28 Oct 2008 22:39:41 -0700 (PDT) Subject: [ofa-general] Re: [PATCH] infiniband: Add struct in6_addr addr to union ib_gib In-Reply-To: <1225239374.5269.251.camel@localhost> References: <1225236128.5269.240.camel@localhost> <1225239374.5269.251.camel@localhost> Message-ID: <20081028.223941.18931166.davem@davemloft.net> From: Joe Perches Date: Tue, 28 Oct 2008 17:16:14 -0700 > ib_gid's can be print'd using the new %p6 facility > > Signed-off-by: Joe Perches Joe, please provide something relative to Harvey's patches so that this new union member gets passed into the %p6 uses. Thanks! From olga.shern at gmail.com Tue Oct 28 23:53:47 2008 From: olga.shern at gmail.com (Olga Shern (Voltaire)) Date: Wed, 29 Oct 2008 08:53:47 +0200 Subject: [ofa-general] ***SPAM*** Re: [ewg] OFED October 27 2008 meeting summary on OFED 1.4 status In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EADD002D1@mtlexch01.mtl.com> References: <5D49E7A8952DC44FB38C38FA0D758EADD002D1@mtlexch01.mtl.com> Message-ID: > 2. 
We had a discussion on NFS-RDMA since both RHEL 5.1 and SLES10 SP2 > backports are not working well > We had a debate - do we take it out of OFED since it is not working on > the distros > Leave it in: We can have bug fixes for 1.4.1, and give customers a > platform to play with > Take it out: If someone will try it on the distro experience can be > problematic > Decision: We will leave it for 2.6.27 kernel only. > All testing should be done on this kernel mainly to see that basic > functionality is working We have tested NFSoRDMA on 2.6.27 and didn't see any of the issues that we see on Distros. So basic functionality is working From joe at perches.com Tue Oct 28 23:58:40 2008 From: joe at perches.com (Joe Perches) Date: Tue, 28 Oct 2008 23:58:40 -0700 Subject: [ofa-general] Re: [PATCH] infiniband: Add struct in6_addr addr to union ib_gib In-Reply-To: <20081028.223941.18931166.davem@davemloft.net> References: <1225236128.5269.240.camel@localhost> <1225239374.5269.251.camel@localhost> <20081028.223941.18931166.davem@davemloft.net> Message-ID: <1225263521.5269.273.camel@localhost> On Tue, 2008-10-28 at 22:39 -0700, David Miller wrote: > From: Joe Perches > Date: Tue, 28 Oct 2008 17:16:14 -0700 > > ib_gid's can be print'd using the new %p6 facility > > Signed-off-by: Joe Perches > Joe, please provide something relative to Harvey's patches > so that this new union member gets passed into the %p6 > uses. Sure. After Harvey's patches show up in http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git;a=summary I'll make a patchset including this union ib_gid addition as well as changes to drivers/infiniband/ Roland, in a separate question, are the infiniband maintainers also the maintainers of include/rdma/? If "F: patterns" ever gets accepted into MAINTAINERS, should include/rdma/ be listed under infiniband? You and Sean Hefty seem to be the primary authors. 
$ git log include/rdma/ | grep Author: | \ sort | uniq -c | sort -nr | head -5 32 Author: Sean Hefty 23 Author: Roland Dreier 6 Author: Michael S. Tsirkin 5 Author: Or Gerlitz 4 Author: Jack Morgenstein From tziporet at dev.mellanox.co.il Wed Oct 29 00:07:51 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Wed, 29 Oct 2008 09:07:51 +0200 Subject: [ofa-general] Re: ***SPAM*** Re: [ewg] OFED October 27 2008 meeting summary on OFED 1.4 status In-Reply-To: References: <5D49E7A8952DC44FB38C38FA0D758EADD002D1@mtlexch01.mtl.com> Message-ID: <49080BC7.7030205@mellanox.co.il> Olga Shern (Voltaire) wrote: > > We have tested NFSoRDMA on 2.6.27 and didn't see any of the issues > that we see on Distros. > So basic functionality is working > Thanks Tziporet From tziporet at dev.mellanox.co.il Wed Oct 29 00:13:53 2008 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Wed, 29 Oct 2008 09:13:53 +0200 Subject: [ewg] Re: [ofa-general] OFED October 27 2008 meeting summary on OFED 1.4 status In-Reply-To: <49073A15.7050309@nasa.gov> References: <5D49E7A8952DC44FB38C38FA0D758EADD002D1@mtlexch01.mtl.com> <49073A15.7050309@nasa.gov> Message-ID: <49080D31.1000704@mellanox.co.il> Jeff Becker wrote: > > > I think I can get NFS_RDMA ready on 2.6.27 (it's already close) and > 2.6.26 for rc4. > Yes - please focus on this >> 2. We had a discussion on NFS-RDMA since both RHEL 5.1 and SLES10 SP2 >> backports are not working well >> We had a debate - do we take it out of OFED since it is not working on >> the distros >> Leave it in: We can have bug fixes for 1.4.1, and give customers a >> platform to play with >> Take it out: If someone will try it on the distro experience can be >> problematic >> Decision: We will leave it for 2.6.27 kernel only. >> All testing should be done on this kernel mainly to see that basic >> functionality is working >> >> > Since I wasn't at the meeting, I'd like to clarify. 
I am currently > working on the distro backports, and plan to have these ready for 1.4.1. > Assuming I get this done, can we include these in 1.4.1? > Sure. If you will have any fixes till 1.4 release we will take them too. The idea to leave NFS-RDMA in OFED 1.4 is to enable later dot releases with such fixes. Tziporet From eli at dev.mellanox.co.il Wed Oct 29 00:39:22 2008 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Wed, 29 Oct 2008 09:39:22 +0200 Subject: [ofa-general] poll CQ failed -2 with connectX In-Reply-To: <200810281639.03422.ricklist@microway.com> References: <200810271838.48510.ricklist@microway.com> <20081028184924.GA13206@mtls03> <200810281639.03422.ricklist@microway.com> Message-ID: <20081029073922.GA14691@mtls03> On Tue, Oct 28, 2008 at 04:39:02PM -0400, Rick Warner wrote: > > Thanks for the suggestion. Unfortunately, I have now reproduced this same > problem on a group of 8 Xeon based systems as well, so the problem is not > specific to the Opterons. > Do you have another, simpler test, that can demonstrate this problem? If not, please send instructions how to reproduce and whatever files needed to reproduce the problem. Alternatively, can you arrange for remote login to these systems? From ogerlitz at voltaire.com Wed Oct 29 01:31:30 2008 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Wed, 29 Oct 2008 10:31:30 +0200 Subject: [ofa-general] Re: [PATCH] infiniband: Add struct in6_addr addr to union ib_gib In-Reply-To: <1225239374.5269.251.camel@localhost> References: <1225229901.11483.58.camel@brick> <1225234963.5269.228.camel@localhost> <1225236128.5269.240.camel@localhost> <1225239374.5269.251.camel@localhost> Message-ID: <49081F62.7090508@voltaire.com> Joe Perches wrote: > ib_gid's can be print'd using the new %p6 facility Joe, Harvey One thing which I'd like to find an easy way to get rid of, is the wrong printing of IPoIB devices HW address under bonding. 
The bonding driver uses the DECLARE_MAC_BUF and print_mac() way so a possible solution would be to enhance these two to work with variable HW address lengths, but this would have the price of changing the 600 or so references to print_mac in the code, etc. Another possible solution would be to just change the bonding driver to use some other macro/api, but I believe that the posted patches could not serve for that purpose as is, since the IPoIB HW address is actually more than a GID: it's 20 bytes whose lower 16 are a GID. Printing only the GID portion of the HW address would be better than the current situation, so we can take that approach as well. Or. From davem at davemloft.net Wed Oct 29 01:39:02 2008 From: davem at davemloft.net (David Miller) Date: Wed, 29 Oct 2008 01:39:02 -0700 (PDT) Subject: [ofa-general] Re: [PATCH] infiniband: Add struct in6_addr addr to union ib_gib In-Reply-To: <1225263521.5269.273.camel@localhost> References: <1225239374.5269.251.camel@localhost> <20081028.223941.18931166.davem@davemloft.net> <1225263521.5269.273.camel@localhost> Message-ID: <20081029.013902.130933950.davem@davemloft.net> From: Joe Perches Date: Tue, 28 Oct 2008 23:58:40 -0700 > On Tue, 2008-10-28 at 22:39 -0700, David Miller wrote: > > From: Joe Perches > > Date: Tue, 28 Oct 2008 17:16:14 -0700 > > > ib_gid's can be print'd using the new %p6 facility > > > Signed-off-by: Joe Perches > > Joe, please provide something relative to Harvey's patches > > so that this new union member gets passed into the %p6 > > uses. > > Sure. After Harvey's patches show up in > http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git;a=summary They should be there now.
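The union member proposed in this thread can be illustrated outside the kernel. What follows is a minimal sketch under stated assumptions: ib_gid_sketch and gid_to_str are hypothetical names, uint64_t stands in for the kernel's __be64, and inet_ntop() is used as a userspace substitute for the kernel's %p6 format. It only shows that a 16-byte GID and a struct in6_addr can share storage, so a GID can be handed to any routine that formats an IPv6 address.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>

/* Hypothetical userspace stand-in for the kernel's union ib_gid. */
union ib_gid_sketch {
	uint8_t raw[16];
	struct {
		uint64_t subnet_prefix;	/* stored big-endian (__be64 in the kernel) */
		uint64_t interface_id;	/* stored big-endian (__be64 in the kernel) */
	} global;
	struct in6_addr addr;		/* the member the patch adds */
};

/* Format a GID as IPv6 text; returns buf, or NULL if inet_ntop() fails. */
static inline const char *gid_to_str(const union ib_gid_sketch *gid,
				     char *buf, socklen_t len)
{
	assert(gid != NULL && buf != NULL);
	return inet_ntop(AF_INET6, &gid->addr, buf, len);
}
```

A caller would pass a buffer of at least INET6_ADDRSTRLEN bytes. Note that this covers only the 16-byte GID, not the full 20-byte IPoIB hardware address that Or Gerlitz raises in the same thread.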
From amirv at mellanox.co.il Wed Oct 29 02:07:54 2008 From: amirv at mellanox.co.il (Amir Vadai) Date: Wed, 29 Oct 2008 11:07:54 +0200 Subject: [ofa-general] [PATCH] sdp: timeout when waiting for sdp_fin In-Reply-To: <> References: <> Message-ID: <1225271274-32559-1-git-send-email-amirv@mellanox.co.il> fixes BUG1305: https://bugs.openfabrics.org/show_bug.cgi?id=1305 Signed-off-by: Amir Vadai --- drivers/infiniband/ulp/sdp/sdp.h | 1 + drivers/infiniband/ulp/sdp/sdp_bcopy.c | 3 ++ drivers/infiniband/ulp/sdp/sdp_cma.c | 8 ++++- drivers/infiniband/ulp/sdp/sdp_main.c | 43 ++++++++++++++++++++------------ 4 files changed, 37 insertions(+), 18 deletions(-) diff --git a/drivers/infiniband/ulp/sdp/sdp.h b/drivers/infiniband/ulp/sdp/sdp.h index 8638422..0e7794e 100644 --- a/drivers/infiniband/ulp/sdp/sdp.h +++ b/drivers/infiniband/ulp/sdp/sdp.h @@ -75,6 +75,7 @@ extern int sdp_data_debug_level; #define SDP_ROUTE_TIMEOUT 1000 #define SDP_RETRY_COUNT 5 #define SDP_KEEPALIVE_TIME (120 * 60 * HZ) +#define SDP_FIN_WAIT_TIMEOUT (60 * HZ) #define SDP_TX_SIZE 0x40 #define SDP_RX_SIZE 0x40 diff --git a/drivers/infiniband/ulp/sdp/sdp_bcopy.c b/drivers/infiniband/ulp/sdp/sdp_bcopy.c index a2472e9..f1b3cb0 100644 --- a/drivers/infiniband/ulp/sdp/sdp_bcopy.c +++ b/drivers/infiniband/ulp/sdp/sdp_bcopy.c @@ -119,6 +119,9 @@ static void sdp_fin(struct sock *sk) /* Received a reply FIN - start Infiniband tear down */ sdp_dbg(sk, "%s: Starting Infiniband tear down sending DREQ\n", __func__); + + sdp_cancel_dreq_wait_timeout(sdp_sk(sk)); + sdp_exch_state(sk, TCPF_FIN_WAIT1, TCP_TIME_WAIT); if (sdp_sk(sk)->id) { diff --git a/drivers/infiniband/ulp/sdp/sdp_cma.c b/drivers/infiniband/ulp/sdp/sdp_cma.c index 6206835..64f9f38 100644 --- a/drivers/infiniband/ulp/sdp/sdp_cma.c +++ b/drivers/infiniband/ulp/sdp/sdp_cma.c @@ -498,8 +498,7 @@ int sdp_cma_handler(struct rdma_cm_id *id, struct rdma_cm_event *event) sdp_dbg(sk, "RDMA_CM_EVENT_DISCONNECTED\n"); if (sk->sk_state == TCP_LAST_ACK) { - if 
(sdp_sk(sk)->dreq_wait_timeout) - sdp_cancel_dreq_wait_timeout(sdp_sk(sk)); + sdp_cancel_dreq_wait_timeout(sdp_sk(sk)); sdp_exch_state(sk, TCPF_LAST_ACK, TCP_TIME_WAIT); @@ -510,6 +509,11 @@ int sdp_cma_handler(struct rdma_cm_id *id, struct rdma_cm_event *event) rdma_disconnect(id); if (sk->sk_state != TCP_TIME_WAIT) { + if (sk->sk_state == TCP_CLOSE_WAIT) { + sdp_dbg(sk, "IB teardown while in TCP_CLOSE_WAIT " + "taking reference to let close() finish the work\n"); + sock_hold(sk, SOCK_REF_CM_TW); + } sdp_set_error(sk, EPIPE); rc = sdp_disconnected_handler(sk); } diff --git a/drivers/infiniband/ulp/sdp/sdp_main.c b/drivers/infiniband/ulp/sdp/sdp_main.c index 17e98bb..cbd1adb 100644 --- a/drivers/infiniband/ulp/sdp/sdp_main.c +++ b/drivers/infiniband/ulp/sdp/sdp_main.c @@ -443,6 +443,10 @@ done: static void sdp_send_disconnect(struct sock *sk) { + queue_delayed_work(sdp_workqueue, &sdp_sk(sk)->dreq_wait_work, + SDP_FIN_WAIT_TIMEOUT); + sdp_sk(sk)->dreq_wait_timeout = 1; + sdp_sk(sk)->sdp_disconnect = 1; sdp_post_sends(sdp_sk(sk), 0); } @@ -451,22 +455,19 @@ static void sdp_send_disconnect(struct sock *sk) * State processing on a close. 
* TCP_ESTABLISHED -> TCP_FIN_WAIT1 -> TCP_CLOSE */ - static int sdp_close_state(struct sock *sk) { - if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) - return 0; - - if (sk->sk_state == TCP_ESTABLISHED) + switch (sk->sk_state) { + case TCP_ESTABLISHED: sdp_exch_state(sk, TCPF_ESTABLISHED, TCP_FIN_WAIT1); - else if (sk->sk_state == TCP_CLOSE_WAIT) { + break; + case TCP_CLOSE_WAIT: sdp_exch_state(sk, TCPF_CLOSE_WAIT, TCP_LAST_ACK); - - sdp_sk(sk)->dreq_wait_timeout = 1; - queue_delayed_work(sdp_workqueue, &sdp_sk(sk)->dreq_wait_work, - TCP_FIN_TIMEOUT); - } else + break; + default: return 0; + } + return 1; } @@ -836,6 +837,11 @@ static int sdp_ioctl(struct sock *sk, int cmd, unsigned long arg) void sdp_cancel_dreq_wait_timeout(struct sdp_sock *ssk) { + if (!ssk->dreq_wait_timeout) + return; + + sdp_dbg(&ssk->isk.sk, "cancelling dreq wait timeout #####\n"); + ssk->dreq_wait_timeout = 0; cancel_delayed_work(&ssk->dreq_wait_work); atomic_dec(ssk->isk.sk.sk_prot->orphan_count); @@ -847,8 +853,7 @@ void sdp_destroy_work(struct work_struct *work) struct sock *sk = &ssk->isk.sk; sdp_dbg(sk, "%s: refcnt %d\n", __func__, atomic_read(&sk->sk_refcnt)); - if (ssk->dreq_wait_timeout) - sdp_cancel_dreq_wait_timeout(ssk); + sdp_cancel_dreq_wait_timeout(ssk); if (sk->sk_state == TCP_TIME_WAIT) sock_put(sk, SOCK_REF_CM_TW); @@ -868,15 +873,21 @@ void sdp_dreq_wait_timeout_work(struct work_struct *work) lock_sock(sk); - if (!sdp_sk(sk)->dreq_wait_timeout) { + if (!sdp_sk(sk)->dreq_wait_timeout || + !((1 << sk->sk_state) & (TCPF_FIN_WAIT1 | TCPF_LAST_ACK))) { release_sock(sk); return; } - sdp_dbg(sk, "%s: timed out waiting for DREQ\n", __func__); + sdp_warn(sk, "timed out waiting for FIN/DREQ. 
" + "going into abortive close.\n"); sdp_sk(sk)->dreq_wait_timeout = 0; - sdp_exch_state(sk, TCPF_LAST_ACK, TCP_TIME_WAIT); + + if (sk->sk_state == TCP_FIN_WAIT1) + atomic_dec(ssk->isk.sk.sk_prot->orphan_count); + + sdp_exch_state(sk, TCPF_LAST_ACK | TCPF_FIN_WAIT1, TCP_TIME_WAIT); release_sock(sk); -- 1.5.3 From kliteyn at dev.mellanox.co.il Wed Oct 29 02:13:14 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 29 Oct 2008 11:13:14 +0200 Subject: [ofa-general] [opensm] remove qos_max_vls config?? In-Reply-To: <1225237965.3358.9.camel@whatsup> References: <1225237965.3358.9.camel@whatsup> Message-ID: <4908292A.40004@dev.mellanox.co.il> Al Chu wrote: > Hey Sasha, > > I was working on a different bug fix on the qos config parsing, when I > noticed the qos_*max_vls fields aren't used anywhere. They seem to be > parsed from the config, stored, and never used. Maybe it used to be > what 'max_op_vls' is now used for? I guess that the initial idea was to have an option to configure different operational VLs on different type of nodes in the subnet. The question is, does having such option make sense? -- Yevgeny > If there's still a purpose for it in the future, obviously no issue on > leaving in there. Patch is attached to remove it everywhere I found it. 
> > Al > > > > ------------------------------------------------------------------------ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From vlad at lists.openfabrics.org Wed Oct 29 03:19:35 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 29 Oct 2008 03:19:35 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20081029-0200 daily build status Message-ID: <20081029101935.87F58E60BE2@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed 
on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: From kliteyn at dev.mellanox.co.il Wed Oct 29 04:14:42 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 29 Oct 2008 13:14:42 +0200 Subject: [ofa-general] Re: [PATCH 1/2] opensm: replace switch's fwd_tbl with simple LFT In-Reply-To: <20081018234814.GU5528@sashak.voltaire.com> References: <48F7B3D7.3070004@dev.mellanox.co.il> <20081018234814.GU5528@sashak.voltaire.com> Message-ID: <490845A2.2060907@dev.mellanox.co.il> Sasha Khapyorsky wrote: > Hi Yevgeny, > > On 23:36 Thu 16 Oct , Yevgeny Kliteynik wrote: >> Replace the unnecessarily complex switch's forwarding table >> implementation with a simple LFT that is implemented as plain >> uint8_t array. >> >> Signed-off-by: Yevgeny Kliteynik >> --- > > [snip...] 
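The distinction discussed above (assertions for caller bugs, run-time checks for anything that arrives from the network) can be sketched with a plain uint8_t forwarding table. This is an illustration only, not the OpenSM code: the function names are hypothetical, and the two constants are assumed to match IB_LID_UCAST_END_HO and OSM_NO_PATH.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define LID_UCAST_END_HO 0xBFFF	/* assumed value of IB_LID_UCAST_END_HO */
#define NO_PATH 0xFF		/* assumed value of OSM_NO_PATH */
#define LFT_SIZE (LID_UCAST_END_HO + 1)

/*
 * Allocate an LFT with every entry set to NO_PATH. The capacity comes from
 * the switch, so a bad value is rejected at run time instead of asserted.
 * Returns NULL if the switch reports no capacity or if malloc() fails.
 */
static inline uint8_t *lft_alloc(uint16_t lin_cap_ho)
{
	uint8_t *lft;

	if (lin_cap_ho == 0)
		return NULL;
	lft = malloc(LFT_SIZE);
	if (lft != NULL)
		memset(lft, NO_PATH, LFT_SIZE);
	return lft;
}

/* Look up the output port for a LID; out-of-range LIDs map to NO_PATH. */
static inline uint8_t lft_get_port(const uint8_t *lft, uint16_t lid_ho)
{
	assert(lft != NULL);		/* caller bug, not network data */
	if (lid_ho > LID_UCAST_END_HO)
		return NO_PATH;
	return lft[lid_ho];
}
```

The table is always sized for the full unicast LID range, which is why no upper-bound check on the reported capacity is needed in this sketch.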
>
>> diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c
>> index 9bf76e0..bdfc7d0 100644
>> --- a/opensm/opensm/osm_switch.c
>> +++ b/opensm/opensm/osm_switch.c
>> @@ -97,9 +97,26 @@ osm_switch_init(IN osm_switch_t * const p_sw,
>>      p_sw->num_ports = num_ports;
>>      p_sw->need_update = 2;
>>
>> -    status = osm_fwd_tbl_init(&p_sw->fwd_tbl, p_si);
>> -    if (status != IB_SUCCESS)
>> +    /* Initiate the linear forwarding table */
>> +
>> +    if (!p_si->lin_cap) {
>> +        /* This switch does not support linear forwarding tables */
>> +        status = IB_UNSUPPORTED;
>>          goto Exit;
>> +    }
>> +
>> +    /* The capacity reported by the switch includes LID 0,
>> +       so add 1 to the end of the range here for this assert. */
>> +    CL_ASSERT(cl_ntoh16(p_si->lin_cap) <= IB_LID_UCAST_END_HO + 1);
>
> Maybe there should be run-time check (not sure since lin_cap is not
> really used in other places in the code), but not assertion - any bogus
> data received from network should not crash OpenSM.

I'm removing this.
Do we care that the lin_cap of the switch claims to support more than
IB_LID_UCAST_END_HO? Don't think so, so I agree - removing this.
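
[Editorial sketch, not part of the original thread.] The alternative Sasha mentions, a run-time check in place of the assertion, could look roughly like the code below. It is only an illustration under stated assumptions: `lft_usable_size` and `LID_UCAST_END_HO` are hypothetical names, the value 0xBFFF is assumed to match `IB_LID_UCAST_END_HO`, and the standard `ntohs()` stands in for `cl_ntoh16()`. The thread itself settles on dropping the check altogether.

```c
#include <arpa/inet.h>	/* ntohs(), htons() */
#include <stdint.h>

/* Assumed to match IB_LID_UCAST_END_HO (highest unicast LID, host order). */
#define LID_UCAST_END_HO 0xBFFFu

/*
 * Hypothetical helper, not an OpenSM function: convert the lin_cap field
 * (network byte order) into a usable table size. Returns 0 when the
 * switch reports no linear forwarding table, and clamps oversized values
 * instead of asserting, so bogus data from the network cannot abort the
 * process.
 */
static uint32_t lft_usable_size(uint16_t lin_cap_be)
{
	uint32_t cap = ntohs(lin_cap_be);

	if (cap > LID_UCAST_END_HO + 1u)
		cap = LID_UCAST_END_HO + 1u;
	return cap;
}
```

A caller would treat a return value of 0 the same way the patch treats `!p_si->lin_cap`, that is, as an unsupported switch.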
>> +
>> +    p_sw->lft = malloc(IB_LID_UCAST_END_HO + 1);
>> +    if (!p_sw->lft) {
>> +        status = IB_INSUFFICIENT_MEMORY;
>> +        goto Exit;
>> +    }
>> +
>> +    /* Initialize the table to OSM_NO_PATH, which is "invalid port" */
>> +    memset(p_sw->lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
>>
>>      p_sw->lft_buf = malloc(IB_LID_UCAST_END_HO + 1);
>>      if (!p_sw->lft_buf) {
>> @@ -138,7 +155,8 @@ void osm_switch_delete(IN OUT osm_switch_t ** const pp_sw)
>>
>>      osm_mcast_tbl_destroy(&p_sw->mcast_tbl);
>>      free(p_sw->p_prof);
>> -    osm_fwd_tbl_destroy(&p_sw->fwd_tbl);
>> +    if (p_sw->lft)
>> +        free(p_sw->lft);
>>      if (p_sw->lft_buf)
>>          free(p_sw->lft_buf);
>>      if (p_sw->hops) {
>> @@ -176,44 +194,36 @@ osm_switch_t *osm_switch_new(IN osm_node_t * const p_node,
>>  /**********************************************************************
>>   **********************************************************************/
>>  boolean_t
>> -osm_switch_get_fwd_tbl_block(IN const osm_switch_t * const p_sw,
>> -                 IN const uint32_t block_id,
>> -                 OUT uint8_t * const p_block)
>> +osm_switch_get_lft_block(IN const osm_switch_t * const p_sw,
>> +             IN const uint32_t block_id,
>> +             OUT uint8_t * const p_block)
>>  {
>>      uint16_t base_lid_ho;
>> -    uint16_t max_lid_ho;
>> -    uint16_t lid_ho;
>>      uint16_t block_top_lid_ho;
>> -    uint32_t lids_per_block;
>> -    osm_fwd_tbl_t *p_tbl;
>>      boolean_t return_flag = FALSE;
>>
>>      CL_ASSERT(p_sw);
>>      CL_ASSERT(p_block);
>>
>> -    p_tbl = osm_switch_get_fwd_tbl_ptr(p_sw);
>> -    max_lid_ho = p_sw->max_lid_ho;
>> -    lids_per_block = osm_fwd_tbl_get_lids_per_block(&p_sw->fwd_tbl);
>> -    base_lid_ho = (uint16_t) (block_id * lids_per_block);
>> +    base_lid_ho = (uint16_t) (block_id * IB_SMP_DATA_SIZE);
>>
>> -    if (base_lid_ho <= max_lid_ho) {
>> +    if (base_lid_ho <= p_sw->max_lid_ho) {
>>          /* Initialize LIDs in block to invalid port number. */
>>          memset(p_block, OSM_NO_PATH, IB_SMP_DATA_SIZE);
>>          /*
>>             Determine the range of LIDs we can return with this block.
>>           */
>>          block_top_lid_ho =
>> -            (uint16_t) (base_lid_ho + lids_per_block - 1);
>> -        if (block_top_lid_ho > max_lid_ho)
>> -            block_top_lid_ho = max_lid_ho;
>> +            (uint16_t) (base_lid_ho + IB_SMP_DATA_SIZE - 1);
>> +        if (block_top_lid_ho > p_sw->max_lid_ho)
>> +            block_top_lid_ho = p_sw->max_lid_ho;
>>
>>          /*
>>             Configure the forwarding table with the routing
>>             information for the specified block of LIDs.
>>           */
>> -        for (lid_ho = base_lid_ho; lid_ho <= block_top_lid_ho; lid_ho++)
>> -            p_block[lid_ho - base_lid_ho] =
>> -                osm_fwd_tbl_get(p_tbl, lid_ho);
>> +        memcpy(p_block, &(p_sw->lft[base_lid_ho]),
>> +               block_top_lid_ho - base_lid_ho + 1);
>
> Hmm, why not just
>
>     memcpy(p_block, &p_sw->lft[base_lid_ho], 64);
>
> ? And then no need initial memset()?

Well, I can really simplify this whole function to
something like this:

boolean_t
osm_switch_get_lft_block(IN const osm_switch_t * const p_sw,
             IN const uint16_t block_id,
             OUT uint8_t * const p_block)
{
    uint16_t base_lid_ho = block_id * IB_SMP_DATA_SIZE;

    CL_ASSERT(p_sw);
    CL_ASSERT(p_block);

    if (base_lid_ho > p_sw->max_lid_ho)
        return FALSE;

    memcpy(p_block, &(p_sw->lft[base_lid_ho]), IB_SMP_DATA_SIZE);
    return TRUE;
}

Patch shortly.

-- Yevgeny

> Sasha
>

From kliteyn at dev.mellanox.co.il Wed Oct 29 06:01:58 2008
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 29 Oct 2008 15:01:58 +0200
Subject: [ofa-general] [PATCH 2/2 v2] opensm: replace switch's fwd_tbl with simple LFT - remove obsolete files
Message-ID: <49085EC6.7060404@dev.mellanox.co.il>

Remove all the fwd_tbl files that became obsolete.
[v2 - no changes, just rebased] Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_fwd_tbl.h | 373 ------------------------------ opensm/include/opensm/osm_lin_fwd_tbl.h | 359 ---------------------------- opensm/include/opensm/osm_rand_fwd_tbl.h | 337 --------------------------- opensm/opensm/Makefile.am | 7 +- opensm/opensm/osm_fwd_tbl.c | 100 -------- opensm/opensm/osm_lin_fwd_tbl.c | 88 ------- 6 files changed, 2 insertions(+), 1262 deletions(-) delete mode 100644 opensm/include/opensm/osm_fwd_tbl.h delete mode 100644 opensm/include/opensm/osm_lin_fwd_tbl.h delete mode 100644 opensm/include/opensm/osm_rand_fwd_tbl.h delete mode 100644 opensm/opensm/osm_fwd_tbl.c delete mode 100644 opensm/opensm/osm_lin_fwd_tbl.c diff --git a/opensm/include/opensm/osm_fwd_tbl.h b/opensm/include/opensm/osm_fwd_tbl.h deleted file mode 100644 index 55e853f..0000000 --- a/opensm/include/opensm/osm_fwd_tbl.h +++ /dev/null @@ -1,373 +0,0 @@ -/* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. - * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. 
- * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - */ - -/* - * Abstract: - * Declaration of osm_fwd_tbl_t. - * This object represents a unicast forwarding table. - * This object is part of the OpenSM family of objects. - */ - -#ifndef _OSM_FWD_TBL_H_ -#define _OSM_FWD_TBL_H_ - -#include -#include -#include -#include - -#ifdef __cplusplus -# define BEGIN_C_DECLS extern "C" { -# define END_C_DECLS } -#else /* !__cplusplus */ -# define BEGIN_C_DECLS -# define END_C_DECLS -#endif /* __cplusplus */ - -BEGIN_C_DECLS -/****h* OpenSM/Forwarding Table -* NAME -* Forwarding Table -* -* DESCRIPTION -* The Forwarding Table objects encapsulate the information -* needed by the OpenSM to manage forwarding tables. The OpenSM -* allocates one Forwarding Table object per switch in the -* IBA subnet. -* -* The Forwarding Table objects are not thread safe, thus -* callers must provide serialization. -* -* AUTHOR -* Steve King, Intel -* -*********/ -/****s* OpenSM: Forwarding Table/osm_fwd_tbl_t -* NAME -* osm_fwd_tbl_t -* -* DESCRIPTION -* Forwarding Table structure. This object hides the type -* of fowarding table (linear or random) actually used by -* the switch. -* -* This object should be treated as opaque and should -* be manipulated only through the provided functions. -* -* SYNOPSIS -*/ -typedef struct osm_fwd_tbl { - osm_rand_fwd_tbl_t *p_rnd_tbl; - osm_lin_fwd_tbl_t *p_lin_tbl; -} osm_fwd_tbl_t; -/* -* FIELDS -* p_rnd_tbl -* Pointer to the switch's Random Forwarding Table object. 
-* If the switch does not use a Random Forwarding Table, -* then this pointer is NULL. -* -* p_lin_tbl -* Pointer to the switch's Linear Forwarding Table object. -* If the switch does not use a Linear Forwarding Table, -* then this pointer is NULL. -* -* SEE ALSO -* Forwarding Table object, Random Forwarding Table object. -*********/ - -/****f* OpenSM: Forwarding Table/osm_fwd_tbl_init -* NAME -* osm_fwd_tbl_init -* -* DESCRIPTION -* Initializes a Forwarding Table object. -* -* SYNOPSIS -*/ -ib_api_status_t -osm_fwd_tbl_init(IN osm_fwd_tbl_t * const p_tbl, - IN const ib_switch_info_t * const p_si); -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* p_si -* [in] Pointer to the SwitchInfo attribute of the associated -* switch. -* -* RETURN VALUE -* IB_SUCCESS if the operation is successful. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_fwd_tbl_destroy -* NAME -* osm_fwd_tbl_destroy -* -* DESCRIPTION -* Destroys a Forwarding Table object. -* -* SYNOPSIS -*/ -void osm_fwd_tbl_destroy(IN osm_fwd_tbl_t * const p_tbl); -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* RETURN VALUE -* None. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_fwd_tbl_get -* NAME -* osm_fwd_tbl_get -* -* DESCRIPTION -* Returns the port that routes the specified LID. -* -* SYNOPSIS -*/ -static inline uint8_t -osm_fwd_tbl_get(IN const osm_fwd_tbl_t * const p_tbl, IN uint16_t const lid_ho) -{ - if (p_tbl->p_lin_tbl) - return (osm_lin_fwd_tbl_get(p_tbl->p_lin_tbl, lid_ho)); - else - return (osm_rand_fwd_tbl_get(p_tbl->p_rnd_tbl, lid_ho)); -} - -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* lid_ho -* [in] LID (host order) for which to find the route. -* -* RETURN VALUE -* Returns the port that routes the specified LID. -* IB_INVALID_PORT_NUM if the table does not have a route for this LID. 
-* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_fwd_tbl_set -* NAME -* osm_fwd_tbl_set -* -* DESCRIPTION -* Sets the port to route the specified LID. -* -* SYNOPSIS -*/ -static inline void -osm_fwd_tbl_set(IN osm_fwd_tbl_t * const p_tbl, - IN const uint16_t lid_ho, IN const uint8_t port) -{ - CL_ASSERT(p_tbl); - if (p_tbl->p_lin_tbl) - osm_lin_fwd_tbl_set(p_tbl->p_lin_tbl, lid_ho, port); - else - osm_rand_fwd_tbl_set(p_tbl->p_rnd_tbl, lid_ho, port); -} - -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* lid_ho -* [in] LID value (host order) for which to set the route. -* -* port -* [in] Port to route the specified LID value. -* -* RETURN VALUE -* None. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_fwd_tbl_set_block -* NAME -* osm_fwd_tbl_set_block -* -* DESCRIPTION -* Copies the specified block into the Forwarding Table. -* -* SYNOPSIS -*/ -static inline ib_api_status_t -osm_fwd_tbl_set_block(IN osm_fwd_tbl_t * const p_tbl, - IN const uint8_t * const p_block, - IN const uint32_t block_num) -{ - CL_ASSERT(p_tbl); - if (p_tbl->p_lin_tbl) - return (osm_lin_fwd_tbl_set_block(p_tbl->p_lin_tbl, - p_block, block_num)); - else - return (osm_rand_fwd_tbl_set_block(p_tbl->p_rnd_tbl, - p_block, block_num)); -} - -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* RETURN VALUE -* None. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_fwd_tbl_get_size -* NAME -* osm_fwd_tbl_get_size -* -* DESCRIPTION -* Returns the number of entries available in the forwarding table. -* -* SYNOPSIS -*/ -static inline uint16_t -osm_fwd_tbl_get_size(IN const osm_fwd_tbl_t * const p_tbl) -{ - CL_ASSERT(p_tbl); - if (p_tbl->p_lin_tbl) - return (osm_lin_fwd_tbl_get_size(p_tbl->p_lin_tbl)); - else - return (osm_rand_fwd_tbl_get_size(p_tbl->p_rnd_tbl)); -} - -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. 
-* -* RETURN VALUE -* Returns the number of entries available in the forwarding table. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_fwd_tbl_get_lids_per_block -* NAME -* osm_fwd_tbl_get_lids_per_block -* -* DESCRIPTION -* Returns the number of LIDs per LID block. -* -* SYNOPSIS -*/ -static inline uint16_t -osm_fwd_tbl_get_lids_per_block(IN const osm_fwd_tbl_t * const p_tbl) -{ - CL_ASSERT(p_tbl); - if (p_tbl->p_lin_tbl) - return (osm_lin_fwd_tbl_get_lids_per_block(p_tbl->p_lin_tbl)); - else - return (osm_rand_fwd_tbl_get_lids_per_block(p_tbl->p_rnd_tbl)); -} - -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* RETURN VALUE -* Returns the number of LIDs per LID block. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_fwd_tbl_get_max_block_id_in_use -* NAME -* osm_fwd_tbl_get_max_block_id_in_use -* -* DESCRIPTION -* Returns the number of LIDs per LID block. -* -* SYNOPSIS -*/ -static inline uint16_t -osm_fwd_tbl_get_max_block_id_in_use(IN const osm_fwd_tbl_t * const p_tbl, - IN const uint16_t lid_top_ho) -{ - CL_ASSERT(p_tbl); - if (p_tbl->p_lin_tbl) - return (osm_lin_fwd_tbl_get_max_block_id_in_use - (p_tbl->p_lin_tbl, lid_top_ho)); - else - return (osm_rand_fwd_tbl_get_max_block_id_in_use - (p_tbl->p_rnd_tbl, lid_top_ho)); -} - -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* RETURN VALUE -* Returns the number of LIDs per LID block. -* -* NOTES -* -* SEE ALSO -*********/ - -END_C_DECLS -#endif /* _OSM_FWD_TBL_H_ */ diff --git a/opensm/include/opensm/osm_lin_fwd_tbl.h b/opensm/include/opensm/osm_lin_fwd_tbl.h deleted file mode 100644 index be3a3ee..0000000 --- a/opensm/include/opensm/osm_lin_fwd_tbl.h +++ /dev/null @@ -1,359 +0,0 @@ -/* - * Copyright (c) 2004-2007 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. - * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. 
- * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - */ - -/* - * Abstract: - * Declaration of osm_lin_fwd_tbl_t. - * This object represents a linear forwarding table. - * This object is part of the OpenSM family of objects. 
- */ - -#ifndef _OSM_LIN_FWD_TBL_H_ -#define _OSM_LIN_FWD_TBL_H_ - -#include -#include -#include - -#ifdef __cplusplus -# define BEGIN_C_DECLS extern "C" { -# define END_C_DECLS } -#else /* !__cplusplus */ -# define BEGIN_C_DECLS -# define END_C_DECLS -#endif /* __cplusplus */ - -BEGIN_C_DECLS -/****h* OpenSM/Linear Forwarding Table -* NAME -* Linear Forwarding Table -* -* DESCRIPTION -* The Linear Forwarding Table objects encapsulate the information -* needed by the OpenSM to manage linear forwarding tables. The OpenSM -* allocates one Linear Forwarding Table object per switch in the -* IBA subnet, if that switch uses a linear table. -* -* The Linear Forwarding Table objects are not thread safe, thus -* callers must provide serialization. -* -* AUTHOR -* Steve King, Intel -* -*********/ -/****s* OpenSM: Forwarding Table/osm_lin_fwd_tbl_t -* NAME -* osm_lin_fwd_tbl_t -* -* DESCRIPTION -* Linear Forwarding Table structure. -* -* Callers may directly access this object. -* -* SYNOPSIS -*/ -typedef struct osm_lin_fwd_tbl { - uint16_t size; - uint8_t port_tbl[1]; -} osm_lin_fwd_tbl_t; -/* -* FIELDS -* Size -* Number of entries in the linear forwarding table. This value -* is taken from the SwitchInfo attribute. -* -* port_tbl -* The array that specifies the port number which routes the -* corresponding LID. Index is by LID. -* -* SEE ALSO -* Forwarding Table object, Random Forwarding Table object. -*********/ - -/****f* OpenSM: Forwarding Table/osm_lin_tbl_new -* NAME -* osm_lin_tbl_new -* -* DESCRIPTION -* This function creates and initializes a Linear Forwarding Table object. -* -* SYNOPSIS -*/ -osm_lin_fwd_tbl_t *osm_lin_tbl_new(IN uint16_t const size); -/* -* PARAMETERS -* size -* [in] Number of entries in the Linear Forwarding Table. -* -* RETURN VALUE -* On success, returns a pointer to a new Linear Forwarding Table object -* of the specified size. -* NULL otherwise. 
-* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_lin_tbl_delete -* NAME -* osm_lin_tbl_delete -* -* DESCRIPTION -* This destroys and deallocates a Linear Forwarding Table object. -* -* SYNOPSIS -*/ -void osm_lin_tbl_delete(IN osm_lin_fwd_tbl_t ** const pp_tbl); -/* -* PARAMETERS -* pp_tbl -* [in] Pointer a Pointer to the Linear Forwarding Table object. -* -* RETURN VALUE -* On success, returns a pointer to a new Linear Forwarding Table object -* of the specified size. -* NULL otherwise. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_lin_fwd_tbl_set -* NAME -* osm_lin_fwd_tbl_set -* -* DESCRIPTION -* Sets the port to route the specified LID. -* -* SYNOPSIS -*/ -static inline void -osm_lin_fwd_tbl_set(IN osm_lin_fwd_tbl_t * const p_tbl, - IN const uint16_t lid_ho, IN const uint8_t port) -{ - CL_ASSERT(lid_ho < p_tbl->size); - if (lid_ho < p_tbl->size) - p_tbl->port_tbl[lid_ho] = port; -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Linear Forwarding Table object. -* -* lid_ho -* [in] LID value (host order) for which to set the route. -* -* port -* [in] Port to route the specified LID value. -* -* RETURN VALUE -* None. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_lin_fwd_tbl_get -* NAME -* osm_lin_fwd_tbl_get -* -* DESCRIPTION -* Returns the port that routes the specified LID. -* -* SYNOPSIS -*/ -static inline uint8_t -osm_lin_fwd_tbl_get(IN const osm_lin_fwd_tbl_t * const p_tbl, - IN const uint16_t lid_ho) -{ - if (lid_ho < p_tbl->size) - return (p_tbl->port_tbl[lid_ho]); - else - return (OSM_NO_PATH); -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Linear Forwarding Table object. -* -* lid_ho -* [in] LID value (host order) for which to get the route. -* -* RETURN VALUE -* Returns the port that routes the specified LID. 
-* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_lin_fwd_tbl_get_size -* NAME -* osm_lin_fwd_tbl_get_size -* -* DESCRIPTION -* Returns the number of entries available in the forwarding table. -* -* SYNOPSIS -*/ -static inline uint16_t -osm_lin_fwd_tbl_get_size(IN const osm_lin_fwd_tbl_t * const p_tbl) -{ - return (p_tbl->size); -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* RETURN VALUE -* Returns the number of entries available in the forwarding table. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_lin_fwd_tbl_get_lids_per_block -* NAME -* osm_lin_fwd_tbl_get_lids_per_block -* -* DESCRIPTION -* Returns the number of LIDs per LID block. -* -* SYNOPSIS -*/ -static inline uint16_t -osm_lin_fwd_tbl_get_lids_per_block(IN const osm_lin_fwd_tbl_t * const p_tbl) -{ - UNUSED_PARAM(p_tbl); - return (64); -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* RETURN VALUE -* Returns the number of LIDs per LID block. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_lin_fwd_tbl_get_max_block_id_in_use -* NAME -* osm_lin_fwd_tbl_get_max_block_id_in_use -* -* DESCRIPTION -* Returns the maximum block ID in actual use by the forwarding table. -* -* SYNOPSIS -*/ -static inline uint16_t -osm_lin_fwd_tbl_get_max_block_id_in_use(IN const osm_lin_fwd_tbl_t * - const p_tbl, - IN const uint16_t lid_top_ho) -{ - return ((uint16_t) (lid_top_ho / - osm_lin_fwd_tbl_get_lids_per_block(p_tbl))); -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* RETURN VALUE -* Returns the maximum block ID in actual use by the forwarding table. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_lin_fwd_tbl_set_block -* NAME -* osm_lin_fwd_tbl_set_block -* -* DESCRIPTION -* Copies the specified block into the Linear Forwarding Table. 
-* -* SYNOPSIS -*/ -static inline ib_api_status_t -osm_lin_fwd_tbl_set_block(IN osm_lin_fwd_tbl_t * const p_tbl, - IN const uint8_t * const p_block, - IN const uint32_t block_num) -{ - uint16_t lid_start; - uint16_t num_lids; - - CL_ASSERT(p_tbl); - CL_ASSERT(p_block); - - num_lids = osm_lin_fwd_tbl_get_lids_per_block(p_tbl); - lid_start = (uint16_t) (block_num * num_lids); - - if (lid_start + num_lids > p_tbl->size) - return (IB_INVALID_PARAMETER); - - memcpy(&p_tbl->port_tbl[lid_start], p_block, num_lids); - return (IB_SUCCESS); -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Linear Forwarding Table object. -* -* p_block -* [in] Pointer to the Forwarding Table block. -* -* block_num -* [in] Block number of this block. -* -* RETURN VALUE -* None. -* -* NOTES -* -* SEE ALSO -*********/ - -END_C_DECLS -#endif /* _OSM_LIN_FWD_TBL_H_ */ diff --git a/opensm/include/opensm/osm_rand_fwd_tbl.h b/opensm/include/opensm/osm_rand_fwd_tbl.h deleted file mode 100644 index 31098b9..0000000 --- a/opensm/include/opensm/osm_rand_fwd_tbl.h +++ /dev/null @@ -1,337 +0,0 @@ -/* - * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. - * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. 
- * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - */ - -/* - * Abstract: - * Declaration of osm_rand_fwd_tbl_t. - * This object represents a random forwarding table. - * This object is part of the OpenSM family of objects. - */ - -#ifndef _OSM_RAND_FWD_TBL_H_ -#define _OSM_RAND_FWD_TBL_H_ - -#include -#include -#include - -#ifdef __cplusplus -# define BEGIN_C_DECLS extern "C" { -# define END_C_DECLS } -#else /* !__cplusplus */ -# define BEGIN_C_DECLS -# define END_C_DECLS -#endif /* __cplusplus */ - -BEGIN_C_DECLS -/****h* OpenSM/Random Forwarding Table -* NAME -* Random Forwarding Table -* -* DESCRIPTION -* The Random Forwarding Table objects encapsulate the information -* needed by the OpenSM to manage random forwarding tables. The OpenSM -* allocates one Random Forwarding Table object per switch in the -* IBA subnet, if that switch uses a random forwarding table. -* -* The Random Forwarding Table objects are not thread safe, thus -* callers must provide serialization. -* -* ** RANDOM FORWARDING TABLES ARE NOT SUPPORTED IN THE CURRENT VERSION ** -* -* AUTHOR -* Steve King, Intel -* -*********/ -/****s* OpenSM: Forwarding Table/osm_rand_fwd_tbl_t -* NAME -* osm_rand_fwd_tbl_t -* -* DESCRIPTION -* Random Forwarding Table structure. -* -* THIS OBJECT IS PLACE HOLDER. 
SUPPORT FOR SWITCHES WITH -* RANDOM FORWARDING TABLES HAS NOT BEEN IMPLEMENTED YET. -* -* SYNOPSIS -*/ -typedef struct osm_rand_fwd_tbl { - /* PLACE HOLDER STRUCTURE ONLY!! */ - uint32_t size; -} osm_rand_fwd_tbl_t; -/* -* FIELDS -* RANDOM FORWARDING TABLES ARE NOT SUPPORTED YET!! -* -* SEE ALSO -* Forwarding Table object, Random Forwarding Table object. -*********/ - -/****f* OpenSM: Forwarding Table/osm_rand_tbl_delete -* NAME -* osm_rand_tbl_delete -* -* DESCRIPTION -* This destroys and deallocates a Random Forwarding Table object. -* -* SYNOPSIS -*/ -static inline void osm_rand_tbl_delete(IN osm_rand_fwd_tbl_t ** const pp_tbl) -{ - /* - TO DO - This is a place holder function only! - */ - free(*pp_tbl); - *pp_tbl = NULL; -} -/* -* PARAMETERS -* pp_tbl -* [in] Pointer a Pointer to the Random Forwarding Table object. -* -* RETURN VALUE -* On success, returns a pointer to a new Random Forwarding Table object -* of the specified size. -* NULL otherwise. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_rand_fwd_tbl_set -* NAME -* osm_rand_fwd_tbl_set -* -* DESCRIPTION -* Sets the port to route the specified LID. -* -* SYNOPSIS -*/ -static inline void -osm_rand_fwd_tbl_set(IN osm_rand_fwd_tbl_t * const p_tbl, - IN const uint16_t lid_ho, IN const uint8_t port) -{ - /* Random forwarding tables not supported yet. */ - UNUSED_PARAM(p_tbl); - UNUSED_PARAM(lid_ho); - UNUSED_PARAM(port); - CL_ASSERT(FALSE); -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Random Forwarding Table object. -* -* lid_ho -* [in] LID value (host order) for which to set the route. -* -* port -* [in] Port to route the specified LID value. -* -* RETURN VALUE -* None. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_rand_fwd_tbl_set_block -* NAME -* osm_rand_fwd_tbl_set_block -* -* DESCRIPTION -* Copies the specified block into the Random Forwarding Table. 
-* -* SYNOPSIS -*/ -static inline ib_api_status_t -osm_rand_fwd_tbl_set_block(IN osm_rand_fwd_tbl_t * const p_tbl, - IN const uint8_t * const p_block, - IN const uint32_t block_num) -{ - /* Random forwarding tables not supported yet. */ - UNUSED_PARAM(p_tbl); - UNUSED_PARAM(p_block); - UNUSED_PARAM(block_num); - CL_ASSERT(FALSE); - return (IB_ERROR); -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Random Forwarding Table object. -* -* p_block -* [in] Pointer to the Forwarding Table block. -* -* block_num -* [in] Block number of this block. -* -* RETURN VALUE -* None. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_rand_fwd_tbl_get -* NAME -* osm_rand_fwd_tbl_get -* -* DESCRIPTION -* Returns the port that routes the specified LID. -* -* SYNOPSIS -*/ -static inline uint8_t -osm_rand_fwd_tbl_get(IN const osm_rand_fwd_tbl_t * const p_tbl, - IN const uint16_t lid_ho) -{ - CL_ASSERT(FALSE); - UNUSED_PARAM(p_tbl); - UNUSED_PARAM(lid_ho); - - return (OSM_NO_PATH); -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Random Forwarding Table object. -* -* lid_ho -* [in] LID value (host order) for which to get the route. -* -* RETURN VALUE -* Returns the port that routes the specified LID. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_rand_fwd_tbl_get_lids_per_block -* NAME -* osm_rand_fwd_tbl_get_lids_per_block -* -* DESCRIPTION -* Returns the number of LIDs per LID block. -* -* SYNOPSIS -*/ -static inline uint16_t -osm_rand_fwd_tbl_get_lids_per_block(IN const osm_rand_fwd_tbl_t * const p_tbl) -{ - UNUSED_PARAM(p_tbl); - return (16); -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* RETURN VALUE -* Returns the number of LIDs per LID block. 
-* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_rand_fwd_tbl_get_max_block_id_in_use -* NAME -* osm_rand_fwd_tbl_get_max_block_id_in_use -* -* DESCRIPTION -* Returns the maximum block ID in actual use by the forwarding table. -* -* SYNOPSIS -*/ -static inline uint16_t -osm_rand_fwd_tbl_get_max_block_id_in_use(IN const osm_rand_fwd_tbl_t * - const p_tbl, - IN const uint16_t lid_top_ho) -{ - UNUSED_PARAM(p_tbl); - UNUSED_PARAM(lid_top_ho); - CL_ASSERT(FALSE); - return (0); -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* RETURN VALUE -* Returns the maximum block ID in actual use by the forwarding table. -* -* NOTES -* -* SEE ALSO -*********/ - -/****f* OpenSM: Forwarding Table/osm_rand_fwd_tbl_get_size -* NAME -* osm_rand_fwd_tbl_get_size -* -* DESCRIPTION -* Returns the number of entries available in the forwarding table. -* -* SYNOPSIS -*/ -static inline uint16_t -osm_rand_fwd_tbl_get_size(IN const osm_rand_fwd_tbl_t * const p_tbl) -{ - UNUSED_PARAM(p_tbl); - CL_ASSERT(FALSE); - return (0); -} -/* -* PARAMETERS -* p_tbl -* [in] Pointer to the Forwarding Table object. -* -* RETURN VALUE -* Returns the number of entries available in the forwarding table. 
-* -* NOTES -* -* SEE ALSO -*********/ - -END_C_DECLS -#endif /* _OSM_RAND_FWD_TBL_H_ */ diff --git a/opensm/opensm/Makefile.am b/opensm/opensm/Makefile.am index 1d345a5..01573d2 100644 --- a/opensm/opensm/Makefile.am +++ b/opensm/opensm/Makefile.am @@ -27,9 +27,9 @@ libopensm_la_DEPENDENCIES = $(srcdir)/libopensm.map sbin_PROGRAMS = opensm opensm_DEPENDENCIES = libopensm.la opensm_SOURCES = main.c osm_console_io.c osm_console.c osm_db_files.c \ - osm_db_pack.c osm_drop_mgr.c osm_fwd_tbl.c \ + osm_db_pack.c osm_drop_mgr.c \ osm_inform.c osm_lid_mgr.c osm_lin_fwd_rcv.c \ - osm_lin_fwd_tbl.c osm_link_mgr.c osm_mcast_fwd_rcv.c \ + osm_link_mgr.c osm_mcast_fwd_rcv.c \ osm_mcast_mgr.c osm_mcast_tbl.c osm_mcm_info.c \ osm_mcm_port.c osm_mtree.c osm_multicast.c osm_node.c \ osm_node_desc_rcv.c osm_node_info_rcv.c \ @@ -74,11 +74,9 @@ opensminclude_HEADERS = \ $(srcdir)/../include/opensm/osm_db_pack.h \ $(srcdir)/../include/opensm/osm_event_plugin.h \ $(srcdir)/../include/opensm/osm_errors.h \ - $(srcdir)/../include/opensm/osm_fwd_tbl.h \ $(srcdir)/../include/opensm/osm_helper.h \ $(srcdir)/../include/opensm/osm_inform.h \ $(srcdir)/../include/opensm/osm_lid_mgr.h \ - $(srcdir)/../include/opensm/osm_lin_fwd_tbl.h \ $(srcdir)/../include/opensm/osm_log.h \ $(srcdir)/../include/opensm/osm_mad_pool.h \ $(srcdir)/../include/opensm/osm_madw.h \ @@ -100,7 +98,6 @@ opensminclude_HEADERS = \ $(srcdir)/../include/opensm/osm_port_profile.h \ $(srcdir)/../include/opensm/osm_prefix_route.h \ $(srcdir)/../include/opensm/osm_qos_policy.h \ - $(srcdir)/../include/opensm/osm_rand_fwd_tbl.h \ $(srcdir)/../include/opensm/osm_remote_sm.h \ $(srcdir)/../include/opensm/osm_router.h \ $(srcdir)/../include/opensm/osm_sa.h \ diff --git a/opensm/opensm/osm_fwd_tbl.c b/opensm/opensm/osm_fwd_tbl.c deleted file mode 100644 index 2ea74af..0000000 --- a/opensm/opensm/osm_fwd_tbl.c +++ /dev/null @@ -1,100 +0,0 @@ -/* - * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved. 
- * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. - * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - */ - -/* - * Abstract: - * Implementation of osm_fwd_tbl_t. - * This object represents a unicast forwarding table. - * This object is part of the opensm family of objects. 
- */ - -#if HAVE_CONFIG_H -# include -#endif /* HAVE_CONFIG_H */ - -#include -#include -#include - -/********************************************************************** - **********************************************************************/ -ib_api_status_t -osm_fwd_tbl_init(IN osm_fwd_tbl_t * const p_tbl, - IN const ib_switch_info_t * const p_si) -{ - uint16_t tbl_cap; - ib_api_status_t status = IB_SUCCESS; - - /* - Determine the type and size of the forwarding table - used by this switch, then initialize accordingly. - The current implementation only supports switches - with linear forwarding tables. - */ - tbl_cap = cl_ntoh16(p_si->lin_cap); - - if (tbl_cap == 0) { - /* - This switch does not support linear forwarding - tables. Error out for now. - */ - status = IB_UNSUPPORTED; - goto Exit; - } - - p_tbl->p_rnd_tbl = NULL; - - p_tbl->p_lin_tbl = osm_lin_tbl_new(tbl_cap); - - if (p_tbl->p_lin_tbl == NULL) { - status = IB_INSUFFICIENT_MEMORY; - goto Exit; - } - -Exit: - return (status); -} - -/********************************************************************** - **********************************************************************/ -void osm_fwd_tbl_destroy(IN osm_fwd_tbl_t * const p_tbl) -{ - if (p_tbl->p_lin_tbl) { - CL_ASSERT(p_tbl->p_rnd_tbl == NULL); - osm_lin_tbl_delete(&p_tbl->p_lin_tbl); - } else { - osm_rand_tbl_delete(&p_tbl->p_rnd_tbl); - } -} diff --git a/opensm/opensm/osm_lin_fwd_tbl.c b/opensm/opensm/osm_lin_fwd_tbl.c deleted file mode 100644 index 7d1eeff..0000000 --- a/opensm/opensm/osm_lin_fwd_tbl.c +++ /dev/null @@ -1,88 +0,0 @@ -/* - * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved. - * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. - * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. 
You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - */ - -/* - * Abstract: - * Implementation of osm_lin_fwd_tbl_t. - * This object represents an linear forwarding table. - * This object is part of the opensm family of objects. 
- */ - -#if HAVE_CONFIG_H -# include -#endif /* HAVE_CONFIG_H */ - -#include -#include -#include -#include -#include - -static inline size_t __osm_lin_tbl_compute_obj_size(IN const uint16_t num_lids) -{ - return (sizeof(osm_lin_fwd_tbl_t) + (num_lids - 1)); -} - -/********************************************************************** - **********************************************************************/ -osm_lin_fwd_tbl_t *osm_lin_tbl_new(IN uint16_t const size) -{ - osm_lin_fwd_tbl_t *p_tbl; - - /* - The capacity reported by the switch includes LID 0, - so add 1 to the end of the range here for this assert. - */ - CL_ASSERT(size <= IB_LID_UCAST_END_HO + 1); - p_tbl = - (osm_lin_fwd_tbl_t *) malloc(__osm_lin_tbl_compute_obj_size(size)); - - /* - Initialize the table to OSM_NO_PATH, which means "invalid port" - */ - if (p_tbl != NULL) { - memset(p_tbl, OSM_NO_PATH, __osm_lin_tbl_compute_obj_size(size)); - p_tbl->size = size; - } - return (p_tbl); -} - -/********************************************************************** - **********************************************************************/ -void osm_lin_tbl_delete(IN osm_lin_fwd_tbl_t ** const pp_tbl) -{ - free(*pp_tbl); - *pp_tbl = NULL; -} -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Wed Oct 29 06:01:48 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 29 Oct 2008 15:01:48 +0200 Subject: [ofa-general] [PATCH 1/2 v2] opensm: replace switch's fwd_tbl with simple LFT Message-ID: <49085EBC.4020501@dev.mellanox.co.il> Replace the unnecessarily complex switch's forwarding table implementation with a simple LFT that is implemented as plain uint8_t array. 
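As a rough illustration of the idea (not the patch itself), a plain uint8_t array indexed by LID could be handled as in the sketch below. The names simple_lft_*, LID_UCAST_END_HO, NO_PATH and BLOCK_SIZE are stand-ins invented for this example; the values 0xBFFF, 0xFF and 64 are assumed to correspond to IB_LID_UCAST_END_HO, OSM_NO_PATH and IB_SMP_DATA_SIZE.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed values, named differently from the OpenSM constants on purpose. */
#define LID_UCAST_END_HO 0xBFFF	/* last unicast LID, host order */
#define NO_PATH          0xFF	/* "invalid port" marker */
#define BLOCK_SIZE       64	/* LFT entries carried per block */

/* Hypothetical stand-in for the switch object: one port number per LID. */
struct simple_lft {
	uint8_t *ports;
};

/* Allocate the table and mark every LID as having no path. */
static int simple_lft_init(struct simple_lft *t)
{
	t->ports = malloc(LID_UCAST_END_HO + 1);
	if (!t->ports)
		return -1;
	memset(t->ports, NO_PATH, LID_UCAST_END_HO + 1);
	return 0;
}

/* Look up the output port for a LID; out-of-range LIDs yield NO_PATH. */
static uint8_t simple_lft_get(const struct simple_lft *t, uint16_t lid_ho)
{
	if (lid_ho == 0 || lid_ho > LID_UCAST_END_HO)
		return NO_PATH;
	return t->ports[lid_ho];
}

/* Copy one 64-entry block into the table; reject blocks past the end. */
static int simple_lft_set_block(struct simple_lft *t, const uint8_t *block,
				uint32_t block_num)
{
	if (block_num > LID_UCAST_END_HO / BLOCK_SIZE)
		return -1;
	memcpy(&t->ports[block_num * BLOCK_SIZE], block, BLOCK_SIZE);
	return 0;
}
```

The sketch bounds-checks on the block index rather than on a computed LID, which avoids integer wrap-around for large block numbers; that is a choice made for this example only.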
[v2- fixed two remarks that I got with the previous version] Signed-off-by: Yevgeny Kliteynik --- opensm/include/opensm/osm_port_profile.h | 1 - opensm/include/opensm/osm_router.h | 1 - opensm/include/opensm/osm_switch.h | 138 +++++++----------------------- opensm/opensm/osm_console.c | 5 +- opensm/opensm/osm_lin_fwd_rcv.c | 2 +- opensm/opensm/osm_sa_lft_record.c | 2 +- opensm/opensm/osm_sw_info_rcv.c | 4 +- opensm/opensm/osm_switch.c | 68 ++++++--------- opensm/opensm/osm_ucast_file.c | 2 +- opensm/opensm/osm_ucast_lash.c | 1 - opensm/opensm/osm_ucast_mgr.c | 12 ++- 11 files changed, 73 insertions(+), 163 deletions(-) diff --git a/opensm/include/opensm/osm_port_profile.h b/opensm/include/opensm/osm_port_profile.h index 00d83e4..fd22719 100644 --- a/opensm/include/opensm/osm_port_profile.h +++ b/opensm/include/opensm/osm_port_profile.h @@ -51,7 +51,6 @@ #include #include #include -#include #include #ifdef __cplusplus diff --git a/opensm/include/opensm/osm_router.h b/opensm/include/opensm/osm_router.h index 8cabdf8..4901aca 100644 --- a/opensm/include/opensm/osm_router.h +++ b/opensm/include/opensm/osm_router.h @@ -48,7 +48,6 @@ #include #include #include -#include #include #include diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h index 98eb64d..a27af20 100644 --- a/opensm/include/opensm/osm_switch.h +++ b/opensm/include/opensm/osm_switch.h @@ -48,7 +48,6 @@ #include #include #include -#include #include #include @@ -101,7 +100,7 @@ typedef struct osm_switch { uint16_t num_hops; uint8_t **hops; osm_port_profile_t *p_prof; - osm_fwd_tbl_t fwd_tbl; + uint8_t *lft; uint8_t *lft_buf; osm_mcast_tbl_t mcast_tbl; uint32_t discovery_count; @@ -135,8 +134,8 @@ typedef struct osm_switch { * p_prof * Pointer to array of Port Profile objects for this switch. * -* fwd_tbl -* This switch's forwarding table. +* lft +* This switch's linear forwarding table. 
* * lft_buf * This switch's linear forwarding table, as was @@ -275,33 +274,6 @@ osm_switch_get_hop_count(IN const osm_switch_t * const p_sw, * SEE ALSO *********/ -/****f* OpenSM: Switch/osm_switch_get_fwd_tbl_ptr -* NAME -* osm_switch_get_fwd_tbl_ptr -* -* DESCRIPTION -* Returns a pointer to the switch's forwarding table. -* -* SYNOPSIS -*/ -static inline osm_fwd_tbl_t *osm_switch_get_fwd_tbl_ptr(IN const osm_switch_t * - const p_sw) -{ - return ((osm_fwd_tbl_t *) & p_sw->fwd_tbl); -} -/* -* PARAMETERS -* p_sw -* [in] Pointer to a Switch object. -* -* RETURN VALUES -* Returns a pointer to the switch's forwarding table. -* -* NOTES -* -* SEE ALSO -*********/ - /****f* OpenSM: Switch/osm_switch_set_hops * NAME * osm_switch_set_hops @@ -437,7 +409,9 @@ static inline uint8_t osm_switch_get_port_by_lid(IN const osm_switch_t * const p_sw, IN const uint16_t lid_ho) { - return (osm_fwd_tbl_get(&p_sw->fwd_tbl, lid_ho)); + if (lid_ho == 0 || lid_ho > IB_LID_UCAST_END_HO) + return OSM_NO_PATH; + return p_sw->lft[lid_ho]; } /* * PARAMETERS @@ -500,12 +474,13 @@ static inline osm_physp_t *osm_switch_get_route_by_lid(IN const osm_switch_t * const p_sw, IN const ib_net16_t lid) { - uint8_t port_num; + uint8_t port_num = OSM_NO_PATH; CL_ASSERT(p_sw); CL_ASSERT(lid); - port_num = osm_fwd_tbl_get(&p_sw->fwd_tbl, cl_ntoh16(lid)); + port_num = osm_switch_get_port_by_lid(p_sw, cl_ntoh16(lid)); + /* In order to avoid holes in the subnet (usually happens when running UPDN algorithm), i.e. cases where port is @@ -572,35 +547,6 @@ osm_switch_sp0_is_lmc_capable(IN const osm_switch_t * const p_sw, * SEE ALSO *********/ -/****f* OpenSM: Switch/osm_switch_get_max_block_id -* NAME -* osm_switch_get_max_block_id -* -* DESCRIPTION -* Returns the maximum block ID (host order) of this switch. 
-* -* SYNOPSIS -*/ -static inline uint32_t -osm_switch_get_max_block_id(IN const osm_switch_t * const p_sw) -{ - return ((uint32_t) (osm_fwd_tbl_get_size(&p_sw->fwd_tbl) / - osm_fwd_tbl_get_lids_per_block(&p_sw->fwd_tbl))); -} -/* -* PARAMETERS -* p_sw -* [in] Pointer to an osm_switch_t object. -* -* RETURN VALUES -* Returns the maximum block ID (host order) of this switch. -* -* NOTES -* -* SEE ALSO -* Switch object -*********/ - /****f* OpenSM: Switch/osm_switch_get_max_block_id_in_use * NAME * osm_switch_get_max_block_id_in_use @@ -614,9 +560,8 @@ osm_switch_get_max_block_id(IN const osm_switch_t * const p_sw) static inline uint16_t osm_switch_get_max_block_id_in_use(IN const osm_switch_t * const p_sw) { - return (osm_fwd_tbl_get_max_block_id_in_use(&p_sw->fwd_tbl, - cl_ntoh16(p_sw->switch_info. - lin_top))); + return (uint16_t)(cl_ntoh16(p_sw->switch_info.lin_top) / + IB_SMP_DATA_SIZE); } /* * PARAMETERS @@ -632,19 +577,19 @@ osm_switch_get_max_block_id_in_use(IN const osm_switch_t * const p_sw) * Switch object *********/ -/****f* OpenSM: Switch/osm_switch_get_fwd_tbl_block +/****f* OpenSM: Switch/osm_switch_get_lft_block * NAME -* osm_switch_get_fwd_tbl_block +* osm_switch_get_lft_block * * DESCRIPTION -* Retrieve a forwarding table block. +* Retrieve a linear forwarding table block. * * SYNOPSIS */ boolean_t -osm_switch_get_fwd_tbl_block(IN const osm_switch_t * const p_sw, - IN const uint32_t block_id, - OUT uint8_t * const p_block); +osm_switch_get_lft_block(IN const osm_switch_t * const p_sw, + IN const uint16_t block_id, + OUT uint8_t * const p_block); /* * PARAMETERS * p_sw @@ -758,22 +703,30 @@ osm_switch_count_path(IN osm_switch_t * const p_sw, IN const uint8_t port) * SEE ALSO *********/ -/****f* OpenSM: Switch/osm_switch_set_ft_block +/****f* OpenSM: Switch/osm_switch_set_lft_block * NAME -* osm_switch_set_ft_block +* osm_switch_set_lft_block * * DESCRIPTION -* Copies in the specified block into the switch's Forwarding Table object. 
+* Copies in the specified block into +* the switch's Linear Forwarding Table. * * SYNOPSIS */ static inline ib_api_status_t -osm_switch_set_ft_block(IN osm_switch_t * const p_sw, - IN const uint8_t * const p_block, - IN const uint32_t block_num) +osm_switch_set_lft_block(IN osm_switch_t * const p_sw, + IN const uint8_t * const p_block, + IN const uint32_t block_num) { + uint16_t lid_start = + (uint16_t) (block_num * IB_SMP_DATA_SIZE); CL_ASSERT(p_sw); - return (osm_fwd_tbl_set_block(&p_sw->fwd_tbl, p_block, block_num)); + + if (lid_start + IB_SMP_DATA_SIZE > IB_LID_UCAST_END_HO) + return IB_INVALID_PARAMETER; + + memcpy(&p_sw->lft[lid_start], p_block, IB_SMP_DATA_SIZE); + return IB_SUCCESS; } /* * PARAMETERS @@ -1044,33 +997,6 @@ osm_switch_recommend_mcast_path(IN osm_switch_t * const p_sw, * SEE ALSO *********/ -/****f* OpenSM: Switch/osm_switch_get_fwd_tbl_size -* NAME -* osm_switch_get_fwd_tbl_size -* -* DESCRIPTION -* Returns the number of entries available in the forwarding table. -* -* SYNOPSIS -*/ -static inline uint16_t -osm_switch_get_fwd_tbl_size(IN const osm_switch_t * const p_sw) -{ - return (osm_fwd_tbl_get_size(&p_sw->fwd_tbl)); -} -/* -* PARAMETERS -* p_sw -* [in] Pointer to the switch. -* -* RETURN VALUE -* Returns the number of entries available in the forwarding table. 
-* -* NOTES -* -* SEE ALSO -*********/ - /****f* OpenSM: Switch/osm_switch_get_mcast_fwd_tbl_size * NAME * osm_switch_get_mcast_fwd_tbl_size diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c index 18168ff..d9bbbc2 100644 --- a/opensm/opensm/osm_console.c +++ b/opensm/opensm/osm_console.c @@ -52,7 +52,6 @@ #include #include #include -#include struct command { char *name; @@ -765,7 +764,7 @@ static void switchbalance_check(osm_opensm_t * p_osm, continue; for (lid_ho = min_lid_ho; lid_ho <= max_lid_ho; lid_ho++) { - port_num = osm_fwd_tbl_get(&(p_sw->fwd_tbl), lid_ho); + port_num = osm_switch_get_port_by_lid(p_sw, lid_ho); if (port_num == OSM_NO_PATH) continue; @@ -915,7 +914,7 @@ static void lidbalance_check(osm_opensm_t * p_osm, boolean_t rem_node_found = FALSE; unsigned int indx = 0; - port_num = osm_fwd_tbl_get(&(p_sw->fwd_tbl), lid_ho); + port_num = osm_switch_get_port_by_lid(p_sw, lid_ho); if (port_num == OSM_NO_PATH) continue; diff --git a/opensm/opensm/osm_lin_fwd_rcv.c b/opensm/opensm/osm_lin_fwd_rcv.c index c5cbfb5..c3d8633 100644 --- a/opensm/opensm/osm_lin_fwd_rcv.c +++ b/opensm/opensm/osm_lin_fwd_rcv.c @@ -87,7 +87,7 @@ void osm_lft_rcv_process(IN void *context, IN void *data) "LFT received for nonexistent node " "0x%" PRIx64 "\n", cl_ntoh64(node_guid)); } else { - status = osm_switch_set_ft_block(p_sw, p_block, block_num); + status = osm_switch_set_lft_block(p_sw, p_block, block_num); if (status != IB_SUCCESS) { OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0402: " "Setting forwarding table block failed (%s)" diff --git a/opensm/opensm/osm_sa_lft_record.c b/opensm/opensm/osm_sa_lft_record.c index cdca430..d84a6a5 100644 --- a/opensm/opensm/osm_sa_lft_record.c +++ b/opensm/opensm/osm_sa_lft_record.c @@ -100,7 +100,7 @@ __osm_lftr_rcv_new_lftr(IN osm_sa_t * sa, p_rec_item->rec.block_num = cl_hton16(block); /* copy the lft block */ - osm_switch_get_fwd_tbl_block(p_sw, block, p_rec_item->rec.lft); + osm_switch_get_lft_block(p_sw, block, 
p_rec_item->rec.lft); cl_qlist_insert_tail(p_list, &p_rec_item->list_item); diff --git a/opensm/opensm/osm_sw_info_rcv.c b/opensm/opensm/osm_sw_info_rcv.c index 6ee1538..e9973e3 100644 --- a/opensm/opensm/osm_sw_info_rcv.c +++ b/opensm/opensm/osm_sw_info_rcv.c @@ -298,8 +298,8 @@ __osm_si_rcv_process_new(IN osm_sm_t * sm, } /* set subnet max unicast lid to the minimum LinearFDBCap of all switches */ - if (p_sw->fwd_tbl.p_lin_tbl->size < sm->p_subn->max_ucast_lid_ho) { - sm->p_subn->max_ucast_lid_ho = p_sw->fwd_tbl.p_lin_tbl->size; + if (cl_ntoh16(p_si->lin_cap) < sm->p_subn->max_ucast_lid_ho) { + sm->p_subn->max_ucast_lid_ho = cl_ntoh16(p_si->lin_cap); OSM_LOG(sm->p_log, OSM_LOG_VERBOSE, "Subnet max unicast lid is 0x%X\n", sm->p_subn->max_ucast_lid_ho); diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c index 9bf76e0..4b07dbc 100644 --- a/opensm/opensm/osm_switch.c +++ b/opensm/opensm/osm_switch.c @@ -97,9 +97,22 @@ osm_switch_init(IN osm_switch_t * const p_sw, p_sw->num_ports = num_ports; p_sw->need_update = 2; - status = osm_fwd_tbl_init(&p_sw->fwd_tbl, p_si); - if (status != IB_SUCCESS) + /* Initiate the linear forwarding table */ + + if (!p_si->lin_cap) { + /* This switch does not support linear forwarding tables */ + status = IB_UNSUPPORTED; + goto Exit; + } + + p_sw->lft = malloc(IB_LID_UCAST_END_HO + 1); + if (!p_sw->lft) { + status = IB_INSUFFICIENT_MEMORY; goto Exit; + } + + /* Initialize the table to OSM_NO_PATH, which is "invalid port" */ + memset(p_sw->lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); p_sw->lft_buf = malloc(IB_LID_UCAST_END_HO + 1); if (!p_sw->lft_buf) { @@ -138,7 +151,8 @@ void osm_switch_delete(IN OUT osm_switch_t ** const pp_sw) osm_mcast_tbl_destroy(&p_sw->mcast_tbl); free(p_sw->p_prof); - osm_fwd_tbl_destroy(&p_sw->fwd_tbl); + if (p_sw->lft) + free(p_sw->lft); if (p_sw->lft_buf) free(p_sw->lft_buf); if (p_sw->hops) { @@ -176,49 +190,21 @@ osm_switch_t *osm_switch_new(IN osm_node_t * const p_node, 
/********************************************************************** **********************************************************************/ boolean_t -osm_switch_get_fwd_tbl_block(IN const osm_switch_t * const p_sw, - IN const uint32_t block_id, - OUT uint8_t * const p_block) +osm_switch_get_lft_block(IN const osm_switch_t * const p_sw, + IN const uint16_t block_id, + OUT uint8_t * const p_block) { - uint16_t base_lid_ho; - uint16_t max_lid_ho; - uint16_t lid_ho; - uint16_t block_top_lid_ho; - uint32_t lids_per_block; - osm_fwd_tbl_t *p_tbl; - boolean_t return_flag = FALSE; + uint16_t base_lid_ho = block_id * IB_SMP_DATA_SIZE; CL_ASSERT(p_sw); CL_ASSERT(p_block); - p_tbl = osm_switch_get_fwd_tbl_ptr(p_sw); - max_lid_ho = p_sw->max_lid_ho; - lids_per_block = osm_fwd_tbl_get_lids_per_block(&p_sw->fwd_tbl); - base_lid_ho = (uint16_t) (block_id * lids_per_block); - - if (base_lid_ho <= max_lid_ho) { - /* Initialize LIDs in block to invalid port number. */ - memset(p_block, OSM_NO_PATH, IB_SMP_DATA_SIZE); - /* - Determine the range of LIDs we can return with this block. - */ - block_top_lid_ho = - (uint16_t) (base_lid_ho + lids_per_block - 1); - if (block_top_lid_ho > max_lid_ho) - block_top_lid_ho = max_lid_ho; - - /* - Configure the forwarding table with the routing - information for the specified block of LIDs. - */ - for (lid_ho = base_lid_ho; lid_ho <= block_top_lid_ho; lid_ho++) - p_block[lid_ho - base_lid_ho] = - osm_fwd_tbl_get(p_tbl, lid_ho); - - return_flag = TRUE; - } + if (base_lid_ho > p_sw->max_lid_ho) + return FALSE; - return (return_flag); + CL_ASSERT(base_lid_ho + IB_SMP_DATA_SIZE <= IB_LID_UCAST_END_HO); + memcpy(p_block, &(p_sw->lft[base_lid_ho]), IB_SMP_DATA_SIZE); + return TRUE; } /********************************************************************** @@ -359,7 +345,7 @@ osm_switch_recommend_path(IN const osm_switch_t * const p_sw, 4. 
the port has min-hops to the target (avoid loops) */ if (!ignore_existing) { - port_num = osm_fwd_tbl_get(&p_sw->fwd_tbl, lid_ho); + port_num = osm_switch_get_port_by_lid(p_sw, lid_ho); if (port_num != OSM_NO_PATH) { CL_ASSERT(port_num < num_ports); diff --git a/opensm/opensm/osm_ucast_file.c b/opensm/opensm/osm_ucast_file.c index a6edf5d..865ad82 100644 --- a/opensm/opensm/osm_ucast_file.c +++ b/opensm/opensm/osm_ucast_file.c @@ -83,7 +83,7 @@ static void add_path(osm_opensm_t * p_osm, uint8_t old_port; new_lid = port_guid ? remap_lid(p_osm, lid, port_guid) : lid; - old_port = osm_fwd_tbl_get(osm_switch_get_fwd_tbl_ptr(p_sw), new_lid); + old_port = osm_switch_get_port_by_lid(p_sw, new_lid); if (old_port != OSM_NO_PATH && old_port != port_num) { OSM_LOG(&p_osm->log, OSM_LOG_VERBOSE, "LID collision is detected on switch " diff --git a/opensm/opensm/osm_ucast_lash.c b/opensm/opensm/osm_ucast_lash.c index 1036c9f..c082798 100644 --- a/opensm/opensm/osm_ucast_lash.c +++ b/opensm/opensm/osm_ucast_lash.c @@ -52,7 +52,6 @@ #include #include #include -#include /* //////////////////////////// */ /* Local types */ diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index 3bc3912..adb6688 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -247,7 +247,7 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr, lid_ho, min_lid_ho, max_lid_ho); /* TODO - This should be runtime error, not a CL_ASSERT() */ - CL_ASSERT(max_lid_ho < osm_switch_get_fwd_tbl_size(p_sw)); + CL_ASSERT(max_lid_ho <= IB_LID_UCAST_END_HO); node_guid = osm_node_get_node_guid(p_sw->p_node); @@ -320,7 +320,7 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr, osm_madw_context_t context; ib_api_status_t status; ib_switch_info_t si; - uint32_t block_id_ho = 0; + uint16_t block_id_ho = 0; uint8_t block[IB_SMP_DATA_SIZE]; boolean_t set_swinfo_require = FALSE; uint16_t lin_top; @@ -393,17 +393,19 @@ int osm_ucast_mgr_set_fwd_table(IN 
osm_ucast_mgr_t * const p_mgr, context.lft_context.set_method = TRUE; for (block_id_ho = 0; - osm_switch_get_fwd_tbl_block(p_sw, block_id_ho, block); + osm_switch_get_lft_block(p_sw, block_id_ho, block); block_id_ho++) { if (!p_sw->need_update && - !memcmp(block, p_sw->lft_buf + block_id_ho * 64, 64)) + !memcmp(block, + p_sw->lft_buf + block_id_ho * IB_SMP_DATA_SIZE, + IB_SMP_DATA_SIZE)) continue; OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, "Writing FT block %u\n", block_id_ho); status = osm_req_set(p_mgr->sm, p_path, - p_sw->lft_buf + block_id_ho * 64, + p_sw->lft_buf + block_id_ho * IB_SMP_DATA_SIZE, sizeof(block), IB_MAD_ATTR_LIN_FWD_TBL, cl_hton32(block_id_ho), -- 1.5.1.4 From hal.rosenstock at gmail.com Wed Oct 29 06:06:09 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 29 Oct 2008 09:06:09 -0400 Subject: [ofa-general] [opensm] remove qos_max_vls config?? In-Reply-To: <4908292A.40004@dev.mellanox.co.il> References: <1225237965.3358.9.camel@whatsup> <4908292A.40004@dev.mellanox.co.il> Message-ID: On Wed, Oct 29, 2008 at 5:13 AM, Yevgeny Kliteynik wrote: > Al Chu wrote: >> >> Hey Sasha, >> >> I was working on a different bug fix on the qos config parsing, when I >> noticed the qos_*max_vls fields aren't used anywhere. They seem to be >> parsed from the config, stored, and never used. Maybe it used to be >> what 'max_op_vls' is now used for? > > I guess that the initial idea was to have an option to configure > different operational VLs on different type of nodes in the subnet. > The question is, does having such option make sense? Does it impact buffering ? If so, in those cases it would be worth configuring (assuming it gets acted on elsewhere). -- Hal > -- Yevgeny > >> If there's still a purpose for it in the future, obviously no issue on >> leaving in there. Patch is attached to remove it everywhere I found it. 
>> >> Al >> >> >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Wed Oct 29 06:41:21 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 29 Oct 2008 15:41:21 +0200 Subject: [ofa-general] Re: [PATCH 1/2] opensm: replace switch's fwd_tbl with simple LFT In-Reply-To: <490845A2.2060907@dev.mellanox.co.il> References: <48F7B3D7.3070004@dev.mellanox.co.il> <20081018234814.GU5528@sashak.voltaire.com> <490845A2.2060907@dev.mellanox.co.il> Message-ID: <20081029134121.GA15321@sashak.voltaire.com> On 13:14 Wed 29 Oct , Yevgeny Kliteynik wrote: > Sasha Khapyorsky wrote: >> Hi Yevgeny, >> On 23:36 Thu 16 Oct , Yevgeny Kliteynik wrote: >>> Replace the unnecessarily complex switch's forwarding table >>> implementation with a simple LFT that is implemented as plain >>> uint8_t array. >>> >>> Signed-off-by: Yevgeny Kliteynik >>> --- >> [snip...] 
>>> diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c
>>> index 9bf76e0..bdfc7d0 100644
>>> --- a/opensm/opensm/osm_switch.c
>>> +++ b/opensm/opensm/osm_switch.c
>>> @@ -97,9 +97,26 @@ osm_switch_init(IN osm_switch_t * const p_sw,
>>> p_sw->num_ports = num_ports;
>>> p_sw->need_update = 2;
>>>
>>> - status = osm_fwd_tbl_init(&p_sw->fwd_tbl, p_si);
>>> - if (status != IB_SUCCESS)
>>> + /* Initiate the linear forwarding table */
>>> +
>>> + if (!p_si->lin_cap) {
>>> + /* This switch does not support linear forwarding tables */
>>> + status = IB_UNSUPPORTED;
>>> goto Exit;
>>> + }
>>> +
>>> + /* The capacity reported by the switch includes LID 0,
>>> + so add 1 to the end of the range here for this assert. */
>>> + CL_ASSERT(cl_ntoh16(p_si->lin_cap) <= IB_LID_UCAST_END_HO + 1);
>> Maybe there should be run-time check (not sure since lin_cap is not
>> really used in other places in the code), but not assertion - any bogus
>> data received from network should not crash OpenSM. I'm removing this.
>
> Do we care that the lin_cap of the switch claims to support more
> than IB_LID_UCAST_END_HO? Don't think so, so I agree - removing this.

It means a buggy switch; we can drop a warning at run-time, but this is
not a reason for OpenSM to crash with CL_ASSERT().
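A run-time check of the kind discussed here might look roughly like the sketch below. It is only an illustration: clamp_lin_cap() and LID_UCAST_END_HO are hypothetical names that do not exist in OpenSM, the value 0xBFFF is assumed to match IB_LID_UCAST_END_HO, and the warning goes to stderr rather than through the OpenSM log.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed value of the last unicast LID in host order. */
#define LID_UCAST_END_HO 0xBFFF

/*
 * Hypothetical helper: take the capacity reported by a switch (already
 * converted to host order) and return a usable table size. A bogus value
 * from the network produces a warning and is clamped; it never aborts.
 */
static uint32_t clamp_lin_cap(uint16_t lin_cap_ho)
{
	/* The reported capacity includes LID 0, hence the + 1. */
	uint32_t max_entries = (uint32_t)LID_UCAST_END_HO + 1;

	if (lin_cap_ho > max_entries) {
		fprintf(stderr,
			"warning: switch reports LinearFDBCap 0x%X, clamping to 0x%X\n",
			(unsigned)lin_cap_ho, (unsigned)max_entries);
		return max_entries;
	}
	return lin_cap_ho;
}
```

Clamping is just one possible reaction; logging the warning and otherwise ignoring the value would fit the discussion equally well.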
>>> +
>>> + p_sw->lft = malloc(IB_LID_UCAST_END_HO + 1);
>>> + if (!p_sw->lft) {
>>> + status = IB_INSUFFICIENT_MEMORY;
>>> + goto Exit;
>>> + }
>>> +
>>> + /* Initialize the table to OSM_NO_PATH, which is "invalid port" */
>>> + memset(p_sw->lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1);
>>>
>>> p_sw->lft_buf = malloc(IB_LID_UCAST_END_HO + 1);
>>> if (!p_sw->lft_buf) {
>>> @@ -138,7 +155,8 @@ void osm_switch_delete(IN OUT osm_switch_t ** const
>>> pp_sw)
>>>
>>> osm_mcast_tbl_destroy(&p_sw->mcast_tbl);
>>> free(p_sw->p_prof);
>>> - osm_fwd_tbl_destroy(&p_sw->fwd_tbl);
>>> + if (p_sw->lft)
>>> + free(p_sw->lft);
>>> if (p_sw->lft_buf)
>>> free(p_sw->lft_buf);
>>> if (p_sw->hops) {
>>> @@ -176,44 +194,36 @@ osm_switch_t *osm_switch_new(IN osm_node_t * const
>>> p_node,
>>> /**********************************************************************
>>> **********************************************************************/
>>> boolean_t
>>> -osm_switch_get_fwd_tbl_block(IN const osm_switch_t * const p_sw,
>>> - IN const uint32_t block_id,
>>> - OUT uint8_t * const p_block)
>>> +osm_switch_get_lft_block(IN const osm_switch_t * const p_sw,
>>> + IN const uint32_t block_id,
>>> + OUT uint8_t * const p_block)
>>> {
>>> uint16_t base_lid_ho;
>>> - uint16_t max_lid_ho;
>>> - uint16_t lid_ho;
>>> uint16_t block_top_lid_ho;
>>> - uint32_t lids_per_block;
>>> - osm_fwd_tbl_t *p_tbl;
>>> boolean_t return_flag = FALSE;
>>>
>>> CL_ASSERT(p_sw);
>>> CL_ASSERT(p_block);
>>>
>>> - p_tbl = osm_switch_get_fwd_tbl_ptr(p_sw);
>>> - max_lid_ho = p_sw->max_lid_ho;
>>> - lids_per_block = osm_fwd_tbl_get_lids_per_block(&p_sw->fwd_tbl);
>>> - base_lid_ho = (uint16_t) (block_id * lids_per_block);
>>> + base_lid_ho = (uint16_t) (block_id * IB_SMP_DATA_SIZE);
>>>
>>> - if (base_lid_ho <= max_lid_ho) {
>>> + if (base_lid_ho <= p_sw->max_lid_ho) {
>>> /* Initialize LIDs in block to invalid port number. */
>>> memset(p_block, OSM_NO_PATH, IB_SMP_DATA_SIZE);
>>> /*
>>> Determine the range of LIDs we can return with this block.
>>> */
>>> block_top_lid_ho =
>>> - (uint16_t) (base_lid_ho + lids_per_block - 1);
>>> - if (block_top_lid_ho > max_lid_ho)
>>> - block_top_lid_ho = max_lid_ho;
>>> + (uint16_t) (base_lid_ho + IB_SMP_DATA_SIZE - 1);
>>> + if (block_top_lid_ho > p_sw->max_lid_ho)
>>> + block_top_lid_ho = p_sw->max_lid_ho;
>>>
>>> /*
>>> Configure the forwarding table with the routing
>>> information for the specified block of LIDs.
>>> */
>>> - for (lid_ho = base_lid_ho; lid_ho <= block_top_lid_ho; lid_ho++)
>>> - p_block[lid_ho - base_lid_ho] =
>>> - osm_fwd_tbl_get(p_tbl, lid_ho);
>>> + memcpy(p_block, &(p_sw->lft[base_lid_ho]),
>>> + block_top_lid_ho - base_lid_ho + 1);
>> Hmm, why not just
>> memcpy(p_block, &p_sw->lft[base_lid_ho], 64);
>> ? And then no need initial memset()?
>
> Well, I can really simplify this whole function to
> something like this:
>
> boolean_t
> osm_switch_get_lft_block(IN const osm_switch_t * const p_sw,
> IN const uint16_t block_id,
> OUT uint8_t * const p_block)
> {
> uint16_t base_lid_ho = block_id * IB_SMP_DATA_SIZE;
> CL_ASSERT(p_sw);
> CL_ASSERT(p_block);
> if (base_lid_ho > p_sw->max_lid_ho)
> return FALSE;
> memcpy(p_block, &(p_sw->lft[base_lid_ho]), IB_SMP_DATA_SIZE);
> return TRUE;
> }

Good.

> Patch shortly.

Thanks.

Sasha

From tziporet at dev.mellanox.co.il Wed Oct 29 06:47:02 2008
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Wed, 29 Oct 2008 15:47:02 +0200
Subject: [ofa-general] Re: [PATCH 2/2 V2] ofed/docs/README.txt: fixing opensm daemon and cfg file names
In-Reply-To: <48FF4EF1.1020502@dev.mellanox.co.il>
References: <48FF4EF1.1020502@dev.mellanox.co.il>
Message-ID: <49086956.4070801@mellanox.co.il>

Yevgeny Kliteynik wrote:
> Tziporet,
>
> Fixing opensm daemon name from 'opensm' to 'opensmd'
> and configuration file from old '/etc/sysconfig/opensm'
> to new '/etc/opensm/opensm.conf'.
>
> Please apply OFED 1.4 docs.
>
>
applied
Tziporet

From tziporet at dev.mellanox.co.il Wed Oct 29 06:47:40 2008
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Wed, 29 Oct 2008 15:47:40 +0200
Subject: [ofa-general] Re: [PATCH 1/2] ofed/docs/README.txt: fixing white space mess
In-Reply-To: <48FF465A.4080708@dev.mellanox.co.il>
References: <48FF465A.4080708@dev.mellanox.co.il>
Message-ID: <4908697C.9010809@mellanox.co.il>

Yevgeny Kliteynik wrote:
> Hi Tziporet,
>
> Fixing some white space mess in README.txt: removed trailing
> blanks, fixed mixed usage of tabs and spaces, removed empty
> lines at the end of the file.
>
>
Applied
Tziporet

From hal.rosenstock at gmail.com Wed Oct 29 06:50:34 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 29 Oct 2008 09:50:34 -0400
Subject: [ofa-general] Re: [PATCH] opensm: notify master SM with trap 144 (not finished)
In-Reply-To:
References: <20081025175201.GP28713@sashak.voltaire.com> <20081025200127.GR28713@sashak.voltaire.com>
Message-ID:

Sasha,

On Tue, Oct 28, 2008 at 4:27 PM, Hal Rosenstock wrote:
> Sasha,
>
> On Sat, Oct 25, 2008 at 4:01 PM, Sasha Khapyorsky wrote:
>>
>> When entering standby state (after discovery) notify master SM about us.
>> In case when SMA doesn't support trap sending (specifically trap 144 on
>> PortInfo:CapabilityMask change - isSM bit, example is current ConnectX
>> firmware - 2.5.0) this is only way to notify the current master SM that
>> another SM is running.
>
> So is the trap sent unconditionally (since there's no way of knowing
> whether the SMA supports this or not) ? Is the only downside the extra
> Trap/TrapRepress when the SMA does support this ?
>
> Seems to me that the right fix is to the Connect-X SMA.
>
> Also, what happens once the Connect-X SMA is fixed ? Does this code persist ?

One approach might be to conditionalize this trap when Connect-X and
ultimately on the firmware versions which don't support this.
Another would be to add an option which defaults to off and needs to be set
for Connect-X manually.

-- Hal

>
> -- Hal
>
>> See also bug#1183.
>>
>> Signed-off-by: Sasha Khapyorsky
>> ---
>> opensm/opensm/osm_state_mgr.c | 2 ++
>> 1 files changed, 2 insertions(+), 0 deletions(-)
>>
>> diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
>> index 174cee6..1576c42 100644
>> --- a/opensm/opensm/osm_state_mgr.c
>> +++ b/opensm/opensm/osm_state_mgr.c
>> @@ -1142,6 +1142,8 @@ _repeat_discovery:
>> OSM_SM_SIGNAL_MASTER_OR_HIGHER_SM_DETECTED_DONE);
>> osm_log_msg_box(sm->p_log, OSM_LOG_VERBOSE, __FUNCTION__,
>> "ENTERING STANDBY STATE");
>> + /* notify master SM about us */
>> + osm_send_trap144(sm, 0);
>> return;
>> }
>>
>> --
>> 1.6.0.3.517.g759a
>>
>>
>

From sashak at voltaire.com Wed Oct 29 06:52:04 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 29 Oct 2008 15:52:04 +0200
Subject: [ofa-general] Re: [PATCH] opensm: notify master SM with trap 144 (not finished)
In-Reply-To:
References: <20081025175201.GP28713@sashak.voltaire.com> <20081025200127.GR28713@sashak.voltaire.com>
Message-ID: <20081029135204.GC15321@sashak.voltaire.com>

Hi Hal,

On 16:27 Tue 28 Oct , Hal Rosenstock wrote:
> On Sat, Oct 25, 2008 at 4:01 PM, Sasha Khapyorsky wrote:
> >
> > When entering standby state (after discovery) notify master SM about us.
> > In case when SMA doesn't support trap sending (specifically trap 144 on
> > PortInfo:CapabilityMask change - isSM bit, example is current ConnectX
> > firmware - 2.5.0) this is only way to notify the current master SM that
> > another SM is running.
>
> So is the trap sent unconditionally (since there's no way of knowing
> whether the SMA supports this or not) ? Is the only downside the extra
> Trap/TrapRepress when the SMA does support this ?

It is not unconditional.
There is such code at beginning of osm_send_trap144(): /* don't bother with sending trap when SMA supports this */ if (!local && pi->capability_mask&(IB_PORT_CAP_HAS_TRAP|IB_PORT_CAP_HAS_CAP_NTC)) return 0; > Seems to me that the right fix is to the Connect-X SMA. Agree. But it is not there yet. > Also, what happens once the Connect-X SMA is fixed ? Does this code persist ? Then osm_send_trap144(..., 0) will do nothing following PortInfo:CapabilityMask. And actually if the problem will become obsolete we can remove this call safely. Sasha From sashak at voltaire.com Wed Oct 29 06:56:27 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 29 Oct 2008 15:56:27 +0200 Subject: [ofa-general] Re: [PATCH] opensm: osm_send_trap144() function In-Reply-To: References: <20081025175201.GP28713@sashak.voltaire.com> Message-ID: <20081029135627.GD15321@sashak.voltaire.com> On 16:26 Tue 28 Oct , Hal Rosenstock wrote: > On Sat, Oct 25, 2008 at 1:52 PM, Sasha Khapyorsky wrote: > > > > Add ability to send trap 144 - osm_send_trap144() function. This can be > > useful when SMA doesn't support trap sending on some events, such as > > CapabilityMask change (ConnectX), OtherLocalChanges (no one supports > > this AFAIK). > > What component beside the SMA would send the ones mentioned above ? > Also, how would it know whether or not to do this ? I hope I answered this in previous email, anyway this is the code: > > + /* don't bother with sending trap when SMA supports this */ > > + if (!local && > > + pi->capability_mask&(IB_PORT_CAP_HAS_TRAP|IB_PORT_CAP_HAS_CAP_NTC)) > > + return 0; > > + That should work fine for CapabilityMask change. And I have no a good idea about how to track capability of OtherLocalChanges trap sending. 
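For readers following the thread, the gating logic quoted above can be sketched in isolation. This is only a minimal userspace illustration, not the OpenSM code: the `EX_` names and the host-byte-order bit positions are assumptions made for the example (the real definitions live in the OpenSM headers), and the second function shows the narrower HAS_CAP_NTC-only check that was raised as an alternative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions in host byte order, for illustration only. */
#define EX_CAP_HAS_TRAP    (UINT32_C(1) << 3)
#define EX_CAP_HAS_CAP_NTC (UINT32_C(1) << 22)

/*
 * Variant quoted in the thread: when the call is not "local" and the SMA
 * advertises either capability bit, the SM skips sending trap 144 itself.
 */
static bool ex_sm_should_send_trap144(uint32_t capability_mask, bool local)
{
	if (!local &&
	    (capability_mask & (EX_CAP_HAS_TRAP | EX_CAP_HAS_CAP_NTC)))
		return false;
	return true;
}

/* Narrower variant: only the CapabilityMask-notice bit suppresses the trap. */
static bool ex_sm_should_send_trap144_ntc_only(uint32_t capability_mask,
					       bool local)
{
	if (!local && (capability_mask & EX_CAP_HAS_CAP_NTC))
		return false;
	return true;
}
```

With an SMA that sets neither bit, both variants send the trap; with an SMA that sets only the trap bit, the first variant stays quiet while the second still sends.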
Sasha From sashak at voltaire.com Wed Oct 29 07:04:00 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 29 Oct 2008 16:04:00 +0200 Subject: [ofa-general] Re: [PATCH] opensm: notify master SM with trap 144 (not finished) In-Reply-To: References: <20081025175201.GP28713@sashak.voltaire.com> <20081025200127.GR28713@sashak.voltaire.com> Message-ID: <20081029140400.GF15321@sashak.voltaire.com> On 09:50 Wed 29 Oct , Hal Rosenstock wrote: > Sasha, > > On Tue, Oct 28, 2008 at 4:27 PM, Hal Rosenstock > wrote: > > Sasha, > > > > On Sat, Oct 25, 2008 at 4:01 PM, Sasha Khapyorsky wrote: > >> > >> When entering standby state (after discovery) notify master SM about us. > >> In case when SMA doesn't support trap sending (specifically trap 144 on > >> PortInfo:CapabilityMask change - isSM bit, example is current ConnectX > >> firmware - 2.5.0) this is only way to notify the current master SM that > >> another SM is running. > > > > So is the trap sent unconditionally (since there's no way of knowing > > whether the SMA supports this or not) ? Is the only downside the extra > > Trap/TrapRepress when the SMA does support this ? > > > > Seems to me that the right fix is to the Connect-X SMA. > > > > Also, what happens once the Connect-X SMA is fixed ? Does this code persist ? > > One approach might be to conditionalize this trap when Connect-X and > ultimately on the firmware versions which don't support this. Another > would be to add an option which defaults to off and needs to be set > for Connect-X manually. Does not CapabilityMask checking do the same (but w/out FW version tricks)? 
Sasha From hal.rosenstock at gmail.com Wed Oct 29 07:03:17 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 29 Oct 2008 10:03:17 -0400 Subject: [ofa-general] Re: [PATCH] opensm: notify master SM with trap 144 (not finished) In-Reply-To: <20081029135204.GC15321@sashak.voltaire.com> References: <20081025175201.GP28713@sashak.voltaire.com> <20081025200127.GR28713@sashak.voltaire.com> <20081029135204.GC15321@sashak.voltaire.com> Message-ID: Sasha, On Wed, Oct 29, 2008 at 9:52 AM, Sasha Khapyorsky wrote: > Hi Hal, > > On 16:27 Tue 28 Oct , Hal Rosenstock wrote: >> On Sat, Oct 25, 2008 at 4:01 PM, Sasha Khapyorsky wrote: >> > >> > When entering standby state (after discovery) notify master SM about us. >> > In case when SMA doesn't support trap sending (specifically trap 144 on >> > PortInfo:CapabilityMask change - isSM bit, example is current ConnectX >> > firmware - 2.5.0) this is only way to notify the current master SM that >> > another SM is running. >> >> So is the trap sent unconditionally (since there's no way of knowing >> whether the SMA supports this or not) ? Is the only downside the extra >> Trap/TrapRepress when the SMA does support this ? > > It is not unconditional. There is such code at beginning of > osm_send_trap144(): > > /* don't bother with sending trap when SMA supports this */ > if (!local && > pi->capability_mask&(IB_PORT_CAP_HAS_TRAP|IB_PORT_CAP_HAS_CAP_NTC)) > return 0; Oh, I see: those bits are not on in PortInfo:CapabilityMask in the C-X SMA. Should that just be checked against HAS_CAP_NTC as there might be other traps supported ? -- Hal >> Seems to me that the right fix is to the Connect-X SMA. > > Agree. But it is not there yet. > >> Also, what happens once the Connect-X SMA is fixed ? Does this code persist ? > > Then osm_send_trap144(..., 0) will do nothing following > PortInfo:CapabilityMask. And actually if the problem will become > obsolete we can remove this call safely. 
> > Sasha > From sashak at voltaire.com Wed Oct 29 07:09:41 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 29 Oct 2008 16:09:41 +0200 Subject: [ofa-general] Re: [PATCH] opensm: notify master SM with trap 144 (not finished) In-Reply-To: References: <20081025175201.GP28713@sashak.voltaire.com> <20081025200127.GR28713@sashak.voltaire.com> <20081029135204.GC15321@sashak.voltaire.com> Message-ID: <20081029140941.GH15321@sashak.voltaire.com> On 10:03 Wed 29 Oct , Hal Rosenstock wrote: > > > > It is not unconditional. There is such code at beginning of > > osm_send_trap144(): > > > > /* don't bother with sending trap when SMA supports this */ > > if (!local && > > pi->capability_mask&(IB_PORT_CAP_HAS_TRAP|IB_PORT_CAP_HAS_CAP_NTC)) > > return 0; > > Oh, I see: those bits are not on in PortInfo:CapabilityMask in the C-X SMA. > > Should that just be checked against HAS_CAP_NTC as there might be > other traps supported ? Basically yes, HAS_CAP_NTC checking could be enough (assuming we will not have some SMA bugs). Sasha From sashak at voltaire.com Wed Oct 29 07:19:51 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 29 Oct 2008 16:19:51 +0200 Subject: [ofa-general] opensm as service - cfg files In-Reply-To: <1225127404.1197.458.camel@cardanus.llnl.gov> References: <48FF22FC.6000606@dev.mellanox.co.il> <490073C0.70109@cea.fr> <1224786733.1197.398.camel@cardanus.llnl.gov> <49058C81.6000007@cea.fr> <1225127404.1197.458.camel@cardanus.llnl.gov> Message-ID: <20081029141951.GI15321@sashak.voltaire.com> On 10:10 Mon 27 Oct , Al Chu wrote: > > init scripts generally execute/source some configuration file located > > in /etc/sysconfig/ to set some variables used in the script. These > > variables can be used to distinguish pid filename and log filename for > > different opensm instances. 
If these variables are not defined in the > > conf file, they should be built from the parameter value e.g : > > opensm.log.ddn12 or opensm.pid.ddn12 > > My point was should the script automatically handle this, or is it the > user's responsibility to set everything up? As Ira mentioned in a later > post, the console port is supposed to be at a known port value so users > know what port to connect to. So is it wise for the script to > auto-magically select different port values for different opensm > instances? Personally I don't think so. > > I was initially thinking the init script could take command line > arguments that could be passed directly to the init.d scripts. So for > example, you can say: > > service opensmd start "--config ddn.conf" > service opensmd start "--config lsi.conf" > > This puts alternate log file names and console port numbers into the > responsibility of the user. Why not just copy /etc/init.d/opensmd to, let's say, /etc/init.d/opensmd2, and slightly edit it (use a different config file, etc.)? Sasha From chien.tin.tung at intel.com Wed Oct 29 07:40:31 2008 From: chien.tin.tung at intel.com (Tung, Chien Tin) Date: Wed, 29 Oct 2008 07:40:31 -0700 Subject: [ofa-general] RE: [PATCH v2] RDMA/nes: Mitigate compatibility issue regarding PCI write credits In-Reply-To: References: <20081029002853.GA3212@ctung-MOBL> Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA383030EF62125@azsmsx501.amr.corp.intel.com> > > +module_param(limit_maxrdreqsz, int, 0644); > >type can be bool instead of int here? Yes, it should be bool. > > + if ((limit_maxrdreqsz) || > > + ((nesdev->nesadapter->phy_type[0] == >NES_PHY_TYPE_GLADIUS) && > > + (hw_rev == NE020_REV1))) { > > + nes_debug(NES_DBG_INIT, > >This indentation is hard to read, because the then clause visually runs >into the condition being tested. I generally align the follow-on lines >to be just inside the opening ( of "if (". And there's no >reason to put >parentheses around limit_maxrdreqsz...
I normally don't indent that way either but CodingStyle doc said I can't use spaces to indent... "Outside of comments, documentation and except in Kconfig, spaces are never used for indentation, and the above example is deliberately broken." > > + pci_read_config_word(pcidev, 0x68, &maxrdreqword); > > + /* set bits 12-14 to 001b = 256 bytes */ > > + maxrdreqword &= 0x8fff; > > + maxrdreqword |= 0x1000; > > + pci_write_config_word(pcidev, 0x68, maxrdreqword); > >I would write this as below, using the standard pcie >interfaces and also >being defensive so as not to set the max read req to 256 if the >BIOS/kernel had limited it to 128 already: > > if (pcie_get_readrq(pcidev) > 256) > if (pcie_set_readrq(pcidev, 256)) { > /* report error */ > } Thanks for the change. Want a v3 for this patch? Chien From chien.tin.tung at intel.com Wed Oct 29 07:41:49 2008 From: chien.tin.tung at intel.com (Tung, Chien Tin) Date: Wed, 29 Oct 2008 07:41:49 -0700 Subject: [ofa-general] RE: [PATCH 1/2] RDMA/nes: Correct handling of PBL resources In-Reply-To: References: <20081028213504.GA6296@ctung-MOBL> Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA383030EF6212B@azsmsx501.amr.corp.intel.com> > > + if (nesfmr->nesmr.pbls_used != 0) { > > + spin_lock_irqsave(&nesadapter->pbl_lock, flags); > > + if (nesfmr->nesmr.pbl_4k) { > > + nesadapter->free_4kpbl += >nesfmr->nesmr.pbls_used; > > + BUG_ON(nesadapter->free_4kpbl > >nesadapter->max_4kpbl); > > + } else { > > + nesadapter->free_256pbl += >nesfmr->nesmr.pbls_used; > > + BUG_ON(nesadapter->free_256pbl > >nesadapter->max_256pbl); > > + } > > + spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); > >Can we make these WARN_ON instead of BUG_ON? Killing the machine just >because of a nes driver bug is kind of rude, and it reduces the chance >of actually getting the debug output. In these two cases we can change them. I will review nes driver on BUG_ON usage. 
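The BUG_ON versus WARN_ON point can be illustrated outside the kernel. The following is a rough userspace sketch, assuming a simplified pool structure: `ex_pbl_pool` and `ex_pbl_release` are hypothetical names, the locking done with pbl_lock in the patch is omitted, and a message on stderr stands in for WARN_ON. The idea is that an accounting overflow is reported and execution continues, so the debug output is not lost.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for the adapter's PBL counters. */
struct ex_pbl_pool {
	unsigned int free_4kpbl;
	unsigned int max_4kpbl;
	unsigned int free_256pbl;
	unsigned int max_256pbl;
};

/*
 * Return 'used' PBLs to the matching pool. If the free count ends up above
 * the maximum, report it and carry on instead of aborting.
 * Returns true when the counters are still consistent.
 */
static bool ex_pbl_release(struct ex_pbl_pool *pool, bool is_4k,
			   unsigned int used)
{
	unsigned int *free_cnt = is_4k ? &pool->free_4kpbl : &pool->free_256pbl;
	unsigned int max_cnt = is_4k ? pool->max_4kpbl : pool->max_256pbl;

	if (used == 0)
		return true;

	*free_cnt += used;
	if (*free_cnt > max_cnt) {
		fprintf(stderr, "warning: free PBL count %u exceeds max %u\n",
			*free_cnt, max_cnt);
		return false;
	}
	return true;
}
```

A caller could log the failure and keep the driver running, which is roughly the behavior being asked for in the review comment.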
Chien From John.Marshall at ec.gc.ca Wed Oct 29 07:44:47 2008 From: John.Marshall at ec.gc.ca (John Marshall) Date: Wed, 29 Oct 2008 14:44:47 +0000 Subject: [ofa-general] OOM problem with ib_ipoib? In-Reply-To: References: <48FF6DFA.9080409@ec.gc.ca> <48FFA62D.3030305@ec.gc.ca> <490083D0.5000807@ec.gc.ca> Message-ID: <490876DF.2020705@ec.gc.ca> Roland Dreier wrote: > > MemTotal: 33274492 kB > ... > > LowTotal: 638684 kB > > It looks as if you have a box with 32G of RAM running a 32-bit kernel, > which means low (direct kernel-mapped) memory is extremely tight. IPoIB > connected mode ties up a significant amount of memory in the receive > queue -- perhaps around 64M, which is 10% of low memory for you. So > loading IPoIB may push you past the tipping point where things really > break easily. > The curious thing is that the OOM occurs even when the ib interfaces are _not even UP_, although the ib_ipoib module is loaded. So, I cannot see how it can be an allocation issue in such a case related to usage. Am I missing something here? As well, shouldn't the OS handle this transparently via the pdflush which will write out the data and free up memory? Or does the pdflush not distinguish between total memory and low memory so that a problem occurs (yet the OOM happens even when the interfaces are not UP!)? > I'm not surprised that you run into memory management problems with such > a system -- 32-bit kernels really have a hard time coping with such an > imbalance between total memory and low memory. The simplest solution > would probably be to switch to a 64-bit kernel -- note that you don't > have to change any userspace, just use a 64-bit kernel. > I will give it a shot.
Thanks, John From chien.tin.tung at intel.com Wed Oct 29 08:09:11 2008 From: chien.tin.tung at intel.com (Tung, Chien Tin) Date: Wed, 29 Oct 2008 08:09:11 -0700 Subject: [ofa-general] RE: [PATCH 2/2] RDMA/nes: Change CQ allocation scheme for performance applications In-Reply-To: References: <20081028213507.GA5680@ctung-MOBL> Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA383030EF621CD@azsmsx501.amr.corp.intel.com> >So this is an enhancement, or a fix? It is a fix. Before this patch, a MultiCast Receive Queue(MCRQ) application would use the wrong CQ for the NIC. This patch fixes the mapping. Chien From philippe.gregoire at cea.fr Wed Oct 29 08:12:05 2008 From: philippe.gregoire at cea.fr (Philippe Gregoire) Date: Wed, 29 Oct 2008 16:12:05 +0100 Subject: [ofa-general] opensm as service - cfg files In-Reply-To: <20081029141951.GI15321@sashak.voltaire.com> References: <48FF22FC.6000606@dev.mellanox.co.il> <490073C0.70109@cea.fr> <1224786733.1197.398.camel@cardanus.llnl.gov> <49058C81.6000007@cea.fr> <1225127404.1197.458.camel@cardanus.llnl.gov> <20081029141951.GI15321@sashak.voltaire.com> Message-ID: <49087D45.2090202@cea.fr> Sasha Khapyorsky a écrit : > On 10:10 Mon 27 Oct , Al Chu wrote: > >>> init scripts generally execute/source some configuration file located >>> in /etc/sysconfig/ to set some variables used in the script. These >>> variables can be used to distinguish pid filename and log filename for >>> different opensm instances. If these variables are not defined in the >>> conf file, they should be build from the parameter value e.g : >>> opensm.log.ddn12 or opensm.pid.ddn12 >>> >> My point was should the script automatically handle this, or is it the >> user's responsibility to set everything up? As Ira mentioned in a later >> post, the console port is supposed to be at a known port value so users >> know what port to connect to. So is it wise for the script to auto- >> magically select different different port values for different opensm >> instances? 
Personally I don't think so. >> >> I was initially thinking the init script could take command line >> arguments that could be passed directly to the init.d scripts. So for >> example, you can say: >> >> service opensmd start "--config ddn.conf" >> service opensmd start "--config lsi.conf" >> >> This puts alternate log file names and console port numbers into the >> responsibility of the user. >> > > Why to not just copy /etc/init.d/opensmd to let's say > /etc/init.d/opensmd2? Sligthly edit this (add different config file, > etc.). > > > Sasha > > Sasha, it is just what I want to avoid . Depending on how the initial script is written, it will require big modifications to execute without undesired interactions (think about stopping one opensmd daemon ) Why not using a directory /etc/opensm.d , put all the config files in this directory and let the script /etc/init.d/opensmd starts one daemon for each file if it is executed without argument and only for the configuration file if tis given as an argument. And let the system admin provide the good configuration file for different instances. If the script manages different pid files , you will be able to do: service opensmd start service opensmd stop to start and stop all the instances or service opensmd start ddn.conf service opensmd stop ddn.conf to start /stop the ddn instance using /etc/opensmd.d/ddn.conf configuration file From kliteyn at dev.mellanox.co.il Wed Oct 29 08:19:22 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Wed, 29 Oct 2008 17:19:22 +0200 Subject: [ofa-general] [opensm] remove qos_max_vls config?? 
In-Reply-To: References: <1225237965.3358.9.camel@whatsup> <4908292A.40004@dev.mellanox.co.il> Message-ID: <49087EFA.4040401@dev.mellanox.co.il> Hal Rosenstock wrote: > On Wed, Oct 29, 2008 at 5:13 AM, Yevgeny Kliteynik > wrote: >> Al Chu wrote: >>> Hey Sasha, >>> >>> I was working on a different bug fix on the qos config parsing, when I >>> noticed the qos_*max_vls fields aren't used anywhere. They seem to be >>> parsed from the config, stored, and never used. Maybe it used to be >>> what 'max_op_vls' is now used for? >> I guess that the initial idea was to have an option to configure >> different operational VLs on different type of nodes in the subnet. >> The question is, does having such option make sense? > > Does it impact buffering ? If so, in those cases it would be worth > configuring (assuming it gets acted on elsewhere). Right, it does impact buffering. I think that OpenSM always sets the same op_vls on both sides of the link (if there is a mismatch, SM will set the lowest value), so we can have different num. of VLs on switch-2-switch links and CA-2-switch links. Not sure how much value does this ability add, but perhaps we need to implement this configuration instead of removing the parameters... -- Yevgeny > -- Hal > >> -- Yevgeny >> >>> If there's still a purpose for it in the future, obviously no issue on >>> leaving in there. Patch is attached to remove it everywhere I found it. 
>>> >>> Al >>> >>> >>> >>> ------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit >>> http://openib.org/mailman/listinfo/openib-general >> > From sashak at voltaire.com Wed Oct 29 08:57:08 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 29 Oct 2008 17:57:08 +0200 Subject: [ofa-general] opensm as service - cfg files In-Reply-To: <49087D45.2090202@cea.fr> References: <48FF22FC.6000606@dev.mellanox.co.il> <490073C0.70109@cea.fr> <1224786733.1197.398.camel@cardanus.llnl.gov> <49058C81.6000007@cea.fr> <1225127404.1197.458.camel@cardanus.llnl.gov> <20081029141951.GI15321@sashak.voltaire.com> <49087D45.2090202@cea.fr> Message-ID: <20081029155708.GR15321@sashak.voltaire.com> On 16:12 Wed 29 Oct , Philippe Gregoire wrote: > Sasha Khapyorsky a écrit : >> On 10:10 Mon 27 Oct , Al Chu wrote: >> >>>> init scripts generally execute/source some configuration file located >>>> in /etc/sysconfig/ to set some variables used in the script. These >>>> variables can be used to distinguish pid filename and log filename for >>>> different opensm instances. If these variables are not defined in the >>>> conf file, they should be build from the parameter value e.g : >>>> opensm.log.ddn12 or opensm.pid.ddn12 >>> My point was should the script automatically handle this, or is it the >>> user's responsibility to set everything up? As Ira mentioned in a later >>> post, the console port is supposed to be at a known port value so users >>> know what port to connect to.
So is it wise for the script to auto- >>> magically select different different port values for different opensm >>> instances? Personally I don't think so. >>> >>> I was initially thinking the init script could take command line >>> arguments that could be passed directly to the init.d scripts. So for >>> example, you can say: >>> >>> service opensmd start "--config ddn.conf" >>> service opensmd start "--config lsi.conf" >>> >>> This puts alternate log file names and console port numbers into the >>> responsibility of the user. >>> >> >> Why to not just copy /etc/init.d/opensmd to let's say >> /etc/init.d/opensmd2? Sligthly edit this (add different config file, >> etc.). >> >> Sasha >> >> > > Sasha, it is just what I want to avoid . > Depending on how the initial script is written, it will require big > modifications to execute without undesired interactions > (think about stopping one opensmd daemon ) Yes, you will need to change related places in the second script. I don't think it is massive modifications. > Why not using a directory /etc/opensm.d , put all the config files in this > directory and let the script /etc/init.d/opensmd Why not to use just separate config file (/etc/opensm/opensm-2.conf, or so)? I don't feel that two OpenSM instances is a common case, anyway we don't need to over-complicate the default (single instance) usage. Sasha > starts one daemon for each file if it is executed without argument and > only for the configuration file if tis given as an argument. > And let the system admin provide the good configuration file for different > instances. 
> If the script manages different pid files , you will be able to do: > service opensmd start > service opensmd stop > to start and stop all the instances > or > service opensmd start ddn.conf > service opensmd stop ddn.conf > to start /stop the ddn instance using /etc/opensmd.d/ddn.conf configuration > file > > > From chu11 at llnl.gov Wed Oct 29 09:29:13 2008 From: chu11 at llnl.gov (Al Chu) Date: Wed, 29 Oct 2008 09:29:13 -0700 Subject: [ofa-general] [opensm] remove qos_max_vls config?? In-Reply-To: <49087EFA.4040401@dev.mellanox.co.il> References: <1225237965.3358.9.camel@whatsup> <4908292A.40004@dev.mellanox.co.il> <49087EFA.4040401@dev.mellanox.co.il> Message-ID: <1225297753.1197.493.camel@cardanus.llnl.gov> Hey Hal, Yevgeny, On Wed, 2008-10-29 at 17:19 +0200, Yevgeny Kliteynik wrote: > Hal Rosenstock wrote: > > On Wed, Oct 29, 2008 at 5:13 AM, Yevgeny Kliteynik > > wrote: > >> Al Chu wrote: > >>> Hey Sasha, > >>> > >>> I was working on a different bug fix on the qos config parsing, when I > >>> noticed the qos_*max_vls fields aren't used anywhere. They seem to be > >>> parsed from the config, stored, and never used. Maybe it used to be > >>> what 'max_op_vls' is now used for? > >> I guess that the initial idea was to have an option to configure > >> different operational VLs on different type of nodes in the subnet. > >> The question is, does having such option make sense? > > > > Does it impact buffering ? If so, in those cases it would be worth > > configuring (assuming it gets acted on elsewhere). > > Right, it does impact buffering. > I think that OpenSM always sets the same op_vls on both sides of > the link (if there is a mismatch, SM will set the lowest value), > so we can have different num. of VLs on switch-2-switch links > and CA-2-switch links. > Not sure how much value does this ability add, but perhaps we need > to implement this configuration instead of removing the parameters... Implementing it would be fine instead of removing its parameters. 
But I think documenting its behavior/existence and opensm not performing the behavior is worse. If we're not going to implement it soon (I don't mind putting it on my todo for later), perhaps we should at minimum comment it out of the documentation/code for the time being? Would the QoS max_vls override the max_op_vls? Thinking about it a bit, wouldn't a max_op_vls_ca, max_op_vls_swe, etc. parameters make more sense than the current ones? Al > -- Yevgeny > > > -- Hal > > > >> -- Yevgeny > >> > >>> If there's still a purpose for it in the future, obviously no issue on > >>> leaving in there. Patch is attached to remove it everywhere I found it. > >>> > >>> Al > >>> > >>> > >>> > >>> ------------------------------------------------------------------------ > >>> > >>> _______________________________________________ > >>> general mailing list > >>> general at lists.openfabrics.org > >>> http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >>> > >>> To unsubscribe, please visit > >>> http:// openib.org/mailman/listinfo/openib-general > >> _______________________________________________ > >> general mailing list > >> general at lists.openfabrics.org > >> http:// lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >> > >> To unsubscribe, please visit > >> http:// openib.org/mailman/listinfo/openib-general > >> > > > > -- Albert Chu chu11 at llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory From pradeeps at linux.vnet.ibm.com Wed Oct 29 10:39:33 2008 From: pradeeps at linux.vnet.ibm.com (Pradeep Satyanarayana) Date: Wed, 29 Oct 2008 10:39:33 -0700 Subject: [ofa-general] Re: [ewg] Bonding fail over not working In-Reply-To: <1225244575.12073.241.camel@sarium.pathscale.com> References: <48FD14DA.6000007@linux.vnet.ibm.com> <1225244575.12073.241.camel@sarium.pathscale.com> Message-ID: <49089FD5.8040905@linux.vnet.ibm.com> Betsy Zeller wrote: > Pradeep: > > I don't believe this is a known issue. 
There is no bug open against it > in http://www.openfabrics.org/bugzilla that I can see. Can you open a > bug on this in the Open Fabrics bug system, if you are able to reproduce > it on OFED 1.4 RC3, or a recent nightly build? If you do open a bug, > please mark it critical, as we need to make an explicit decision about > this. > > - Betsy Betsy, You might have missed my response to Or Gerlitz's note. Bonding fail over is indeed working on the main line kernel. The issue was that some one had accidentally dislodged the cable in the lab from one of the ports right between my tests. Hence, I came to the incorrect conclusion that bonding was not working. Sorry about the false report. I have not yet had a chance to test with OFED 1.4 RC3 as yet. Pradeep From ricklist at microway.com Wed Oct 29 11:09:18 2008 From: ricklist at microway.com (Rick Warner) Date: Wed, 29 Oct 2008 14:09:18 -0400 Subject: [ofa-general] poll CQ failed -2 with connectX In-Reply-To: <20081029073922.GA14691@mtls03> References: <200810271838.48510.ricklist@microway.com> <200810281639.03422.ricklist@microway.com> <20081029073922.GA14691@mtls03> Message-ID: <200810291409.18842.ricklist@microway.com> On Wednesday 29 October 2008, Eli Cohen wrote: > On Tue, Oct 28, 2008 at 04:39:02PM -0400, Rick Warner wrote: > > Thanks for the suggestion. Unfortunately, I have now reproduced this > > same problem on a group of 8 Xeon based systems as well, so the problem > > is not specific to the Opterons. > > Do you have another, simpler test, that can demonstrate this problem? > If not, please send instructions how to reproduce and whatever files > needed to reproduce the problem. Alternatively, can you arrange for > remote login to these systems? I have not yet found a test simpler than the NAS tests. I can provide you with a remote login. I'm setting it up now and I'll send you a private email with the login details, etc. 
Thanks, Rick -- Richard Warner Lead Systems Integrator Microway, Inc (508)732-5517 From roland.list at gmail.com Wed Oct 29 11:11:34 2008 From: roland.list at gmail.com (Roland Dreier) Date: Wed, 29 Oct 2008 11:11:34 -0700 Subject: [ofa-general] OOM problem with ib_ipoib? In-Reply-To: <490876DF.2020705@ec.gc.ca> References: <48FF6DFA.9080409@ec.gc.ca> <48FFA62D.3030305@ec.gc.ca> <490083D0.5000807@ec.gc.ca> <490876DF.2020705@ec.gc.ca> Message-ID: > The curious thing is that the OOM occurs even when the ib interfaces > are _not even UP_, although the ib_ipoib module is loaded. So, I cannot > see how it can be an allocation issue in such a case related to usage. Am I > missing something here? The IPoIB CM code allocates receive buffers even before the interface is brought up. Maybe the wrong thing to do, but that's how the code is now at least. > As well, shouldn't the OS handle this transparently via the pdflush which > will write out the data and free up memory? Or does the pdflush not > distinguish between total memory and low memory so that a problem > occurs (yet the OOM happens even when the interfaces are not UP!)? You may really have no free lowmem... keep in mind that the linux mm really does not behave well with 32G of RAM and a 32-bit kernel. It's fundamentally an insane config and so no one tunes for it. - R. From roland.list at gmail.com Wed Oct 29 11:15:47 2008 From: roland.list at gmail.com (Roland Dreier) Date: Wed, 29 Oct 2008 11:15:47 -0700 Subject: [ofa-general] Re: [PATCH] infiniband: Add struct in6_addr addr to union ib_gib In-Reply-To: <1225263521.5269.273.camel@localhost> References: <1225236128.5269.240.camel@localhost> <1225239374.5269.251.camel@localhost> <20081028.223941.18931166.davem@davemloft.net> <1225263521.5269.273.camel@localhost> Message-ID: > Sure.
After Harvey's patches show up in > http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git;a=summary > I'll make a patchset including this union ib_gid addition as > well as changes to drivers/infiniband/ OK I guess. I also don't have any objection to using a different format letter but the same underlying code as the p6 implementation as an alternative for handling the type checking. > Roland, in a separate question, are the infiniband maintainers > also the maintainers of include/rdma/? If "F: patterns" ever > gets accepted into MAINTAINERS, should include/rdma/ be listed > under infiniband? Yes, drivers/infiniband and include/rdma are all part of the InfiniBand/RDMA stack. From DavidRobb at comsci.co.uk Wed Oct 29 12:52:49 2008 From: DavidRobb at comsci.co.uk (David Robb) Date: Wed, 29 Oct 2008 19:52:49 +0000 Subject: [ofa-general] Poor Performance of OpenIB with small packets c.f. Gigabit Ethernet In-Reply-To: <490781B0.9040105@comsci.co.uk> References: <490781B0.9040105@comsci.co.uk> Message-ID: <4908BF11.5050800@comsci.co.uk> Some further info that may provide some clues: The transfer rate appears to be very sensitive to the socket recv/send buffer size settings. Using the default buffer sizes rather than our settings of 128K for send and 256K for recv has increased the v1.3 IPoIB transfer rate to ~ 4MB/s. Leaving the Nagle algorithm enabled by not setting TCP_NODELAY on the socket further increases the rate to ~7.8MB/s. Looking at our timing logs it appears that with TCP_NODELAY set, the socket send call returns EAGAIN before any amount of data is queued to the send buffer? Also, we are seeing the occasional glitch where our Comms layer stalls waiting in an epoll_wait on recv for ~ 200ms. (Replacing the epoll_wait with a polled loop shows that the socket really has no data available for this time.) Could it be that we are depleting the work requests and hence triggering a 'Not Ready' at the receiving end? If so, how much delay would this cause?
Are any of these values configurable when using IPoIB? When using Ethernet, we need to set TCP_NODELAY to avoid latency on the last part of messages. What effect does this setting have when using IPoIB? (It appears to prevent us from filling up the socket send buffer. But is it even required when using low latency Infiniband?) I would be very grateful if someone with greater inside knowledge of this could provide some diagnosis here. TIA Dave Robb David Robb wrote: > We have a data logging application that exhibits poor performance when > operated using TCP/IP sockets and IPoIB. > > With small message sizes ~ 64 bytes, the performance values for our > application are > > OFED 1.2 IPoIB: 2.81MB/s > OFED 1.3 IPoIB: 1.37MB/s > GB Ethernet: 5.38MB/s > > It is not until the message sizes reach 16K or so that the Infiniband > starts to overtake the Ethernet. > > Are these values as expected? > > What further tests could I run to investigate the problem? > > Are there any settings and/or device configuration that we can tweak > to improve the small message performance? > > We are running RH-EL Linux and using Mellanox HCAs and switches. > We recently upgraded to OFED 1.3 and have upgraded the HCA firmware to > the latest 1.2 version. > > Many thanks for any help > > Regards > > David Robb > > From hal.rosenstock at gmail.com Wed Oct 29 14:17:44 2008 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 29 Oct 2008 17:17:44 -0400 Subject: [ofa-general] [opensm] remove qos_max_vls config??
In-Reply-To: <49087EFA.4040401@dev.mellanox.co.il>
References: <1225237965.3358.9.camel@whatsup> <4908292A.40004@dev.mellanox.co.il> <49087EFA.4040401@dev.mellanox.co.il>
Message-ID:

On Wed, Oct 29, 2008 at 11:19 AM, Yevgeny Kliteynik wrote:
> Hal Rosenstock wrote:
>>
>> On Wed, Oct 29, 2008 at 5:13 AM, Yevgeny Kliteynik
>> wrote:
>>>
>>> Al Chu wrote:
>>>>
>>>> Hey Sasha,
>>>>
>>>> I was working on a different bug fix on the qos config parsing, when I
>>>> noticed the qos_*max_vls fields aren't used anywhere. They seem to be
>>>> parsed from the config, stored, and never used. Maybe it used to be
>>>> what 'max_op_vls' is now used for?
>>>
>>> I guess that the initial idea was to have an option to configure
>>> different operational VLs on different type of nodes in the subnet.
>>> The question is, does having such option make sense?
>>
>> Does it impact buffering ? If so, in those cases it would be worth
>> configuring (assuming it gets acted on elsewhere).
>
> Right, it does impact buffering.
> I think that OpenSM always sets the same op_vls on both sides of
> the link (if there is a mismatch, SM will set the lowest value),

Right, it takes the VLCaps on the two ends of the link and sets OpVLs
on both sides to the lower value.

> so we can have different num. of VLs on switch-2-switch links
> and CA-2-switch links.

Assuming switches having one VLCap and CAs another. It might be more
diverse than that in terms of switches and CAs in use.

> Not sure how much value does this ability add,

Me neither. I'm not sure whether the main determinant is VLCap or
OpVLs or some combination of the two (in terms of buffering). The only
data I have as to whether this makes any difference at all is that the
buffering effects are observed with long latency links.

> but perhaps we need
> to implement this configuration instead of removing the parameters...

I agree. I think this might be useful, or at least we should gather
better data on it, so I'd rather see it implemented than excised.
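
The rule described above (take the VLCap reported by each end of the link, use the lower one, and optionally trim it further with a configured maximum) can be sketched in a few lines of C. This is only an illustration and not OpenSM code: `vl_count` and `link_op_vls` are hypothetical names, and the 1..5 encoding of VLCap/OperationalVLs is assumed from the PortInfo attribute as defined in the InfiniBand specification.

```c
#include <stdint.h>

/*
 * Number of data VLs for an encoded VLCap/OperationalVLs value.
 * Assumed encoding: 1 -> VL0, 2 -> VL0-1, 3 -> VL0-3,
 * 4 -> VL0-7, 5 -> VL0-14. Anything else yields 0.
 */
static unsigned vl_count(uint8_t encoded)
{
	static const unsigned counts[] = { 0, 1, 2, 4, 8, 15 };

	return encoded <= 5 ? counts[encoded] : 0;
}

/*
 * Hypothetical helper: choose the OperationalVLs value for one link.
 * Both ends get the lower of the two VLCaps; a non-zero max_op_vls
 * trims the result further (0 means "no configured limit").
 */
static uint8_t link_op_vls(uint8_t vl_cap_a, uint8_t vl_cap_b,
			   uint8_t max_op_vls)
{
	uint8_t op_vls = vl_cap_a < vl_cap_b ? vl_cap_a : vl_cap_b;

	if (max_op_vls && max_op_vls < op_vls)
		op_vls = max_op_vls;
	return op_vls;
}
```

For example, a switch port with VLCap 4 facing a CA port with VLCap 3 would end up with `link_op_vls(4, 3, 0) == 3`, i.e. `vl_count(3) == 4` data VLs on both sides. A per-node-type limit, as discussed in this thread, would amount to passing a different `max_op_vls` depending on what sits at each end.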
-- Hal

> -- Yevgeny
>
>> -- Hal
>>
>>> -- Yevgeny
>>>
>>>> If there's still a purpose for it in the future, obviously no issue on
>>>> leaving in there. Patch is attached to remove it everywhere I found it.
>>>>
>>>> Al
>>>>
>>>> ------------------------------------------------------------------------
>>>>

From hal.rosenstock at gmail.com Wed Oct 29 14:29:29 2008
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 29 Oct 2008 17:29:29 -0400
Subject: ***SPAM*** Re: [ofa-general] [opensm] remove qos_max_vls config??
In-Reply-To: <1225297753.1197.493.camel@cardanus.llnl.gov>
References: <1225237965.3358.9.camel@whatsup> <4908292A.40004@dev.mellanox.co.il> <49087EFA.4040401@dev.mellanox.co.il> <1225297753.1197.493.camel@cardanus.llnl.gov>
Message-ID:

Hi Al,

On Wed, Oct 29, 2008 at 12:29 PM, Al Chu wrote:
> Hey Hal, Yevgeny,
>
> On Wed, 2008-10-29 at 17:19 +0200, Yevgeny Kliteynik wrote:
>> Hal Rosenstock wrote:
>> > On Wed, Oct 29, 2008 at 5:13 AM, Yevgeny Kliteynik
>> > wrote:
>> >> Al Chu wrote:
>> >>> Hey Sasha,
>> >>>
>> >>> I was working on a different bug fix on the qos config parsing, when I
>> >>> noticed the qos_*max_vls fields aren't used anywhere. They seem to be
>> >>> parsed from the config, stored, and never used. Maybe it used to be
>> >>> what 'max_op_vls' is now used for?
>> >> I guess that the initial idea was to have an option to configure
>> >> different operational VLs on different type of nodes in the subnet.
>> >> The question is, does having such option make sense?
>> >
>> > Does it impact buffering ? If so, in those cases it would be worth
>> > configuring (assuming it gets acted on elsewhere).
>>
>> Right, it does impact buffering.
>> I think that OpenSM always sets the same op_vls on both sides of
>> the link (if there is a mismatch, SM will set the lowest value),
>> so we can have different num. of VLs on switch-2-switch links
>> and CA-2-switch links.
>> Not sure how much value does this ability add, but perhaps we need
>> to implement this configuration instead of removing the parameters...
>
> Implementing it would be fine instead of removing its parameters. But I
> think documenting its behavior/existence and opensm not performing the
> behavior is worse. If we're not going to implement it soon (I don't
> mind putting it on my todo for later), perhaps we should at minimum
> comment it out of the documentation/code for the time being?
>
> Would the QoS max_vls override the max_op_vls? Thinking about it a bit,
> wouldn't a max_op_vls_ca, max_op_vls_swe, etc. parameters make more
> sense than the current ones?

max_op_vls was added as an option in pre QoS days to trim things to
something lower than the min of the VLCaps of the two ends of the link
for adding buffering. If your direction is taken, in addition to ca
and swe, there would be enhsp0 and rtr options for this too.
Ultimately, it might even be more granular but that would be
cumbersome to configure.

Also, I would think max_op_vls of whatever flavor needs to be in
concert with the QoS configuration. I'm not sure what would happen if
you set it lower than what the QoS configuration "expected". Guess
those other VLs (above this configured max) would not be supported and
any SLs which tried to use them would get dropped.
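
The concern raised above, that a max_op_vls lower than what the QoS configuration expects could leave some SLs mapped to VLs that are not operational, lends itself to a simple consistency check. The following is a minimal sketch under stated assumptions, not OpenSM code: `data_vls` and `unsupported_sls` are hypothetical names, the SL-to-VL table is taken as a plain 16-entry array, and the 1..5 OperationalVLs encoding is assumed from the PortInfo attribute in the InfiniBand specification.

```c
#include <stdint.h>

#define NUM_SLS 16

/*
 * Data VLs available for an encoded OperationalVLs value
 * (assumed encoding: 1 -> 1 VL, 2 -> 2, 3 -> 4, 4 -> 8, 5 -> 15).
 * Values outside 1..5 yield 0, so every SL is reported as unsupported.
 */
static unsigned data_vls(uint8_t op_vls)
{
	static const unsigned counts[] = { 0, 1, 2, 4, 8, 15 };

	return op_vls <= 5 ? counts[op_vls] : 0;
}

/*
 * Hypothetical check: return a bitmask with bit N set when SL N maps
 * to a VL that is not among the operational data VLs. A table entry
 * of 15 (VL15) is always flagged, since at most VL0-14 carry data.
 */
static uint16_t unsupported_sls(const uint8_t sl2vl[NUM_SLS], uint8_t op_vls)
{
	unsigned n = data_vls(op_vls);
	uint16_t mask = 0;
	unsigned sl;

	for (sl = 0; sl < NUM_SLS; sl++)
		if (sl2vl[sl] >= n)
			mask |= (uint16_t)(1u << sl);
	return mask;
}
```

With a table that spreads the 16 SLs over VL0-7 and an OperationalVLs value of 3 (VL0-3), the check flags SLs 4-7 and 12-15. Whether traffic on those SLs is actually dropped is the open question in this thread; the sketch only identifies the SLs that would be affected.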
-- Hal

> Al
>
>> -- Yevgeny
>>
>> > -- Hal
>> >
>> >> -- Yevgeny
>> >>
>> >>> If there's still a purpose for it in the future, obviously no issue on
>> >>> leaving in there. Patch is attached to remove it everywhere I found it.
>> >>>
>> >>> Al
>> >>>
>> >>> ------------------------------------------------------------------------
>> >>>
>
> --
> Albert Chu
> chu11 at llnl.gov
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory
>

From vlad at lists.openfabrics.org Thu Oct 30 03:24:47 2008
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Thu, 30 Oct 2008 03:24:47 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20081030-0200 daily build status
Message-ID: <20081030102447.4F86DE61174@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed
on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: Build failed on x86_64 with linux-2.6.26 Log: /home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.26_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c: In function 'ioremap_wc': /home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.26_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: error: implicit declaration of function '__ioremap' /home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.26_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: warning: return makes pointer from integer without a cast make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.26_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.26_x86_64_check/drivers/infiniband/hw/ipath] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.26_x86_64_check/drivers/infiniband] Error 2 make[1]: *** 
[_module_/home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.26_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.26' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.25 Log: /home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.25_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c: In function 'ioremap_wc': /home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.25_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: error: implicit declaration of function '__ioremap' /home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.25_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: warning: return makes pointer from integer without a cast make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.25_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.25_x86_64_check/drivers/infiniband/hw/ipath] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.25_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.25_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.25' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.24 Log: /home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.24_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:218: error: 'cpu_data' undeclared (first use in this function) /home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.24_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:218: error: (Each undeclared identifier is reported only once /home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.24_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:218: error: for each function it 
appears in.) make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.24_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.24_x86_64_check/drivers/infiniband/hw/ipath] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.24_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.24_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.24' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-42.ELsmp Log: patching file drivers/infiniband/hw/ipath/ipath_init_chip.c Hunk #1 succeeded at 529 (offset 135 lines). Hunk #2 succeeded at 537 (offset 135 lines). Hunk #3 succeeded at 848 (offset 135 lines). patching file drivers/infiniband/hw/ipath/ipath_sysfs.c patching file drivers/infiniband/hw/ipath/ipath_user_pages.c Patch ipath_0110_2.6.9.patch does not apply (enforce with -f) Failed executing /usr/bin/quilt ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-55.ELsmp Log: patching file drivers/infiniband/hw/ipath/ipath_init_chip.c Hunk #1 succeeded at 529 (offset 135 lines). Hunk #2 succeeded at 537 (offset 135 lines). Hunk #3 succeeded at 848 (offset 135 lines). 
patching file drivers/infiniband/hw/ipath/ipath_sysfs.c patching file drivers/infiniband/hw/ipath/ipath_user_pages.c Patch ipath_0110_2.6.9.patch does not apply (enforce with -f) Failed executing /usr/bin/quilt ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.27 Log: /home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.27_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c: In function 'ioremap_wc': /home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.27_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: error: implicit declaration of function '__ioremap' /home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.27_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: warning: return makes pointer from integer without a cast make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.27_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.27_x86_64_check/drivers/infiniband/hw/ipath] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.27_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081030-0200_linux-2.6.27_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.27' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: patching file drivers/infiniband/hw/ipath/ipath_init_chip.c Hunk #1 succeeded at 529 (offset 135 lines). Hunk #2 succeeded at 537 (offset 135 lines). Hunk #3 succeeded at 848 (offset 135 lines). 
patching file drivers/infiniband/hw/ipath/ipath_sysfs.c patching file drivers/infiniband/hw/ipath/ipath_user_pages.c Patch ipath_0110_2.6.9.patch does not apply (enforce with -f) Failed executing /usr/bin/quilt ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: patching file drivers/infiniband/hw/ipath/ipath_init_chip.c Hunk #1 succeeded at 529 (offset 135 lines). Hunk #2 succeeded at 537 (offset 135 lines). Hunk #3 succeeded at 848 (offset 135 lines). patching file drivers/infiniband/hw/ipath/ipath_sysfs.c patching file drivers/infiniband/hw/ipath/ipath_user_pages.c Patch ipath_0110_2.6.9.patch does not apply (enforce with -f) Failed executing /usr/bin/quilt ---------------------------------------------------------------------------------- From kliteyn at dev.mellanox.co.il Thu Oct 30 07:05:09 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 30 Oct 2008 16:05:09 +0200 Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c: some simplification in LFT handling Message-ID: <4909BF15.5040200@dev.mellanox.co.il> Following the recent LFT simplification, adding some simplification of LFT handling in ftree. 
Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_ucast_ftree.c | 62 +++++++++------------------------------ 1 files changed, 14 insertions(+), 48 deletions(-) diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c index 35a6a1c..5c833e3 100644 --- a/opensm/opensm/osm_ucast_ftree.c +++ b/opensm/opensm/osm_ucast_ftree.c @@ -739,23 +739,6 @@ __osm_ftree_sw_add_port(IN ftree_sw_t * p_sw, /***************************************************/ -static inline void -__osm_ftree_sw_set_fwd_table_block(IN ftree_sw_t * p_sw, IN uint16_t lid_ho, - IN uint8_t port_num) -{ - p_sw->lft_buf[lid_ho] = port_num; -} - -/***************************************************/ - -static inline uint8_t __osm_ftree_sw_get_fwd_table_block(IN ftree_sw_t * p_sw, - IN uint16_t lid_ho) -{ - return p_sw->lft_buf[lid_ho]; -} - -/***************************************************/ - static inline cl_status_t __osm_ftree_sw_set_hops(IN ftree_sw_t * p_sw, IN uint16_t lid_ho, IN uint8_t port_num, @@ -1940,12 +1923,10 @@ static void __osm_ftree_set_sw_fwd_table(IN cl_map_item_t * const p_map_item, ftree_sw_t *p_sw = (ftree_sw_t * const)p_map_item; ftree_fabric_t *p_ftree = (ftree_fabric_t *) context; - /* calculate lft length rounded up to a multiple of 64 (block length) */ - uint16_t lft_len = 64 * ((p_ftree->lft_max_lid_ho + 1 + 63) / 64); - p_sw->p_osm_sw->max_lid_ho = p_ftree->lft_max_lid_ho; - memcpy(p_sw->p_osm_sw->lft_buf, p_sw->lft_buf, lft_len); + memcpy(p_sw->p_osm_sw->lft_buf, p_sw->lft_buf, + IB_LID_UCAST_END_HO + 1); osm_ucast_mgr_set_fwd_table(&p_ftree->p_osm->sm.ucast_mgr, p_sw->p_osm_sw); } @@ -2084,18 +2065,13 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, /* second case: skip the port group if the remote (lower) switch has been already configured for this target LID */ if (is_real_lid && !is_main_path && - __osm_ftree_sw_get_fwd_table_block(p_remote_sw, - cl_ntoh16(target_lid)) != - OSM_NO_PATH) + 
p_remote_sw->lft_buf[cl_ntoh16(target_lid)] != OSM_NO_PATH) continue; /* setting fwd tbl port only if this is real LID */ if (is_real_lid) { - __osm_ftree_sw_set_fwd_table_block(p_remote_sw, - cl_ntoh16 - (target_lid), - p_min_port-> - remote_port_num); + p_remote_sw->lft_buf[cl_ntoh16(target_lid)] = + p_min_port->remote_port_num; OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "Switch %s: set path to CA LID %u through port %u\n", __osm_ftree_tuple_to_str(p_remote_sw->tuple), @@ -2273,11 +2249,8 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, p_min_group->counter_down++; p_min_port->counter_down++; if (is_real_lid) { - __osm_ftree_sw_set_fwd_table_block(p_remote_sw, - cl_ntoh16 - (target_lid), - p_min_port-> - remote_port_num); + p_remote_sw->lft_buf[cl_ntoh16(target_lid)] = + p_min_port->remote_port_num; OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "Switch %s: set path to CA LID %u through port %u\n", __osm_ftree_tuple_to_str(p_remote_sw->tuple), @@ -2352,8 +2325,7 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, p_remote_sw = p_group->remote_hca_or_sw.p_sw; /* skip if target lid has been already set on remote switch fwd tbl */ - if (__osm_ftree_sw_get_fwd_table_block - (p_remote_sw, cl_ntoh16(target_lid)) != OSM_NO_PATH) + if (p_remote_sw->lft_buf[cl_ntoh16(target_lid)] != OSM_NO_PATH) continue; if (p_sw->is_leaf) { @@ -2371,9 +2343,8 @@ __osm_ftree_fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, trying to balance these routes - always pick port 0. */ cl_ptr_vector_at(&p_group->ports, 0, (void *)&p_port); - __osm_ftree_sw_set_fwd_table_block(p_remote_sw, - cl_ntoh16(target_lid), - p_port->remote_port_num); + p_remote_sw->lft_buf[cl_ntoh16(target_lid)] = + p_port->remote_port_num; /* On the remote switch that is pointed by the p_group, set hops for ALL the ports in the remote group. 
*/ @@ -2464,9 +2435,8 @@ static void __osm_ftree_fabric_route_to_cns(IN ftree_fabric_t * p_ftree) /* set local LFT(LID) to the port that is connected to HCA */ cl_ptr_vector_at(&p_leaf_port_group->ports, 0, (void *)&p_port); - __osm_ftree_sw_set_fwd_table_block(p_sw, - cl_ntoh16(hca_lid), - p_port->port_num); + p_sw->lft_buf[cl_ntoh16(hca_lid)] = p_port->port_num; + OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "Switch %s: set path to CN LID %u through port %u\n", __osm_ftree_tuple_to_str(p_sw->tuple), @@ -2574,9 +2544,7 @@ static void __osm_ftree_fabric_route_to_non_cns(IN ftree_fabric_t * p_ftree) cl_ptr_vector_at(&p_hca_port_group->ports, 0, (void *)&p_hca_port); port_num_on_switch = p_hca_port->remote_port_num; - __osm_ftree_sw_set_fwd_table_block(p_sw, - cl_ntoh16(hca_lid), - port_num_on_switch); + p_sw->lft_buf[cl_ntoh16(hca_lid)] = port_num_on_switch; OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "Switch %s: set path to non-CN HCA LID %u through port %u\n", @@ -2632,9 +2600,7 @@ static void __osm_ftree_fabric_route_to_switches(IN ftree_fabric_t * p_ftree) p_next_sw = (ftree_sw_t *) cl_qmap_next(&p_sw->map_item); /* set local LFT(LID) to 0 (route to itself) */ - __osm_ftree_sw_set_fwd_table_block(p_sw, - cl_ntoh16(p_sw->base_lid), - 0); + p_sw->lft_buf[cl_ntoh16(p_sw->base_lid)] = 0; OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "Switch %s (LID %u): routing switch-to-switch paths\n", -- 1.5.1.4 From kliteyn at dev.mellanox.co.il Thu Oct 30 07:12:12 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 30 Oct 2008 16:12:12 +0200 Subject: [ofa-general] [PATCH] opensm: free lft_buf if it matches switch's lft Message-ID: <4909C0BC.1080205@dev.mellanox.co.il> Sasha, This patch frees the switch's lft_buf if it matches the LFT that is currently configured on switch. 
Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_switch.c | 5 ++++ opensm/opensm/osm_ucast_cache.c | 13 ++++++++- opensm/opensm/osm_ucast_mgr.c | 50 ++++++++++++++++++++++---------------- 3 files changed, 45 insertions(+), 23 deletions(-) diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c index 4b07dbc..30d617b 100644 --- a/opensm/opensm/osm_switch.c +++ b/opensm/opensm/osm_switch.c @@ -532,6 +532,11 @@ osm_switch_prepare_path_rebuild(IN osm_switch_t * p_sw, IN uint16_t max_lids) osm_port_prof_construct(&p_sw->p_prof[i]); osm_switch_clear_hops(p_sw); + + if (!p_sw->lft_buf) + if (!(p_sw->lft_buf = malloc(IB_LID_UCAST_END_HO + 1))) + return IB_INSUFFICIENT_MEMORY; + memset(p_sw->lft_buf, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); if (!p_sw->hops) { diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c index 9287e6c..199fdf0 100644 --- a/opensm/opensm/osm_ucast_cache.c +++ b/opensm/opensm/osm_ucast_cache.c @@ -1075,8 +1075,17 @@ osm_ucast_cache_add_node(osm_ucast_mgr_t * p_mgr, /* linear forwarding table */ - p_cache_sw->lft = p_node->sw->lft_buf; - p_node->sw->lft_buf = NULL; + if (p_node->sw->lft_buf) { + /* LFT buffer exists - we use it, because + it is more updated than the switch's LFT */ + p_cache_sw->lft = p_node->sw->lft_buf; + p_node->sw->lft_buf = NULL; + } + else { + /* no LFT buffer, so we use the switch's LFT */ + p_cache_sw->lft = p_node->sw->lft; + p_node->sw->lft = NULL; + } p_cache_sw->max_lid_ho = p_node->sw->max_lid_ho; } else { diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index adb6688..3a1c9a4 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -392,29 +392,37 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr, context.lft_context.node_guid = osm_node_get_node_guid(p_node); context.lft_context.set_method = TRUE; - for (block_id_ho = 0; - osm_switch_get_lft_block(p_sw, block_id_ho, block); - block_id_ho++) { - if 
(!p_sw->need_update && - !memcmp(block, - p_sw->lft_buf + block_id_ho * IB_SMP_DATA_SIZE, - IB_SMP_DATA_SIZE)) - continue; - - OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, - "Writing FT block %u\n", block_id_ho); + if (!p_sw->need_update && + !memcmp(p_sw->lft, p_sw->lft_buf, IB_LID_UCAST_END_HO + 1)) { + free(p_sw->lft_buf); + p_sw->lft_buf = NULL; - status = osm_req_set(p_mgr->sm, p_path, - p_sw->lft_buf + block_id_ho * IB_SMP_DATA_SIZE, - sizeof(block), - IB_MAD_ATTR_LIN_FWD_TBL, - cl_hton32(block_id_ho), - CL_DISP_MSGID_NONE, &context); + } else { - if (status != IB_SUCCESS) - OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A05: " - "Sending linear fwd. tbl. block failed (%s)\n", - ib_get_err_str(status)); + for (block_id_ho = 0; + osm_switch_get_lft_block(p_sw, block_id_ho, block); + block_id_ho++) { + if (!p_sw->need_update && + !memcmp(block, + p_sw->lft_buf + block_id_ho * IB_SMP_DATA_SIZE, + IB_SMP_DATA_SIZE)) + continue; + + OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG, + "Writing FT block %u\n", block_id_ho); + + status = osm_req_set(p_mgr->sm, p_path, + p_sw->lft_buf + block_id_ho * IB_SMP_DATA_SIZE, + sizeof(block), + IB_MAD_ATTR_LIN_FWD_TBL, + cl_hton32(block_id_ho), + CL_DISP_MSGID_NONE, &context); + + if (status != IB_SUCCESS) + OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A05: " + "Sending linear fwd. tbl. 
block failed (%s)\n", + ib_get_err_str(status)); + } } OSM_LOG_EXIT(p_mgr->p_log); -- 1.5.1.4 From ilpo.jarvinen at helsinki.fi Thu Oct 30 04:39:43 2008 From: ilpo.jarvinen at helsinki.fi (=?ISO-8859-1?Q?Ilpo_J=E4rvinen?=) Date: Thu, 30 Oct 2008 13:39:43 +0200 (EET) Subject: [ofa-general] [PATCH 07/10] rdma/nes: reindent mis-indented spinlocks In-Reply-To: References: Message-ID: Signed-off-by: Ilpo Järvinen --- drivers/infiniband/hw/nes/nes_verbs.c | 18 +++++++++--------- 1 files changed, 9 insertions(+), 9 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c index 932e56f..ffdd141 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.c +++ b/drivers/infiniband/hw/nes/nes_verbs.c @@ -220,14 +220,14 @@ static int nes_bind_mw(struct ib_qp *ibqp, struct ib_mw *ibmw, if (nesqp->ibqp_state > IB_QPS_RTS) return -EINVAL; - spin_lock_irqsave(&nesqp->lock, flags); + spin_lock_irqsave(&nesqp->lock, flags); head = nesqp->hwqp.sq_head; qsize = nesqp->hwqp.sq_tail; /* Check for SQ overflow */ if (((head + (2 * qsize) - nesqp->hwqp.sq_tail) % qsize) == (qsize - 1)) { - spin_unlock_irqrestore(&nesqp->lock, flags); + spin_unlock_irqrestore(&nesqp->lock, flags); return -EINVAL; } @@ -269,7 +269,7 @@ static int nes_bind_mw(struct ib_qp *ibqp, struct ib_mw *ibmw, nes_write32(nesdev->regs+NES_WQE_ALLOC, (1 << 24) | 0x00800000 | nesqp->hwqp.qp_id); - spin_unlock_irqrestore(&nesqp->lock, flags); + spin_unlock_irqrestore(&nesqp->lock, flags); return 0; } @@ -3212,7 +3212,7 @@ static int nes_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr, if (nesqp->ibqp_state > IB_QPS_RTS) return -EINVAL; - spin_lock_irqsave(&nesqp->lock, flags); + spin_lock_irqsave(&nesqp->lock, flags); head = nesqp->hwqp.sq_head; @@ -3337,7 +3337,7 @@ static int nes_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr, (counter << 24) | 0x00800000 | nesqp->hwqp.qp_id); } - spin_unlock_irqrestore(&nesqp->lock, flags); + spin_unlock_irqrestore(&nesqp->lock, flags); 
if (err) *bad_wr = ib_wr; @@ -3368,7 +3368,7 @@ static int nes_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *ib_wr, if (nesqp->ibqp_state > IB_QPS_RTS) return -EINVAL; - spin_lock_irqsave(&nesqp->lock, flags); + spin_lock_irqsave(&nesqp->lock, flags); head = nesqp->hwqp.rq_head; @@ -3421,7 +3421,7 @@ static int nes_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *ib_wr, nes_write32(nesdev->regs+NES_WQE_ALLOC, (counter<<24) | nesqp->hwqp.qp_id); } - spin_unlock_irqrestore(&nesqp->lock, flags); + spin_unlock_irqrestore(&nesqp->lock, flags); if (err) *bad_wr = ib_wr; @@ -3453,7 +3453,7 @@ static int nes_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) nes_debug(NES_DBG_CQ, "\n"); - spin_lock_irqsave(&nescq->lock, flags); + spin_lock_irqsave(&nescq->lock, flags); head = nescq->hw_cq.cq_head; cq_size = nescq->hw_cq.cq_size; @@ -3562,7 +3562,7 @@ static int nes_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry) nes_debug(NES_DBG_CQ, "Reporting %u completions for CQ%u.\n", cqe_count, nescq->hw_cq.cq_number); - spin_unlock_irqrestore(&nescq->lock, flags); + spin_unlock_irqrestore(&nescq->lock, flags); return cqe_count; } -- 1.5.2.2 From kliteyn at dev.mellanox.co.il Thu Oct 30 09:03:20 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 30 Oct 2008 18:03:20 +0200 Subject: [ofa-general] [PATCH v2] opensm: free lft_buf if it matches switch's lft Message-ID: <4909DAC8.4040602@dev.mellanox.co.il> Sasha, This patch frees the switch's lft_buf if it matches the LFT that is currently configured on switch. 
Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_switch.c | 5 +++++ opensm/opensm/osm_ucast_cache.c | 31 +++++++++++++++++++++++++++---- opensm/opensm/osm_ucast_mgr.c | 15 +++++++++++++++ 3 files changed, 47 insertions(+), 4 deletions(-) diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c index 4b07dbc..30d617b 100644 --- a/opensm/opensm/osm_switch.c +++ b/opensm/opensm/osm_switch.c @@ -532,6 +532,11 @@ osm_switch_prepare_path_rebuild(IN osm_switch_t * p_sw, IN uint16_t max_lids) osm_port_prof_construct(&p_sw->p_prof[i]); osm_switch_clear_hops(p_sw); + + if (!p_sw->lft_buf) + if (!(p_sw->lft_buf = malloc(IB_LID_UCAST_END_HO + 1))) + return IB_INSUFFICIENT_MEMORY; + memset(p_sw->lft_buf, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); if (!p_sw->hops) { diff --git a/opensm/opensm/osm_ucast_cache.c b/opensm/opensm/osm_ucast_cache.c index 9287e6c..b1ce182 100644 --- a/opensm/opensm/osm_ucast_cache.c +++ b/opensm/opensm/osm_ucast_cache.c @@ -1075,8 +1075,17 @@ osm_ucast_cache_add_node(osm_ucast_mgr_t * p_mgr, /* linear forwarding table */ - p_cache_sw->lft = p_node->sw->lft_buf; - p_node->sw->lft_buf = NULL; + if (p_node->sw->lft_buf) { + /* LFT buffer exists - we use it, because + it is more updated than the switch's LFT */ + p_cache_sw->lft = p_node->sw->lft_buf; + p_node->sw->lft_buf = NULL; + } + else { + /* no LFT buffer, so we use the switch's LFT */ + p_cache_sw->lft = p_node->sw->lft; + p_node->sw->lft = NULL; + } p_cache_sw->max_lid_ho = p_node->sw->max_lid_ho; } else { @@ -1109,6 +1118,7 @@ osm_ucast_cache_process(osm_ucast_mgr_t * p_mgr) { cl_qmap_t *tbl = &p_mgr->p_subn->sw_guid_tbl; cl_map_item_t *item; + osm_switch_t * p_sw; if (!p_mgr->p_subn->opt.use_ucast_cache) return 1; @@ -1121,8 +1131,21 @@ osm_ucast_cache_process(osm_ucast_mgr_t * p_mgr) "Configuring switch tables using cached routing\n"); for (item = cl_qmap_head(tbl); item != cl_qmap_end(tbl); - item = cl_qmap_next(item)) - osm_ucast_mgr_set_fwd_table(p_mgr, (osm_switch_t *)item); 
+ item = cl_qmap_next(item)) { + p_sw = (osm_switch_t *)item; + + if (p_sw->need_update && !p_sw->lft_buf) { + /* no new routing was recently calculated for this + switch, but the LFT needs to be updated anyway */ + p_sw->lft_buf = p_sw->lft; + p_sw->lft = malloc(IB_LID_UCAST_END_HO + 1); + if (!p_sw->lft) + return IB_INSUFFICIENT_MEMORY; + memset(p_sw->lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); + } + + osm_ucast_mgr_set_fwd_table(p_mgr, p_sw); + } return 0; } diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c index adb6688..175817c 100644 --- a/opensm/opensm/osm_ucast_mgr.c +++ b/opensm/opensm/osm_ucast_mgr.c @@ -392,6 +392,20 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr, context.lft_context.node_guid = osm_node_get_node_guid(p_node); context.lft_context.set_method = TRUE; + if (!p_sw->lft_buf) { + /* any routing should provide the lft_buf */ + CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache && + p_mgr->cache_valid && !p_sw->need_update); + goto Exit; + } + + if (!p_sw->need_update && + !memcmp(p_sw->lft, p_sw->lft_buf, IB_LID_UCAST_END_HO + 1)) { + free(p_sw->lft_buf); + p_sw->lft_buf = NULL; + goto Exit; + } + for (block_id_ho = 0; osm_switch_get_lft_block(p_sw, block_id_ho, block); block_id_ho++) { @@ -417,6 +431,7 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr, ib_get_err_str(status)); } +Exit: OSM_LOG_EXIT(p_mgr->p_log); return 0; } -- 1.5.1.4 From sashak at voltaire.com Thu Oct 30 12:51:59 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 30 Oct 2008 21:51:59 +0200 Subject: [ofa-general] Re: [PATCH 2/2 v2] opensm: replace switch's fwd_tbl with simple LFT - remove obsolete files In-Reply-To: <49085EC6.7060404@dev.mellanox.co.il> References: <49085EC6.7060404@dev.mellanox.co.il> Message-ID: <20081030195159.GL7502@sashak.voltaire.com> On 15:01 Wed 29 Oct , Yevgeny Kliteynik wrote: > Remove all the fwd_tbl files that became obsolete. 
> > [v2 - no changes, just rebased] > > Signed-off-by: Yevgeny Kliteynik Both applied. Thanks. Sasha From sashak at voltaire.com Thu Oct 30 12:56:06 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 30 Oct 2008 21:56:06 +0200 Subject: [ofa-general] Re: [PATCH] opensm/osm_ucast_ftree.c: some simplification in LFT handling In-Reply-To: <4909BF15.5040200@dev.mellanox.co.il> References: <4909BF15.5040200@dev.mellanox.co.il> Message-ID: <20081030195606.GM7502@sashak.voltaire.com> On 16:05 Thu 30 Oct , Yevgeny Kliteynik wrote: > Following the recent LFT simplification, adding > some simplification of LFT handling in ftree. > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. Sasha From sashak at voltaire.com Thu Oct 30 14:45:19 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 30 Oct 2008 23:45:19 +0200 Subject: [ofa-general] Re: [PATCH v2] opensm: free lft_buf if it matches switch's lft In-Reply-To: <4909DAC8.4040602@dev.mellanox.co.il> References: <4909DAC8.4040602@dev.mellanox.co.il> Message-ID: <20081030214519.GN7502@sashak.voltaire.com> Hi Yevgeny, On 18:03 Thu 30 Oct , Yevgeny Kliteynik wrote: > Sasha, > > This patch frees the switch's lft_buf if it matches the > LFT that is currently configured on switch. > > Signed-off-by: Yevgeny Kliteynik I applied this. Thanks. However one question below. [snip...] 
> diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c > index adb6688..175817c 100644 > --- a/opensm/opensm/osm_ucast_mgr.c > +++ b/opensm/opensm/osm_ucast_mgr.c > @@ -392,6 +392,20 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr, > context.lft_context.node_guid = osm_node_get_node_guid(p_node); > context.lft_context.set_method = TRUE; > > + if (!p_sw->lft_buf) { > + /* any routing should provide the lft_buf */ > + CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache && > + p_mgr->cache_valid && !p_sw->need_update); > + goto Exit; > + } > + > + if (!p_sw->need_update && > + !memcmp(p_sw->lft, p_sw->lft_buf, IB_LID_UCAST_END_HO + 1)) { > + free(p_sw->lft_buf); > + p_sw->lft_buf = NULL; > + goto Exit; > + } > + So buffers are freed only on next routing iteration (heavy sweep). Isn't it better to drop it when LFT images from switches are received in osm_lin_fwd_rcv.c? Sasha From sashak at voltaire.com Thu Oct 30 14:51:02 2008 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 30 Oct 2008 23:51:02 +0200 Subject: [ofa-general] [opensm] remove qos_max_vls config?? In-Reply-To: References: <1225237965.3358.9.camel@whatsup> <4908292A.40004@dev.mellanox.co.il> <49087EFA.4040401@dev.mellanox.co.il> Message-ID: <20081030215102.GO7502@sashak.voltaire.com> Hi, On 17:17 Wed 29 Oct , Hal Rosenstock wrote: > > > but perhaps we need > > to implement this configuration instead of removing the parameters... > > I agree. I think this might be useful or at least gather better data > on it so I'd rather see it implemented than excised. Me too. I don't really remember why it was not done originally, likely just was forgotten. 
Sasha From kliteyn at dev.mellanox.co.il Thu Oct 30 14:51:25 2008 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 30 Oct 2008 23:51:25 +0200 Subject: [ofa-general] Re: [PATCH v2] opensm: free lft_buf if it matches switch's lft In-Reply-To: <20081030214519.GN7502@sashak.voltaire.com> References: <4909DAC8.4040602@dev.mellanox.co.il> <20081030214519.GN7502@sashak.voltaire.com> Message-ID: <490A2C5D.4080309@dev.mellanox.co.il> Sasha Khapyorsky wrote: > Hi Yevgeny, > > On 18:03 Thu 30 Oct , Yevgeny Kliteynik wrote: >> Sasha, >> >> This patch frees the switch's lft_buf if it matches the >> LFT that is currently configured on switch. >> >> Signed-off-by: Yevgeny Kliteynik > > I applied this. Thanks. However one question below. > > [snip...] > >> diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c >> index adb6688..175817c 100644 >> --- a/opensm/opensm/osm_ucast_mgr.c >> +++ b/opensm/opensm/osm_ucast_mgr.c >> @@ -392,6 +392,20 @@ int osm_ucast_mgr_set_fwd_table(IN osm_ucast_mgr_t * const p_mgr, >> context.lft_context.node_guid = osm_node_get_node_guid(p_node); >> context.lft_context.set_method = TRUE; >> >> + if (!p_sw->lft_buf) { >> + /* any routing should provide the lft_buf */ >> + CL_ASSERT(p_mgr->p_subn->opt.use_ucast_cache && >> + p_mgr->cache_valid && !p_sw->need_update); >> + goto Exit; >> + } >> + >> + if (!p_sw->need_update && >> + !memcmp(p_sw->lft, p_sw->lft_buf, IB_LID_UCAST_END_HO + 1)) { >> + free(p_sw->lft_buf); >> + p_sw->lft_buf = NULL; >> + goto Exit; >> + } >> + > > So buffers are freed only on next routing iteration (heavy sweep). Isn't > it better to drop it when LFT images from switches are received in > osm_lin_fwd_rcv.c? Sure, why not. That way the memory would be freed faster. 
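The check quoted above can be modeled in user space roughly as follows. This is a minimal sketch, not OpenSM code: `switch_model`, `LFT_SIZE_MODEL` and `set_fwd_table_model` are hypothetical names, and the table size is a small stand-in for `IB_LID_UCAST_END_HO + 1`. It only shows the decision the patch makes: drop the freshly calculated buffer when it equals the table already on the switch and no update was requested.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define LFT_SIZE_MODEL 16	/* hypothetical stand-in for IB_LID_UCAST_END_HO + 1 */

/* Hypothetical, reduced model of the fields the patch touches. */
struct switch_model {
	unsigned char *lft;	/* table currently configured on the switch */
	unsigned char *lft_buf;	/* newly calculated table, or NULL */
	int need_update;	/* nonzero if the switch must be rewritten anyway */
};

/*
 * Returns 1 if the LFT blocks would have to be sent to the switch,
 * 0 if nothing needs to be sent.
 */
static int set_fwd_table_model(struct switch_model *sw)
{
	assert(sw != NULL && sw->lft != NULL);

	/* no new routing was calculated for this switch */
	if (!sw->lft_buf)
		return 0;

	/* new table equals the configured one: free the buffer and skip */
	if (!sw->need_update &&
	    !memcmp(sw->lft, sw->lft_buf, LFT_SIZE_MODEL)) {
		free(sw->lft_buf);
		sw->lft_buf = NULL;
		return 0;
	}

	return 1;
}
```

Moving the free into the LFT receive path, as suggested, would release the buffer as soon as the switch reports a matching table instead of on the next routing pass; the model above does not attempt to show that variant.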
-- Yevgeny

> Sasha
>

From chu11 at llnl.gov  Thu Oct 30 15:01:21 2008
From: chu11 at llnl.gov (Al Chu)
Date: Thu, 30 Oct 2008 15:01:21 -0700
Subject: [ofa-general] [opensm patch][2/2] verify config inputs when config file is rescanned
Message-ID: <1225404081.1197.534.camel@cardanus.llnl.gov>

Hey Sasha,

I noticed that after the config file is rescanned, the new potential
inputs aren't checked for validity. Patch is attached.

Al

-- 
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-verify-rescanned-config-input.patch
Type: text/x-patch
Size: 1044 bytes
Desc: not available
URL:

From chu11 at llnl.gov  Thu Oct 30 15:01:18 2008
From: chu11 at llnl.gov (Al Chu)
Date: Thu, 30 Oct 2008 15:01:18 -0700
Subject: [ofa-general] [opensm patch][1/2] fix qos config parsing bugs
Message-ID: <1225404078.1197.533.camel@cardanus.llnl.gov>

Hey Sasha,

I found a bunch of qos config parsing issues, listed below:

1) If the user sets the qos default fields (i.e. qos_high_limit,
qos_vlarb_high, etc.), but does not have the qos_ca, qos_swe, qos_rtr,
etc. equivalent fields listed (i.e. qos_ca_high_limit,
qos_sw0_vlarb_high), the values set in the qos default fields are not
loaded into the CAs, switches, etc.

The reason is in qos_build_config() we load defaults like this:

p = opt->vlarb_high ? opt->vlarb_high : dflt->vlarb_high;

but we always set the fields to something non-NULL.

static void subn_set_default_qos_options(IN osm_qos_options_t * opt)
{
	opt->max_vls = OSM_DEFAULT_QOS_MAX_VLS;
	opt->high_limit = OSM_DEFAULT_QOS_HIGH_LIMIT;
	opt->vlarb_high = OSM_DEFAULT_QOS_VLARB_HIGH;
	opt->vlarb_low = OSM_DEFAULT_QOS_VLARB_LOW;
	opt->sl2vl = OSM_DEFAULT_QOS_SL2VL;
}

2) In qos_build_config() we load the high_limit like this:

cfg->vl_high_limit = (uint8_t) opt->high_limit;

So there is no way to tell the qos_ca, qos_swe, qos_rtr, etc.
high_limit options to "go back to" the default high_limit. It just
assumes that whatever is input (or was set by default) is what you
should use.

3) Some fields like qos_vlarb_high are assumed to be correctly set and
can segfault opensm when they are not.

The attached patch fixes these up. Obviously there's tons of ways to do
this. I decided to ...

A) only initialize qos_options to the real defaults

B) init all qos_*_options to sentinel values (-1, NULL, etc.) to
indicate it should use the configured defaults if they aren't set by
the user. The high_limit was changed from an unsigned to an int b/c 0
is a valid high_limit value.

C) verify that the default qos inputs are definitely correct (i.e.
can't be NULL). Reset to hard coded defaults if need be.

D) load the default vs. non-default appropriately in QoS.

Al

P.S. This patch does not rely on my previous "remove qos_max_vls
config" patch. I assume we're keeping the max_vls fields in this patch.

-- 
Albert Chu
chu11 at llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-fix-qos-config-parsing-bugs.patch
Type: text/x-patch
Size: 21150 bytes
Desc: not available
URL:

From jon at opengridcomputing.com  Thu Oct 30 15:03:04 2008
From: jon at opengridcomputing.com (Jon Mason)
Date: Thu, 30 Oct 2008 17:03:04 -0500
Subject: [ofa-general] [PATCH] rds: Fix IB regression from last patch
Message-ID: <20081030220304.GB17522@opengridcomputing.com>

The patch I sent out on 10/10/08 for iWARP zcopy enablement on RDS broke
IB support. The patch below fixes the regression. More testing is
needed on iWARP zcopy, but fixing IB is paramount.

Thanks,
Jon

Signed-Off-By: Jon Mason

diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 2ca48d5..1da664e 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -97,12 +97,12 @@ void rds_ib_recv_init_ring(struct rds_ib_connection *ic)
 		sge = rds_ib_data_sge(ic, recv->r_sge);
 		sge->addr = 0;
 		sge->length = RDS_FRAG_SIZE;
-		sge->lkey = 0;
+		sge->lkey = rds_ib_local_dma_lkey(ic);

 		sge = rds_ib_header_sge(ic, recv->r_sge);
 		sge->addr = ic->i_recv_hdrs_dma + (i * sizeof(struct rds_header));
 		sge->length = sizeof(struct rds_header);
-		sge->lkey = 0;
+		sge->lkey = rds_ib_local_dma_lkey(ic);
 	}
 }

From sashak at voltaire.com  Thu Oct 30 21:32:26 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 31 Oct 2008 06:32:26 +0200
Subject: [ofa-general] Re: [PATCH v2] opensm: free lft_buf if it matches switch's lft
In-Reply-To: <490A2C5D.4080309@dev.mellanox.co.il>
References: <4909DAC8.4040602@dev.mellanox.co.il> <20081030214519.GN7502@sashak.voltaire.com> <490A2C5D.4080309@dev.mellanox.co.il>
Message-ID: <20081031043226.GH16455@sashak.voltaire.com>

On 23:51 Thu 30 Oct     , Yevgeny Kliteynik wrote:
>
> Sure, why not. That way the memory would be freed faster.

Patch?

Sasha

From sashak at voltaire.com  Thu Oct 30 21:33:43 2008
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 31 Oct 2008 06:33:43 +0200
Subject: [ofa-general] Re: invalid OpenSM version in OFED
In-Reply-To: <20081023124406.GH28713@sashak.voltaire.com>
References: <20081023124406.GH28713@sashak.voltaire.com>
Message-ID: <20081031043343.GI16455@sashak.voltaire.com>

On 14:44 Thu 23 Oct     , Sasha Khapyorsky wrote:
> Hi Vlad,
>
> I noticed that OFED...tgz has OpenSM (and I guess other management
> packages) packaged with invalid version. For example OFED-1.4-rc3 OpenSM
> package instead of the real daily version 3.2.2_20081019_ad24a3e has
> only 3.2.2 (which actually was almost 200 commits before).
>
> For this reason I cannot detect properly installed OpenSM version when
> binary package was used.
> It makes problem investigation and debugging
> unnecessarily difficult.
>
> Could we fix this ASAP?

Any news?

Sasha

From jgarzik at pobox.com  Thu Oct 30 21:55:05 2008
From: jgarzik at pobox.com (Jeff Garzik)
Date: Fri, 31 Oct 2008 00:55:05 -0400
Subject: [ofa-general][PATCH 1/3]mlx4: Multiple completion vectors support
In-Reply-To: <4907348E.7060508@mellanox.co.il>
References: <4907348E.7060508@mellanox.co.il>
Message-ID: <490A8FA9.7080802@pobox.com>

Yevgeny Petrilin wrote:
> The driver now creates a completion EQ for every cpu.
> While allocating CQ a ULP asks a completion vector number
> it wants the CQ to be attached to. The number of completion
> vectors is advertised via ib_device.num_comp_vectors
>
> Signed-off-by: Yevgeny Petrilin
> ---
>  drivers/infiniband/hw/mlx4/cq.c   |    2 +-
>  drivers/infiniband/hw/mlx4/main.c |    2 +-
>  drivers/net/mlx4/cq.c             |   14 ++++++++--
>  drivers/net/mlx4/en_cq.c          |    9 ++++--
>  drivers/net/mlx4/en_main.c        |    4 +-
>  drivers/net/mlx4/eq.c             |   47 ++++++++++++++++++++++++------------
>  drivers/net/mlx4/main.c           |   14 ++++++----
>  drivers/net/mlx4/mlx4.h           |    4 +-
>  include/linux/mlx4/device.h       |    4 ++-
>  9 files changed, 65 insertions(+), 35 deletions(-)

Roland, OK for me to merge this via net-next (the standard avenue
for drivers/net patches during -rc)?
From vlad at lists.openfabrics.org Fri Oct 31 03:07:43 2008 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 31 Oct 2008 03:07:43 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20081031-0200 daily build status Message-ID: <20081031100743.72BA3E60B82@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22.5-31-default Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.19 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Failed: Build failed on x86_64 with linux-2.6.26 Log: 
/home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.26_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c: In function 'ioremap_wc': /home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.26_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: error: implicit declaration of function '__ioremap' /home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.26_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: warning: return makes pointer from integer without a cast make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.26_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.26_x86_64_check/drivers/infiniband/hw/ipath] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.26_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.26_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.26' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.25 Log: /home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.25_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c: In function 'ioremap_wc': /home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.25_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: error: implicit declaration of function '__ioremap' /home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.25_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: warning: return makes pointer from integer without a cast make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.25_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.25_x86_64_check/drivers/infiniband/hw/ipath] Error 2 make[2]: *** 
[/home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.25_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.25_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.25' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.24 Log: /home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.24_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:218: error: 'cpu_data' undeclared (first use in this function) /home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.24_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:218: error: (Each undeclared identifier is reported only once /home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.24_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:218: error: for each function it appears in.) make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.24_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.24_x86_64_check/drivers/infiniband/hw/ipath] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.24_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.24_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.24' make: *** [kernel] Error 2 ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-42.ELsmp Log: patching file drivers/infiniband/hw/ipath/ipath_init_chip.c Hunk #1 succeeded at 529 (offset 135 lines). Hunk #2 succeeded at 537 (offset 135 lines). Hunk #3 succeeded at 848 (offset 135 lines). 
patching file drivers/infiniband/hw/ipath/ipath_sysfs.c patching file drivers/infiniband/hw/ipath/ipath_user_pages.c Patch ipath_0110_2.6.9.patch does not apply (enforce with -f) Failed executing /usr/bin/quilt ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-55.ELsmp Log: patching file drivers/infiniband/hw/ipath/ipath_init_chip.c Hunk #1 succeeded at 529 (offset 135 lines). Hunk #2 succeeded at 537 (offset 135 lines). Hunk #3 succeeded at 848 (offset 135 lines). patching file drivers/infiniband/hw/ipath/ipath_sysfs.c patching file drivers/infiniband/hw/ipath/ipath_user_pages.c Patch ipath_0110_2.6.9.patch does not apply (enforce with -f) Failed executing /usr/bin/quilt ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.27 Log: /home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.27_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c: In function 'ioremap_wc': /home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.27_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: error: implicit declaration of function '__ioremap' /home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.27_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.c:260: warning: return makes pointer from integer without a cast make[4]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.27_x86_64_check/drivers/infiniband/hw/ipath/ipath_wc_pat.o] Error 1 make[3]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.27_x86_64_check/drivers/infiniband/hw/ipath] Error 2 make[2]: *** [/home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.27_x86_64_check/drivers/infiniband] Error 2 make[1]: *** [_module_/home/vlad/tmp/ofa_1_4_kernel-20081031-0200_linux-2.6.27_x86_64_check] Error 2 make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.27' make: *** [kernel] Error 2 
---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-78.ELsmp Log: patching file drivers/infiniband/hw/ipath/ipath_init_chip.c Hunk #1 succeeded at 529 (offset 135 lines). Hunk #2 succeeded at 537 (offset 135 lines). Hunk #3 succeeded at 848 (offset 135 lines). patching file drivers/infiniband/hw/ipath/ipath_sysfs.c patching file drivers/infiniband/hw/ipath/ipath_user_pages.c Patch ipath_0110_2.6.9.patch does not apply (enforce with -f) Failed executing /usr/bin/quilt ---------------------------------------------------------------------------------- Build failed on x86_64 with linux-2.6.9-67.ELsmp Log: patching file drivers/infiniband/hw/ipath/ipath_init_chip.c Hunk #1 succeeded at 529 (offset 135 lines). Hunk #2 succeeded at 537 (offset 135 lines). Hunk #3 succeeded at 848 (offset 135 lines). patching file drivers/infiniband/hw/ipath/ipath_sysfs.c patching file drivers/infiniband/hw/ipath/ipath_user_pages.c Patch ipath_0110_2.6.9.patch does not apply (enforce with -f) Failed executing /usr/bin/quilt ---------------------------------------------------------------------------------- From yosefe at Voltaire.COM Fri Oct 31 05:55:28 2008 From: yosefe at Voltaire.COM (Yossi Etigin) Date: Fri, 31 Oct 2008 14:55:28 +0200 Subject: [ofa-general] [PATCH] ipoib: fix hang in ipoib_flush_paths Message-ID: <490B0040.3040802@Voltaire.COM> Fixes a hang in ipoib_flush_paths during sm up/down loop. Even if path_rec_start() fails (for instance, because there is no sm_ah), the path is added to the path list by neigh_add_path(). Then, ipoib_flush_paths() will wait for path->done, but it will never complete because the request was not issued at all. Signed-off-by: Yossi Etigin -- Fixes bugzilla 1329. 
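The hang described above can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the ipoib code: `completion_model`, `path_model`, `path_rec_start_model` and `flush_can_finish` are hypothetical names, the boolean flag stands in for the kernel's `struct completion`, and `sa_result` stands in for the return value of the SA query call. The point is only that the failure branch must signal `done`, because nothing else will.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct completion. */
struct completion_model {
	bool done;
};

/* Hypothetical, reduced model of the path fields involved. */
struct path_model {
	int query_id;
	void *query;			/* non-NULL while a query is outstanding */
	struct completion_model done;
};

static void complete_model(struct completion_model *c)
{
	c->done = true;
}

/*
 * Models path_rec_start(): sa_result < 0 means the query could not be
 * issued at all (for example, no SM address handle is available).
 */
static int path_rec_start_model(struct path_model *path, int sa_result)
{
	static int query_token;		/* dummy object to point path->query at */

	assert(path != NULL);
	path->done.done = false;	/* the completion is re-armed first */
	path->query_id = sa_result;

	if (path->query_id < 0) {
		path->query = NULL;
		/* no callback will ever run, so signal completion here */
		complete_model(&path->done);
		return path->query_id;
	}

	path->query = &query_token;
	return 0;
}

/* Models whether a flush waiting on path->done would be able to return. */
static bool flush_can_finish(const struct path_model *path)
{
	return path->done.done;
}
```

Without the `complete_model()` call in the failure branch, `flush_can_finish()` would stay false for a path whose query was never issued, which is the hang the patch addresses.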

Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c	2008-10-31 14:15:03.000000000 +0200
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c	2008-10-31 14:42:22.000000000 +0200
@@ -523,6 +523,7 @@ static int path_rec_start(struct net_dev
 	if (path->query_id < 0) {
 		ipoib_warn(priv, "ib_sa_path_rec_get failed: %d\n", path->query_id);
 		path->query = NULL;
+		complete(&path->done);
 		return path->query_id;
 	}

--
--Yossi

From yosefe at Voltaire.COM  Fri Oct 31 06:01:58 2008
From: yosefe at Voltaire.COM (Yossi Etigin)
Date: Fri, 31 Oct 2008 15:01:58 +0200
Subject: [ofa-general] [PATCH] ipoib: fix crash in path_rec_completion
Message-ID: <490B01C6.7020302@Voltaire.COM>

Fix a crash in path_rec_completion() during sm up/down loop.
If more than one path record request is issued, the first completion
releases path->done, allowing ipoib_flush_paths() to free the path, and
thus corrupting it for the second completion.

Signed-off-by: Yossi Etigin

--

Fixes bugzilla 1325.
The flush levels patch added the field 'path->valid' and changed the
test 'if (!path)' to 'if (!path || !path->valid)'. This change made it
possible for a path with an outstanding query to pass the test and
issue another query on the same path. Having two queries on the same
path leads to a crash.
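The guard this patch introduces can be sketched in user space as follows. The names `path_model`, `path_rec_start_model`, `maybe_start_query` and `path_rec_completion_model` are hypothetical, and the `outstanding` counter exists only in this model to make the invariant visible: at most one query per path is in flight at any time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced model of a path with at most one query in flight. */
struct path_model {
	void *query;		/* non-NULL while a query is outstanding */
	int outstanding;	/* model-only counter of in-flight queries */
};

/* Models a successful path_rec_start(): one more query is in flight. */
static int path_rec_start_model(struct path_model *path)
{
	static int query_token;	/* dummy object to point path->query at */

	path->query = &query_token;
	path->outstanding++;
	return 0;
}

/*
 * Models the patched condition: start a query only when none is
 * outstanding. Returns nonzero when the caller should take its
 * error path.
 */
static int maybe_start_query(struct path_model *path)
{
	assert(path != NULL);
	return !path->query && path_rec_start_model(path);
}

/* Models the completion callback clearing the outstanding query. */
static void path_rec_completion_model(struct path_model *path)
{
	path->query = NULL;
	path->outstanding--;
}
```

Calling `maybe_start_query()` twice in a row leaves `outstanding` at 1; a new query can only be started after `path_rec_completion_model()` has run.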

Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c	2008-10-31 14:13:28.000000000 +0200
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c	2008-10-31 14:15:03.000000000 +0200
@@ -639,7 +639,7 @@ static void unicast_arp_send(struct sk_b
 			skb_push(skb, sizeof *phdr);
 			__skb_queue_tail(&path->queue, skb);

-			if (path_rec_start(dev, path)) {
+			if (!path->query && path_rec_start(dev, path)) {
 				spin_unlock_irqrestore(&priv->lock, flags);
 				path_free(dev, path);
 				return;

--
--Yossi

From rdreier at cisco.com  Fri Oct 31 10:34:11 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 31 Oct 2008 10:34:11 -0700
Subject: [ofa-general][PATCH 1/3]mlx4: Multiple completion vectors support
In-Reply-To: <490A8FA9.7080802@pobox.com> (Jeff Garzik's message of "Fri, 31 Oct 2008 00:55:05 -0400")
References: <4907348E.7060508@mellanox.co.il> <490A8FA9.7080802@pobox.com>
Message-ID:

 > Roland, OK for me to merge this via net-next (the standard avenue
 > for drivers/net patches during -rc)?

Actually please let me review this and merge it through my tree, since
it has a bigger impact on the IB side of mlx4.

 - R.

From rdreier at cisco.com  Fri Oct 31 10:36:28 2008
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 31 Oct 2008 10:36:28 -0700
Subject: [ofa-general] RE: [PATCH v2] RDMA/nes: Mitigate compatibility issue regarding PCI write credits
In-Reply-To: <60BEFF3FBD4C6047B0F13F205CAFA383030EF62125@azsmsx501.amr.corp.intel.com> (Chien Tin Tung's message of "Wed, 29 Oct 2008 07:40:31 -0700")
References: <20081029002853.GA3212@ctung-MOBL> <60BEFF3FBD4C6047B0F13F205CAFA383030EF62125@azsmsx501.amr.corp.intel.com>
Message-ID:

 > I normally don't indent that way either but CodingStyle doc said I
 > can't use spaces to indent...
 >
 > "Outside of comments, documentation and except in Kconfig, spaces are never
 > used for indentation, and the above example is deliberately broken."

Don't take that too literally.  When you want to indent by less than a
full tab-stop to align things properly, then spaces are fine.

 > Thanks for the change.  Want a v3 for this patch?

Yes please.

 - R.

From chien.tin.tung at intel.com  Fri Oct 31 10:44:16 2008
From: chien.tin.tung at intel.com (Tung, Chien Tin)
Date: Fri, 31 Oct 2008 10:44:16 -0700
Subject: [ofa-general] RE: [PATCH v2] RDMA/nes: Mitigate compatibility issue regarding PCI write credits
In-Reply-To:
References: <20081029002853.GA3212@ctung-MOBL> <60BEFF3FBD4C6047B0F13F205CAFA383030EF62125@azsmsx501.amr.corp.intel.com>
Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA383030F025268@azsmsx501.amr.corp.intel.com>

> > "Outside of comments, documentation and except in Kconfig,
>spaces are never
> > used for indentation, and the above example is deliberately broken."
>
>Don't take that too literally.  When you want to indent by less than a
>full tab-stop to align things properly, then spaces are fine.

I'm trying very hard to avoid another formatting discussion.

> > Thanks for the change.  Want a v3 for this patch?
>
>Yes please.

I'm re-submitting all three.  Keep the ones you like.  Comments are welcome.

Thanks,

Chien

From yosefe at Voltaire.COM  Fri Oct 31 10:46:52 2008
From: yosefe at Voltaire.COM (Yossi Etigin)
Date: Fri, 31 Oct 2008 19:46:52 +0200
Subject: [ofa-general] [PATCH] ipoib: Fix loss of connectivity after bonding failover on both sides
Message-ID: <490B448C.5080306@Voltaire.COM>

Fix bonding failover in the case both peers have failover and gratuitous
arp is lost. In that case, ipoib sender side will create ipoib_neigh and
issue a path request with the old gid first. When
skb->dst->neighbour->ha changes due to arp refresh, ipoib_neigh will not
be added to the path->list of the path of the new mgid, because
ipoib_neigh already exists. It will not have an ah either, because of
sender-side failover. Therefore, it will not get an ah when the path is
resolved.
The solution here is to compare gids even if neigh->ah is invalid, and
also to initialize neigh->dgid.raw so there is a value to compare
against.

Signed-off-by: Yossi Etigin

---

Fix bugzilla 1286.

 drivers/infiniband/ulp/ipoib/ipoib_main.c |   40 +++++++++++++++---------------
 1 file changed, 21 insertions(+), 19 deletions(-)

Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c	2008-10-31 19:17:42.000000000 +0200
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c	2008-10-31 19:24:52.000000000 +0200
@@ -584,6 +584,8 @@ static void neigh_add_path(struct sk_buf
 		ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha));
 	} else {
 		neigh->ah  = NULL;
+		memcpy(&neigh->dgid.raw, skb->dst->neighbour->ha + 4,
+		       sizeof(union ib_gid));

 		if (!path->query && path_rec_start(dev, path))
 			goto err_list;
@@ -687,26 +689,26 @@ static int ipoib_start_xmit(struct sk_bu

 		neigh = *to_ipoib_neigh(skb->dst->neighbour);

-		if (neigh->ah)
-			if (unlikely((memcmp(&neigh->dgid.raw,
-					    skb->dst->neighbour->ha + 4,
-					    sizeof(union ib_gid))) ||
-					 (neigh->dev != dev))) {
-				spin_lock_irqsave(&priv->lock, flags);
-				/*
-				 * It's safe to call ipoib_put_ah() inside
-				 * priv->lock here, because we know that
-				 * path->ah will always hold one more reference,
-				 * so ipoib_put_ah() will never do more than
-				 * decrement the ref count.
-				 */
+		if (unlikely((memcmp(&neigh->dgid.raw,
+				    skb->dst->neighbour->ha + 4,
+				    sizeof(union ib_gid))) ||
+				 (neigh->dev != dev))) {
+			spin_lock_irqsave(&priv->lock, flags);
+			/*
+			 * It's safe to call ipoib_put_ah() inside
+			 * priv->lock here, because we know that
+			 * path->ah will always hold one more reference,
+			 * so ipoib_put_ah() will never do more than
+			 * decrement the ref count.
+			 */
+			if (neigh->ah)
 				ipoib_put_ah(neigh->ah);
-				list_del(&neigh->list);
-				ipoib_neigh_free(dev, neigh);
-				spin_unlock_irqrestore(&priv->lock, flags);
-				ipoib_path_lookup(skb, dev);
-				return NETDEV_TX_OK;
-			}
+			list_del(&neigh->list);
+			ipoib_neigh_free(dev, neigh);
+			spin_unlock_irqrestore(&priv->lock, flags);
+			ipoib_path_lookup(skb, dev);
+			return NETDEV_TX_OK;
+		}

 		if (ipoib_cm_get(neigh)) {
 			if (ipoib_cm_up(neigh)) {

--
--Yossi

From vst at vlnb.net  Fri Oct 31 10:51:44 2008
From: vst at vlnb.net (Vladislav Bolkhovitin)
Date: Fri, 31 Oct 2008 20:51:44 +0300
Subject: [ofa-general] SRP/mlx4 interrupts throttling performance
In-Reply-To: <49022438.9030903@harr.org>
References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48F79CA9.8090806@vlnb.net> <49022438.9030903@harr.org>
Message-ID: <490B45B0.7030208@vlnb.net>

Cameron Harr wrote:
> Vladislav Bolkhovitin wrote:
>>> ** Sometimes the benchmark "zombied" (process doing no work, but
>>> process can't be killed) after running a certain amount of time.
>>> However, it wasn't repeatable in a reliable way, so I mark that this
>>> particular run has zombied before.
>> That means that there is a bug somewhere. Usually such bugs are found
>> in few hours of code auditing (srpt driver is pretty simple) or by
>> using kernel debug facilities (example diff to .config attached). I
>> personally always prefer put my effort on fixing real things, not
>> inventing various workarounds, like srpt_thread in this case.
>>
>> So I would:
>>
>> 1. Completely remove srpt thread and all related code.
It doesn't do >> anything, which can't be done in SIRQ context (tasklet) >> >> 2. Audit the code to check if it does any action, which it shouldn't >> do on SIRQ and fix it. This step isn't required, but usually it saves >> a lot of time of puzzled debugging in the future. >> >> 3. Change in srpt_handle_rdma_comp() and srpt_handle_new_iu() >> SCST_CONTEXT_THREAD to SCST_CONTEXT_DIRECT_ATOMIC. > > I also changed it in srpt_handle_err_comp() >> Then I would run the problematic tests (heavy tpc-h workload, e.g.) on >> debug kernel and fix found problems. >> >> Anyway, Cameron, can you get the latest code from SCST trunk and try >> with it? It was recently updated. Also please add the case with >> changes from (3) above. > This is all with version 1.0.1 of SCST (v532). > In my fio test, I do runs with srpt thread=1 and then =0. When it was > set to zero during the test, I got many errors printed out by FIO, and > the target eventually crashed. This is the first part of a long call trace. > > NMI Watchdog detected LOCKUP on CPU 0 > CPU 0 > Modules linked in: ib_srpt(U) scst_vdisk(U) scst(U) fio_driver(PU) > fio_port(PU) autofs4 hidp rfcomm l2cap bluetooth sunrpc ib_ipoib mlx4_ib > ib_cm ib_sa ib_mad ib_core ipv6 xfrm_nalgo crypto_api nls_utf8 hfsplus > dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button battery > asus_acpi acpi_memhotplug ac parport_pc lp parport i2c_i801 shpchp > i2c_core e1000e mlx4_core i5000_edac edac_mc pcspkr ata_piix libata > sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd > Pid: 25732, comm: scsi_tgt0 Tainted: P 2.6.18-92.1.13.el5 #1 > RIP: 0010:[] [] > .text.lock.spinlock+0x29/0x30 > RSP: 0018:ffffffff80418a88 EFLAGS: 00000086 > RAX: ffff810785307fd8 RBX: ffffffff884e68a0 RCX: 0000000000000000 > RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffff884e68a0 > RBP: ffffffff884e62a0 R08: ffff810790926900 R09: ffff8107909268e8 > R10: 0000000000000018 R11: ffffffff884fcab3 R12: 0000000000000001 > R13: 0000000000000001 R14: 
0000000000000000 R15: ffff8107f0f374c0
> FS: 0000000000000000(0000) GS:ffffffff803a0000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 00000037bc0986d0 CR3: 0000000000201000 CR4: 00000000000006e0
> Process scsi_tgt0 (pid: 25732, threadinfo ffff810785306000, task
> ffff810810852100)
> Stack: 0000000000000000 ffffffff884c509d ffff8107909268e8 ffff810790926900
> 00000002071dd688 0000020000000220 0000000000000200 00000000da984c08
> 0000000000000000 ffff8107909267f0 ffff810806ceee20 0000000000000001
> Call Trace:
> [] :scst:sgv_pool_alloc+0x10c/0x5d3
> [] :scst:scst_alloc_space+0x5b/0x106
> [] :scst:scst_process_active_cmd+0x4fc/0x131c
> [] :scst:scst_cmd_init_done+0x17f/0x3ef
> [] :ib_srpt:srpt_handle_new_iu+0x281/0x4e7
> [] :mlx4_ib:mlx4_ib_free_srq_wqe+0x27/0x4f
> [] :mlx4_ib:get_sw_cqe+0x12/0x30
> [] :mlx4_ib:mlx4_ib_poll_cq+0x432/0x48f
> [] :ib_srpt:srpt_completion+0x190/0x250
> [] :mlx4_core:mlx4_eq_int+0x3b/0x26f
> [] :mlx4_core:mlx4_msi_x_interrupt+0xf/0x17

According to this trace, Vu was incorrect when he wrote that
srpt_handle_new_iu is called in tasklet context. It is at least sometimes
called from IRQ context.

Try with the attached patch. It's against the latest trunk.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: srpt_context.diff Type: text/x-patch Size: 837 bytes Desc: not available URL: From vst at vlnb.net Fri Oct 31 10:52:45 2008 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Fri, 31 Oct 2008 20:52:45 +0300 Subject: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <49022553.1020804@harr.org> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net> <49022553.1020804@harr.org> Message-ID: <490B45ED.3020203@vlnb.net> Cameron Harr wrote: > Vladislav Bolkhovitin wrote: >> Cameron Harr wrote: >>> Ok, I've done some testing with elevator=noop, with >>> scst_threads=[123] and srpt thread=[01]. I ran with both 4k blocks >>> and 512B blocks, random writes with 60s per test. Unfortunately, it >>> looks like I can't seem to reproduce the numbers I had before - I >>> believe the reporting mechanism I used earlier (script that uses >>> /proc/diskstats) gave me invalid results. This time I have calculated >>> iops straight from the FIO results. One interesting note is that in >>> almost every case srpt thread=1 gives better performance. >> Strange, indeed. >> >> Do you use the latest SVN trunk? > Almost - it was svn rev 532. >> Did you use the real drives or NULLIO? > Real drives >> What is your FIO script? 
> A variation on this: > fio/fio --rw=randwrite --bs=512 --size=20G --loops=10 > --name=randwrite_512_sdc --numjobs=64 --runtime=60 --direct=1 > --group_reporting --randrepeat=0 --softrandommap=1 --ioengine=libaio > --iodepth=16 --filename=/dev/sdb --filename=/dev/sdc AFAIK, libaio currently isn't the best from performance POV. Can you try with other ioengines, especially "sync"? Also, why did you choose other options, especially "iodepth"? To better interpret results I also need "vmstat 1" and "top d1" output during runs from all initiators and target. Top should show stats for all CPUs, not only the aggregate value, which it shows by default. >> How do you calculate IOPS rate? > I divide the sum (if more than 1) of the "ios=" from a particular test > by the runtime. I would rather use the "iops=" value, reported by fio. From vst at vlnb.net Fri Oct 31 10:53:17 2008 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Fri, 31 Oct 2008 20:53:17 +0300 Subject: ***SPAM*** Re: [ofa-general] SRP/mlx4 interrupts throttling performance In-Reply-To: <49026145.4050006@harr.org> References: <48E386F6.5040502@fusionio.com> <48E38BAF.5000801@harr.org> <48E6498A.3070002@mellanox.com> <48E65FE0.2060602@harr.org> <48E67ACC.1020903@harr.org> <48E695F9.80703@harr.org> <48E9E681.8090600@vlnb.net> <48EA2F42.80008@harr.org> <48EB8CBC.30303@harr.org> <48EB96C5.2060202@vlnb.net> <48EBA581.4040301@mellanox.com> <48EBA72B.4000909@harr.org> <48EBBDB1.1080203@harr.org> <48EBE6B6.4060804@mellanox.com> <48ECEA4D.7080504@harr.org> <48ED3489.4030905@harr.org> <48F79CF8.3010905@vlnb.net> <48FE6C84.7030300@harr.org> <48FEDA26.4080304@vlnb.net> <48FF2D1A.8000101@harr.org> <48FF5F42.2050902@vlnb.net> <48FF60D3.9020809@harr.org> <4901F14C.6000006@harr.org> <490210EE.2070000@vlnb.net> <49026145.4050006@harr.org> Message-ID: <490B460D.6000204@vlnb.net> Cameron Harr wrote: > > Vladislav Bolkhovitin wrote: >> Strange, indeed. >> >> Did you use the real drives or NULLIO? 
>> > Here are some results with NULLIO, but they seem to hang when srpt > thread is set to 0 (this got a few runs in). Note that to get things > even running when srpt thread=0, I had to put the ib_srpt code back to > it's original state. Unfortunately, it invalidates all the results. Can you try again with the patch I sent to you? With gathering of top and vmstat output. > type=randwrite bs=512 drives=1 scst_threads=1 srptthread=1 iops=113418.36 > type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 iops=107773.57 > type=randwrite bs=512 drives=2 scst_threads=1 srptthread=1 iops=147188.09 > type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 iops=170401.06 > type=randwrite bs=512 drives=3 scst_threads=1 srptthread=1 iops=194783.09 > type=randwrite bs=4k drives=3 scst_threads=1 srptthread=1 iops=112113.57 > type=randwrite bs=512 drives=1 scst_threads=2 srptthread=1 iops=88371.81 > type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 iops=86334.84 > type=randwrite bs=512 drives=2 scst_threads=2 srptthread=1 iops=177128.90 > type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 iops=105784.42 > type=randwrite bs=512 drives=3 scst_threads=2 srptthread=1 iops=125456.49 > type=randwrite bs=4k drives=3 scst_threads=2 srptthread=1 iops=93726.40 > type=randwrite bs=512 drives=1 scst_threads=3 srptthread=1 iops=137550.91 > type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 iops=90684.18 > type=randwrite bs=512 drives=2 scst_threads=3 srptthread=1 iops=182657.96 > type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 iops=95166.77 > type=randwrite bs=512 drives=3 scst_threads=3 srptthread=1 iops=184928.53 > type=randwrite bs=4k drives=3 scst_threads=3 srptthread=1 iops=84169.93 > type=randwrite bs=512 drives=1 scst_threads=1 srptthread=0 iops=139561.62 > type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 iops=100328.18 > type=randwrite bs=512 drives=2 scst_threads=1 srptthread=0 iops=206477.91 > type=randwrite bs=4k drives=2 
scst_threads=1 srptthread=0 iops=99723.22 > >>> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=0 >>> iops=51134.20 >>> type=randwrite bs=4k drives=1 scst_threads=1 srptthread=1 >>> iops=63461.86 >>> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=0 >>> iops=52383.10 >>> type=randwrite bs=4k drives=1 scst_threads=2 srptthread=1 >>> iops=54065.52 >>> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=0 >>> iops=48827.27 >>> type=randwrite bs=4k drives=1 scst_threads=3 srptthread=1 >>> iops=52703.82 >>> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=0 >>> iops=64619.11 >>> type=randwrite bs=4k drives=2 scst_threads=1 srptthread=1 >>> iops=62605.09 >>> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=0 >>> iops=67961.56 >>> type=randwrite bs=4k drives=2 scst_threads=2 srptthread=1 >>> iops=78884.72 >>> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=0 >>> iops=70340.04 >>> type=randwrite bs=4k drives=2 scst_threads=3 srptthread=1 >>> iops=76253.60 >>> type=randwrite bs=4k drives=3 scst_threads=1 srptthread=0 >>> iops=53777.02 >>> type=randwrite bs=4k drives=3 scst_threads=1 srptthread=1 >>> iops=64661.21 >>> type=randwrite bs=4k drives=3 scst_threads=2 srptthread=0 >>> iops=91073.05 >>> type=randwrite bs=4k drives=3 scst_threads=2 srptthread=1 >>> iops=90127.98 >>> type=randwrite bs=4k drives=3 scst_threads=3 srptthread=0 >>> iops=92012.13 >>> type=randwrite bs=4k drives=3 scst_threads=3 srptthread=1 >>> iops=96848.61 >>> type=randwrite bs=512 drives=1 scst_threads=1 srptthread=0 >>> iops=55040.20 >>> type=randwrite bs=512 drives=1 scst_threads=1 srptthread=1 >>> iops=62057.33 >>> type=randwrite bs=512 drives=1 scst_threads=2 srptthread=0 >>> iops=60237.05 >>> type=randwrite bs=512 drives=1 scst_threads=2 srptthread=1 >>> iops=63465.54 >>> type=randwrite bs=512 drives=1 scst_threads=3 srptthread=0 >>> iops=58716.01 >>> type=randwrite bs=512 drives=1 scst_threads=3 srptthread=1 >>> iops=60089.11 >>> type=randwrite bs=512 
drives=2 scst_threads=1 srptthread=0 >>> iops=64978.41 >>> type=randwrite bs=512 drives=2 scst_threads=1 srptthread=1 >>> iops=64018.47 >>> type=randwrite bs=512 drives=2 scst_threads=2 srptthread=0 >>> iops=78128.56 >>> type=randwrite bs=512 drives=2 scst_threads=2 srptthread=1 >>> iops=94561.47 >>> type=randwrite bs=512 drives=2 scst_threads=3 srptthread=0 >>> iops=82526.52 >>> type=randwrite bs=512 drives=2 scst_threads=3 srptthread=1 >>> iops=105874.51 >>> type=randwrite bs=512 drives=3 scst_threads=1 srptthread=0 >>> iops=56730.70 >>> type=randwrite bs=512 drives=3 scst_threads=1 srptthread=1 >>> iops=62147.04 >>> type=randwrite bs=512 drives=3 scst_threads=2 srptthread=0 >>> iops=87507.15 >>> type=randwrite bs=512 drives=3 scst_threads=2 srptthread=1 >>> iops=95781.40 >>> type=randwrite bs=512 drives=3 scst_threads=3 srptthread=0 >>> iops=91645.99 >>> type=randwrite bs=512 drives=3 scst_threads=3 srptthread=1 >>> iops=114164.39 >>> > > > From chien.tin.tung at intel.com Fri Oct 31 11:39:38 2008 From: chien.tin.tung at intel.com (Chien Tung) Date: Fri, 31 Oct 2008 13:39:38 -0500 Subject: [ofa-general] [PATCH 1/2 v2] RDMA/nes: Correct handling of PBL resources Message-ID: <20081031183938.GA7140@ctung-MOBL> From: Chien Tung RDMA/nes: Correct handling of PBL resources. * Roll back allocated structures on failures. * Use GFP_ATOMIC instead of GFP_KERNEL since we are holding a lock. * Acquire nesadapter->pbl_lock when modifying PBL counters. * Decrement PBL counters on deallocation. 
Signed-off-by: Chien Tung -- V2 change: change 2 BUG_ON() to WARN_ON() drivers/infiniband/hw/nes/nes_verbs.c | 44 ++++++++++++++++++++++++-------- 1 files changed, 33 insertions(+), 11 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c index 932e56f..ba6a40c 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.c +++ b/drivers/infiniband/hw/nes/nes_verbs.c @@ -349,7 +349,7 @@ static struct ib_fmr *nes_alloc_fmr(struct ib_pd *ibpd, if (nesfmr->nesmr.pbls_used > nesadapter->free_4kpbl) { spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); ret = -ENOMEM; - goto failed_vpbl_alloc; + goto failed_vpbl_avail; } else { nesadapter->free_4kpbl -= nesfmr->nesmr.pbls_used; } @@ -357,7 +357,7 @@ static struct ib_fmr *nes_alloc_fmr(struct ib_pd *ibpd, if (nesfmr->nesmr.pbls_used > nesadapter->free_256pbl) { spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); ret = -ENOMEM; - goto failed_vpbl_alloc; + goto failed_vpbl_avail; } else { nesadapter->free_256pbl -= nesfmr->nesmr.pbls_used; } @@ -391,14 +391,14 @@ static struct ib_fmr *nes_alloc_fmr(struct ib_pd *ibpd, goto failed_vpbl_alloc; } - nesfmr->root_vpbl.leaf_vpbl = kzalloc(sizeof(*nesfmr->root_vpbl.leaf_vpbl)*1024, GFP_KERNEL); + nesfmr->leaf_pbl_cnt = nesfmr->nesmr.pbls_used-1; + nesfmr->root_vpbl.leaf_vpbl = kzalloc(sizeof(*nesfmr->root_vpbl.leaf_vpbl)*1024, GFP_ATOMIC); if (!nesfmr->root_vpbl.leaf_vpbl) { spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); ret = -ENOMEM; goto failed_leaf_vpbl_alloc; } - nesfmr->leaf_pbl_cnt = nesfmr->nesmr.pbls_used-1; nes_debug(NES_DBG_MR, "two level pbl, root_vpbl.pbl_vbase=%p" " leaf_pbl_cnt=%d root_vpbl.leaf_vpbl=%p\n", nesfmr->root_vpbl.pbl_vbase, nesfmr->leaf_pbl_cnt, nesfmr->root_vpbl.leaf_vpbl); @@ -519,6 +519,16 @@ static struct ib_fmr *nes_alloc_fmr(struct ib_pd *ibpd, nesfmr->root_vpbl.pbl_pbase); failed_vpbl_alloc: + if (nesfmr->nesmr.pbls_used != 0) { + spin_lock_irqsave(&nesadapter->pbl_lock, flags); + if 
(nesfmr->nesmr.pbl_4k) + nesadapter->free_4kpbl += nesfmr->nesmr.pbls_used; + else + nesadapter->free_256pbl += nesfmr->nesmr.pbls_used; + spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); + } + + failed_vpbl_avail: kfree(nesfmr); failed_fmr_alloc: @@ -534,18 +544,14 @@ static struct ib_fmr *nes_alloc_fmr(struct ib_pd *ibpd, */ static int nes_dealloc_fmr(struct ib_fmr *ibfmr) { + unsigned long flags; struct nes_mr *nesmr = to_nesmr_from_ibfmr(ibfmr); struct nes_fmr *nesfmr = to_nesfmr(nesmr); struct nes_vnic *nesvnic = to_nesvnic(ibfmr->device); struct nes_device *nesdev = nesvnic->nesdev; - struct nes_mr temp_nesmr = *nesmr; + struct nes_adapter *nesadapter = nesdev->nesadapter; int i = 0; - temp_nesmr.ibmw.device = ibfmr->device; - temp_nesmr.ibmw.pd = ibfmr->pd; - temp_nesmr.ibmw.rkey = ibfmr->rkey; - temp_nesmr.ibmw.uobject = NULL; - /* free the resources */ if (nesfmr->leaf_pbl_cnt == 0) { /* single PBL case */ @@ -561,8 +567,24 @@ static int nes_dealloc_fmr(struct ib_fmr *ibfmr) pci_free_consistent(nesdev->pcidev, 8192, nesfmr->root_vpbl.pbl_vbase, nesfmr->root_vpbl.pbl_pbase); } + nesmr->ibmw.device = ibfmr->device; + nesmr->ibmw.pd = ibfmr->pd; + nesmr->ibmw.rkey = ibfmr->rkey; + nesmr->ibmw.uobject = NULL; + + if (nesfmr->nesmr.pbls_used != 0) { + spin_lock_irqsave(&nesadapter->pbl_lock, flags); + if (nesfmr->nesmr.pbl_4k) { + nesadapter->free_4kpbl += nesfmr->nesmr.pbls_used; + WARN_ON(nesadapter->free_4kpbl > nesadapter->max_4kpbl); + } else { + nesadapter->free_256pbl += nesfmr->nesmr.pbls_used; + WARN_ON(nesadapter->free_256pbl > nesadapter->max_256pbl); + } + spin_unlock_irqrestore(&nesadapter->pbl_lock, flags); + } - return nes_dealloc_mw(&temp_nesmr.ibmw); + return nes_dealloc_mw(&nesmr->ibmw); } From chien.tin.tung at intel.com Fri Oct 31 11:39:41 2008 From: chien.tin.tung at intel.com (Chien Tung) Date: Fri, 31 Oct 2008 13:39:41 -0500 Subject: [ofa-general] [PATCH 2/2 v2] RDMA/nes: Change CQ allocation scheme for performance applications 
Message-ID: <20081031183941.GA2608@ctung-MOBL>

From: Vadim Makhervaks

RDMA/nes: New CQ allocation scheme for performance applications.

Fix CQ allocation for Multi-Cast Receive Queue applications.
Before this patch, CQ was not lined up with the right NIC.

Signed-off-by: Vadim Makhervaks
Signed-off-by: Chien Tung
--
V2 change: better(?) description for this patch...

 drivers/infiniband/hw/nes/nes_verbs.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index f9b37b3..51cb1b5 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -1617,7 +1617,7 @@ static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries,
 	nes_ucontext->mcrqf = req.mcrqf;
 	if (nes_ucontext->mcrqf) {
 		if (nes_ucontext->mcrqf & 0x80000000)
-			nescq->hw_cq.cq_number = nesvnic->nic.qp_id + 12 + (nes_ucontext->mcrqf & 0xf) - 1;
+			nescq->hw_cq.cq_number = nesvnic->nic.qp_id + 28 + 2*((nes_ucontext->mcrqf & 0xf) - 1);
 		else if (nes_ucontext->mcrqf & 0x40000000)
 			nescq->hw_cq.cq_number = nes_ucontext->mcrqf & 0xffff;
 		else

From chien.tin.tung at intel.com Fri Oct 31 11:39:43 2008
From: chien.tin.tung at intel.com (Chien Tung)
Date: Fri, 31 Oct 2008 13:39:43 -0500
Subject: [ofa-general] [PATCH v3] RDMA/nes: Mitigate compatibility issue regarding PCI write credits
Message-ID: <20081031183943.GA7376@ctung-MOBL>

From: Chien Tung

RDMA/nes: Mitigate compatibility issue regarding PCI write credits.

Under heavy load, there is a compatibility issue regarding PCI write
credits with certain chipsets. It can be mitigated by limiting read
requests to 256 Bytes. This workaround is always enabled for Tbird2 on
Gladius. Add a driver parameter to enable the workaround for non-Gladius
cards.

Signed-off-by: Chien Tung
--
V3 change:
* change limit_maxrdreqsz from int to bool
* change formatting to line up (
* use standard pcie interface to get/modify readrq value.
drivers/infiniband/hw/nes/nes.c | 16 ++++++++++++++++ drivers/infiniband/hw/nes/nes_hw.h | 1 + 2 files changed, 17 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes.c b/drivers/infiniband/hw/nes/nes.c index a2b04d6..1a49781 100644 --- a/drivers/infiniband/hw/nes/nes.c +++ b/drivers/infiniband/hw/nes/nes.c @@ -95,6 +95,10 @@ unsigned int wqm_quanta = 0x10000; module_param(wqm_quanta, int, 0644); MODULE_PARM_DESC(wqm_quanta, "WQM quanta"); +static unsigned int limit_maxrdreqsz; +module_param(limit_maxrdreqsz, bool, 0644); +MODULE_PARM_DESC(limit_maxrdreqsz, "Limit max read request size to 256 Bytes"); + LIST_HEAD(nes_adapter_list); static LIST_HEAD(nes_dev_list); @@ -588,6 +592,18 @@ static int __devinit nes_probe(struct pci_dev *pcidev, const struct pci_device_i nesdev->nesadapter->port_count; } + if ((limit_maxrdreqsz || + ((nesdev->nesadapter->phy_type[0] == NES_PHY_TYPE_GLADIUS) && + (hw_rev == NE020_REV1))) && + (pcie_get_readrq(pcidev) > 256)) { + if (pcie_set_readrq(pcidev, 256)) + printk(KERN_ERR PFX "Unable to set max read request" + " to 256 bytes\n"); + else + nes_debug(NES_DBG_INIT, "Max read request size set" + " to 256 bytes\n"); + } + tasklet_init(&nesdev->dpc_tasklet, nes_dpc, (unsigned long)nesdev); /* bring up the Control QP */ diff --git a/drivers/infiniband/hw/nes/nes_hw.h b/drivers/infiniband/hw/nes/nes_hw.h index 610b9d8..bc0b4de 100644 --- a/drivers/infiniband/hw/nes/nes_hw.h +++ b/drivers/infiniband/hw/nes/nes_hw.h @@ -40,6 +40,7 @@ #define NES_PHY_TYPE_ARGUS 4 #define NES_PHY_TYPE_PUMA_1G 5 #define NES_PHY_TYPE_PUMA_10G 6 +#define NES_PHY_TYPE_GLADIUS 7 #define NES_MULTICAST_PF_MAX 8
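
The condition added to nes_probe() in the hunk above can be read as a small predicate. What follows is a standalone user-space sketch of that decision, not driver code. NES_PHY_TYPE_GLADIUS (7) and the 256-byte limit are taken from the patch; the names should_limit_readrq(), readrq_self_check() and READRQ_LIMIT are made up for illustration, and NE020_REV1_PLACEHOLDER is an assumed stand-in because the patch does not show the real NE020_REV1 definition. In the driver the current value comes from pcie_get_readrq() and the new one is applied with pcie_set_readrq().

```c
#include <assert.h>
#include <stdbool.h>

/* From the patch: new PHY type added to nes_hw.h. */
#define NES_PHY_TYPE_GLADIUS 7

/* Hypothetical stand-in: the real NE020_REV1 value is not shown in the patch. */
#define NE020_REV1_PLACEHOLDER 1

/* From the patch: read requests are capped at 256 bytes. */
#define READRQ_LIMIT 256

/*
 * Returns true when the max read request size should be lowered to
 * READRQ_LIMIT: either the module parameter asks for it, or the card is
 * a Gladius PHY on the affected hardware revision, and in both cases
 * only if the current setting is above the limit.
 */
bool should_limit_readrq(bool limit_maxrdreqsz, int phy_type,
			 int hw_rev, int current_readrq)
{
	bool forced = (phy_type == NES_PHY_TYPE_GLADIUS) &&
		      (hw_rev == NE020_REV1_PLACEHOLDER);

	return (limit_maxrdreqsz || forced) &&
	       (current_readrq > READRQ_LIMIT);
}

/* A few cases that exercise the predicate. */
void readrq_self_check(void)
{
	/* Module parameter set, current value above the limit. */
	assert(should_limit_readrq(true, 0, 0, 512));
	/* Gladius on the affected revision, parameter not set. */
	assert(should_limit_readrq(false, NES_PHY_TYPE_GLADIUS,
				   NE020_REV1_PLACEHOLDER, 4096));
	/* Neither trigger applies. */
	assert(!should_limit_readrq(false, 0, 0, 512));
	/* Already at the limit, nothing to change. */
	assert(!should_limit_readrq(true, 0, 0, 256));
}
```

Read this way, the module parameter only widens the set of cards that get the cap; a device already at or below 256 bytes is left alone in either case.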