From ogerlitz at Voltaire.com Wed Apr 1 00:32:20 2009
From: ogerlitz at Voltaire.com (Or Gerlitz)
Date: Wed, 01 Apr 2009 10:32:20 +0300
Subject: [ofa-general] [PATCH] IB/mlx4: Use pgprot_writecombine() for BlueFlame pages
In-Reply-To:
References: <15ddcffd0903291006g4b7549cfj1879dd67518f8bff@mail.gmail.com> <200903301117.32355.jackm@dev.mellanox.co.il> <49D1FD74.9040205@Voltaire.com>
Message-ID: <49D31884.5020609@Voltaire.com>

Roland Dreier wrote:
> So the OFED WC patch is work 2.3usec?? Or did you mean 1.2usec vs. 1.5usec?

See the detailed results below. I used ib_rdma_lat -n 50000 and ignored the maximal result, as it varies a lot between runs over the same configuration. I think we can do well without taking it into account, as the best and typical results are very close under all configs.

I don't know if it's good or bad, but the news is that I can't reproduce the 3.5us result any more. Without any patches, with/without CONFIG_X86_PAT gives the same results, and both your patch and ofed's take the latency down by about 20%.

I'll send you my .config, where the difference between the 1st and the 2nd, 3rd and 4th runs is CONFIG_X86_PAT. The setup is made of 2 super-micro nodes, pcie gen2, connectx ddr fw 2.6 and Linux 2.6.29 ...

> In any case it seems my patch is not working on your system. Do you
> have CONFIG_X86_PAT set in your kernel .config?

2.6.29 without CONFIG_X86_PAT
Latency typical: 1.43087 usec
Latency best   : 1.41287 usec

2.6.29 + CONFIG_X86_PAT
Latency typical: 1.43245 usec
Latency best   : 1.41295 usec

2.6.29 + CONFIG_X86_PAT + your patch
Latency typical: 1.14896 usec
Latency best   : 1.12646 usec

2.6.29 + CONFIG_X86_PAT + ofed's patch
Latency typical: 1.14746 usec
Latency best   : 1.12646 usec

From vlad at dev.mellanox.co.il Wed Apr 1 00:35:55 2009
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Wed, 01 Apr 2009 10:35:55 +0300
Subject: [ofa-general] Re: [ANNOUNCE] dapl-1.2.14 and dapl-2.0.17 release
In-Reply-To:
References:
Message-ID: <49D3195B.6030000@dev.mellanox.co.il>

Davis, Arlin R wrote:
>
> New release for dapl 1.2 and 2.0 available on the OFA download page and in my git tree.
>
> md5sum: f58d6dd903cee271d71b0eb6fa33984e compat-dapl-1.2.14.tar.gz
> md5sum: 617ebf54456b5559ea49e787515a8fa8 dapl-2.0.17.tar.gz
>
> Summary of changes:
>
> v1,v2 - Fix SuSE 11 build issue, asm/atomic.h no longer exists (Bug #1564)
>
> Vlad, please pull both packages into OFED 1.4.1 RC3 and install the following:
>
> compat-dapl-1.2.14-1
> compat-dapl-devel-1.2.14-1
> dapl-2.0.17-1
> dapl-utils-2.0.17-1
> dapl-devel-2.0.17-1
> dapl-debuginfo-2.0.17-1
>
> See http://www.openfabrics.org/downloads/dapl/ more details.
>
> -arlin
>

Done,

Regards,
Vladimir

From ogerlitz at Voltaire.com Wed Apr 1 00:44:15 2009
From: ogerlitz at Voltaire.com (Or Gerlitz)
Date: Wed, 01 Apr 2009 10:44:15 +0300
Subject: [ofa-general] [PATCH] IB/mlx4: Use pgprot_writecombine() for BlueFlame pages
In-Reply-To: <49D31884.5020609@Voltaire.com>
References: <15ddcffd0903291006g4b7549cfj1879dd67518f8bff@mail.gmail.com> <200903301117.32355.jackm@dev.mellanox.co.il> <49D1FD74.9040205@Voltaire.com> <49D31884.5020609@Voltaire.com>
Message-ID: <49D31B4F.3000700@Voltaire.com>

Or Gerlitz wrote:
> with/without CONFIG_X86_PAT gives the same results...

I noted that I have CONFIG_MTRR=y so maybe this can explain the nice latency even without setting X86_PAT?

Or.

From vlad at lists.openfabrics.org Wed Apr 1 03:21:56 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Wed, 1 Apr 2009 03:21:56 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090401-0200 daily build status
Message-ID: <20090401102156.CE43FE61135@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.27
Passed on i686 with linux-2.6.26
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:

From ogerlitz at voltaire.com Wed Apr 1 04:19:53 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 01 Apr 2009 14:19:53 +0300
Subject: [ofa-general] Re: [PATCH] rdma_cm: Add proc entry to monitor rdma_cm connections
In-Reply-To:
References: <49D2417D.9000608@Voltaire.COM> <49D24C4C.2060203@Voltaire.COM> <20090331174618.GB32482@obsidianresearch.com> <49D269BD.4050909@opengridcomputing.com>
Message-ID: <49D34DD9.5060208@voltaire.com>

Roland Dreier wrote:
> Having a sysfs file per connection is huge overhead, both in terms of the number of syscalls to read it, and also the pinned kernel memory just for the dentries etc. I don't think it's a great idea.
>

Yes, I also think that the sysfs convention of one file per each piece of data is inappropriate for this case

Or.

From monis at Voltaire.COM Wed Apr 1 06:17:36 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Wed, 01 Apr 2009 16:17:36 +0300
Subject: [ofa-general] Re: [PATCH] rdma_cm: Add proc entry to monitor rdma_cm connections
In-Reply-To: <20090331174618.GB32482@obsidianresearch.com>
References: <49D2417D.9000608@Voltaire.COM> <49D24C4C.2060203@Voltaire.COM> <20090331174618.GB32482@obsidianresearch.com>
Message-ID: <49D36970.8070300@Voltaire.COM>

> BTW, Moni, while you are looking at this it would be really nice to
> have a proc//qp directory or file so that lists of QPs associated
> with processes can be produced by lsof.
>

OK, I plan to enhance monitoring capabilities of what's under drivers/infiniband so I'll keep that in mind.

Thanks

From monis at Voltaire.COM Wed Apr 1 06:28:49 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Wed, 01 Apr 2009 16:28:49 +0300
Subject: [ofa-general] [PATCH] rdma_cm: Add proc entry to monitor rdma_cm connections
In-Reply-To: <49D2417D.9000608@Voltaire.COM>
References: <49D2417D.9000608@Voltaire.COM>
Message-ID: <49D36C11.8030404@Voltaire.COM>

I'm trying to summarize the comments here:

/proc is only for processes
/proc/net is deprecated, but if the netdev community is OK with it then no one will object
/sys/class/infiniband is too expensive (at least if the data is separated into many small files)
a tcp_diag-like helper module requires a user application to read the data (?)

Please correct me if I'm wrong, but I think I'll start with a question to netdev.

From monis at Voltaire.COM Wed Apr 1 06:37:59 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Wed, 01 Apr 2009 16:37:59 +0300
Subject: [ofa-general] Where is the right place to add monitoring data
Message-ID: <49D36E37.6060308@Voltaire.COM>

Hi,

I want to add data monitoring capabilities to rdma_cm (a socket-like interface that does connection management for the InfiniBand and iWarp protocols).

My question is: will it be acceptable to create a new entry, /proc/net/rdma_cm, or is the /proc file system really closed to new members?

I get opinions from here and there and I'd like to get yours.

thanks

MoniS

From caitlin.bestler at gmail.com Wed Apr 1 08:52:10 2009
From: caitlin.bestler at gmail.com (Caitlin Bestler)
Date: Wed, 1 Apr 2009 08:52:10 -0700
Subject: ***SPAM*** Re: [ofa-general] uDAPL DTO completion question.
In-Reply-To: <49D30C7F.1050201@cs.anu.edu.au>
References: <49D2BD00.5010002@cs.anu.edu.au> <469958e00903312040j7700d2ccr9104996c2fc29cd4@mail.gmail.com> <517c62fb0903312253w6344d62j1b8c072354b15ad2@mail.gmail.com> <49D30C7F.1050201@cs.anu.edu.au>
Message-ID: <469958e00904010852ydaf5b07wccdde27dd02ca724@mail.gmail.com>

On Tue, Mar 31, 2009 at 11:41 PM, Jie Cai wrote:
> Understood now. A further question is here again.
>
> To implement software level acknowledgment to inform initiator that data
> has been available for remoter, is that possible to use a busy loop at
> remote
> side to detect the last element of transferring has appear in the memory.
>
> Or remoter has to wait for the event of recv matching initiator's send, then
> send a message back to initiator as a acknowledgment?
>

There are two issues when spinning on a remote memory update.

The first is that packets may be received and processed out of order,
especially for iWARP. Therefore the fact that the last byte has been
received and placed does not guarantee that the prior packets have
been received and placed.

More importantly, the order in which updates become visible to a
specific software thread can make the order of updates unpredictable
to the application.

When delivering a completion the Provider is responsible for dealing
with both of these problems. So when you reap a completion from the
CQ, the operation it represents (and all prior operations) are complete.
There are no gaps in received packets, nothing is still sitting in an
Adapter buffer waiting to be placed in host memory.

If your application does not want to block you can consider polling
the CQ rather than enabling notifications. But polling memory locations
directly should only be done when you're willing to have bus/adapter
specific dependencies. Your working code might stop working when
your network changes, or you install a new Adapter that has a different
strategy for optimizing its writes over the PCIe bus.

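The first issue above can be illustrated with a toy model. This is a minimal Python sketch, not uDAPL code: the segment layout, the placement order and the names place_segments and simulate are assumptions made up for illustration, and the second issue (the order in which updates become visible to a thread) is not modelled at all. It compares what a busy loop on the last byte would observe with what a completion guarantees.

```python
def place_segments(buffer, segments, order):
    """Place segments into the buffer in the given arrival order.

    segments is a list of (offset, data) pairs; order lists segment indices
    in the order a hypothetical adapter places them. Yields after each one.
    """
    for idx in order:
        offset, data = segments[idx]
        buffer[offset:offset + len(data)] = data
        yield idx


def simulate(payload, seg_size, order):
    """Return (step at which the last byte matches, step at which all data is placed)."""
    segments = [(off, payload[off:off + seg_size])
                for off in range(0, len(payload), seg_size)]
    buffer = bytearray(len(payload))      # receiver memory, zero-filled
    placed = set()
    last_byte_step = None
    complete_step = None
    for step, idx in enumerate(place_segments(buffer, segments, order), 1):
        placed.add(idx)
        # What a busy loop on the final byte would observe.
        if last_byte_step is None and buffer[-1] == payload[-1]:
            last_byte_step = step
        # What a completion stands for: every segment has been placed.
        if complete_step is None and len(placed) == len(segments):
            complete_step = step
    return last_byte_step, complete_step


payload = bytes(range(1, 13))             # 12 non-zero bytes, 3 segments of 4
print(simulate(payload, 4, order=[2, 0, 1]))   # (1, 3): final segment lands first
print(simulate(payload, 4, order=[0, 1, 2]))   # (3, 3): in-order placement
```

With the final segment placed first, the last byte matches at step 1 while the buffer is only complete at step 3, so a receiver acting on the last byte would read two segments of zeros.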
From rdreier at cisco.com Wed Apr 1 10:35:58 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 01 Apr 2009 10:35:58 -0700
Subject: [ofa-general] [PATCH] IB/mlx4: Use pgprot_writecombine() for BlueFlame pages
In-Reply-To: <49D31B4F.3000700@Voltaire.com> (Or Gerlitz's message of "Wed, 01 Apr 2009 10:44:15 +0300")
References: <15ddcffd0903291006g4b7549cfj1879dd67518f8bff@mail.gmail.com> <200903301117.32355.jackm@dev.mellanox.co.il> <49D1FD74.9040205@Voltaire.com> <49D31884.5020609@Voltaire.com> <49D31B4F.3000700@Voltaire.com>
Message-ID:

> > with/without CONFIG_X86_PAT gives the same results...

> I noted that I have CONFIG_MTRR=y so maybe this can explain the nice latency
> even without setting X86_PAT?

I don't understand: your numbers below don't seem to show good results
without X86_PAT? In any case I don't see how you can get the lowest
latency (from WC mapping to userspace) without X86_PAT, since
pgprot_writecombine() will fall back to pgprot_noncached() if X86_PAT=n.

Anyway, thanks a lot for testing, I'll go ahead and include the patch in
my next pull request.

> 2.6.29 without CONFIG_X86_PAT
> Latency typical: 1.43087 usec
> Latency best   : 1.41287 usec
>
> 2.6.29 + CONFIG_X86_PAT
> Latency typical: 1.43245 usec
> Latency best   : 1.41295 usec
>
> 2.6.29 + CONFIG_X86_PAT + your patch
> Latency typical: 1.14896 usec
> Latency best   : 1.12646 usec
>
> 2.6.29 + CONFIG_X86_PAT + ofed's patch
> Latency typical: 1.14746 usec
> Latency best   : 1.12646 usec

From rdreier at cisco.com Wed Apr 1 10:44:51 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 01 Apr 2009 10:44:51 -0700
Subject: [ofa-general] Re: Dereferencing freed memory bugs
In-Reply-To: <49CE7688.2020501@gmail.com> (Marcin Slusarz's message of "Sat, 28 Mar 2009 20:12:08 +0100")
References: <49CE7688.2020501@gmail.com>
Message-ID:

Thanks for forwarding this...

> I added a check to smatch (http://repo.or.cz/w/smatch.git/) to check
> for when we dereference
> freed memory.
> drivers/infiniband/hw/nes/nes_cm.c +563 nes_cm_timer_tick(121) 'cm_node' > drivers/infiniband/hw/nes/nes_cm.c +621 nes_cm_timer_tick(179) 'cm_node' This seems to be against an older tree -- the code in that file has been reorganized. Would it be possible to rerun this check (which sounds very useful) against the latest version of drivers/infiniband/hw/nes for 2.6.29-git? Thanks, Roland From chien.tin.tung at intel.com Wed Apr 1 11:05:38 2009 From: chien.tin.tung at intel.com (Tung, Chien Tin) Date: Wed, 1 Apr 2009 11:05:38 -0700 Subject: [ofa-general] Re: Dereferencing freed memory bugs In-Reply-To: References: <49CE7688.2020501@gmail.com> Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA38303363875A3@azsmsx501.amr.corp.intel.com> > > I added a check to smatch (http://repo.or.cz/w/smatch.git/) to check > > for when we dereference > > freed memory. > > > drivers/infiniband/hw/nes/nes_cm.c +563 >nes_cm_timer_tick(121) 'cm_node' > > drivers/infiniband/hw/nes/nes_cm.c +621 >nes_cm_timer_tick(179) 'cm_node' > >This seems to be against an older tree -- the code in that >file has been >reorganized. Would it be possible to rerun this check (which sounds >very useful) against the latest version of >drivers/infiniband/hw/nes for >2.6.29-git? I believe Dan did run his tool against 2.6.29 source. We are looking into the two warnings. Current thinking is we did introduce a couple of errors with the recent CM changes. 
Chien

--
Chien Tung | chien.tin.tung at intel.com

From rdreier at cisco.com Wed Apr 1 11:46:14 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 01 Apr 2009 11:46:14 -0700
Subject: [ofa-general] Re: Dereferencing freed memory bugs
In-Reply-To: <60BEFF3FBD4C6047B0F13F205CAFA38303363875A3@azsmsx501.amr.corp.intel.com> (Chien Tin Tung's message of "Wed, 1 Apr 2009 11:05:38 -0700")
References: <49CE7688.2020501@gmail.com> <60BEFF3FBD4C6047B0F13F205CAFA38303363875A3@azsmsx501.amr.corp.intel.com>
Message-ID:

> > drivers/infiniband/hw/nes/nes_cm.c +621 nes_cm_timer_tick(179) 'cm_node'

> I believe Dan did run his tool against 2.6.29 source. We are
> looking into the two warnings. Current thinking is we did
> introduce a couple of errors with the recent CM changes.

Hmm, maybe I'm not reading the results correctly -- for example, in the
latest git tree, line 621 of nes_cm.c is:

            nes_debug(NES_DBG_CM, "Retransmitting send_entry %p "
                      "for node %p, jiffies = %lu, time to send = "
                      "%lu, retranscount = %u, send_entry->seq_num = "
                      "0x%08X, cm_node->tcp_cntxt.rem_ack_num = "
                      "0x%08X\n", send_entry, cm_node, jiffies,
                      send_entry->timetosend,
 = 621 =>             send_entry->retranscount,
                      send_entry->seq_num,
                      cm_node->tcp_cntxt.rem_ack_num);

or is 621 not the line number?

 - R.

From chien.tin.tung at intel.com Wed Apr 1 12:24:57 2009
From: chien.tin.tung at intel.com (Tung, Chien Tin)
Date: Wed, 1 Apr 2009 12:24:57 -0700
Subject: [ofa-general] Re: Dereferencing freed memory bugs
In-Reply-To:
References: <49CE7688.2020501@gmail.com> <60BEFF3FBD4C6047B0F13F205CAFA38303363875A3@azsmsx501.amr.corp.intel.com>
Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA38303363877F7@azsmsx501.amr.corp.intel.com>

>Hmm, maybe I'm not reading the results correctly -- for example, in the
>latest git tree, line 621 of nes_cm.c is:
>
>            nes_debug(NES_DBG_CM, "Retransmitting send_entry %p "
>                      "for node %p, jiffies = %lu, time to send = "
>                      "%lu, retranscount = %u, send_entry->seq_num = "
>                      "0x%08X, cm_node->tcp_cntxt.rem_ack_num = "
>                      "0x%08X\n", send_entry, cm_node, jiffies,
>                      send_entry->timetosend,
> = 621 =>             send_entry->retranscount,
>                      send_entry->seq_num,
>                      cm_node->tcp_cntxt.rem_ack_num);
>
>or is 621 not the line number?
>
> - R.

This is from the linux-2.6.29 tar file, nes_cm.c:

                if (last_state == NES_CM_STATE_SYN_RCVD)
                        rem_ref_cm_node(cm_core, cm_node);
                else
                        create_event(cm_node, NES_CM_EVENT_ABORTED);
563 ==>         spin_lock_irqsave(&cm_node->retrans_list_lock, flags);

[...]

                } else {
                        int close_when_complete;
                        close_when_complete = send_entry->close_when_complete;
                        nes_debug(NES_DBG_CM, "cm_node=%p state=%d\n",
                                  cm_node, cm_node->state);
                        free_retrans_entry(cm_node);
                        if (close_when_complete)
                                rem_ref_cm_node(cm_node->cm_core, cm_node);
                }
        } while (0);

621 ==> spin_unlock_irqrestore(&cm_node->retrans_list_lock, flags);
        rem_ref_cm_node(cm_node->cm_core, cm_node);
        if (ret != NETDEV_TX_OK) {

The warning probably comes from the rem_ref_cm_node() call, where a cm_node
gets freed if its reference count reaches 0. At the top of the function is a
loop where a cm_node with TX or RX gets its ref count incremented and is
placed on a list. The rest of the function only processes cm_nodes off that
list. Theoretically, a cm_node shouldn't get freed before 622.

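The argument above can be sketched as a toy reference-counting model. This is Python, not the nes driver: Node, UseAfterFree and timer_tick are hypothetical names, freeing is modelled by a flag, and the sketch assumes that every reference dropped inside the loop was taken somewhere else. It only shows why pinning each node before processing keeps a later drop from freeing it in the middle of the loop.

```python
class UseAfterFree(Exception):
    """Raised when the model touches a node after its last reference is gone."""


class Node:
    """Hypothetical stand-in for a reference-counted object such as a cm_node."""

    def __init__(self, name):
        self.name = name
        self.refcount = 1      # reference held by the owner of the connection
        self.freed = False

    def touch(self):
        # Any access to the node, for example taking a lock embedded in it.
        if self.freed:
            raise UseAfterFree(self.name)

    def get(self):
        self.touch()
        self.refcount += 1

    def put(self):
        self.touch()
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True  # models the free on the last put


def timer_tick(nodes, drops_ref, pin=True):
    """Process nodes; drops_ref(node) says whether processing drops a reference."""
    if pin:
        for node in nodes:
            node.get()         # pin every node before any processing starts
    for node in nodes:
        if drops_ref(node):
            node.put()         # a reference dropped in the middle of processing
        node.touch()           # later access, like the lock/unlock in the excerpt
        if pin:
            node.put()         # final put; only now may the node be freed


a, b = Node("a"), Node("b")
timer_tick([a, b], drops_ref=lambda n: n is a)
print(a.freed, b.freed)        # True False

c = Node("c")
try:
    timer_tick([c], drops_ref=lambda n: True, pin=False)
except UseAfterFree as exc:
    print("use after free on", exc)   # use after free on c
```

If a code path drops more references than it holds, the pin no longer protects the later accesses, which is the kind of imbalance a static checker cannot rule out from the call sites alone.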
Chien From rdreier at cisco.com Wed Apr 1 12:32:00 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 01 Apr 2009 12:32:00 -0700 Subject: [ofa-general] Re: Dereferencing freed memory bugs In-Reply-To: <60BEFF3FBD4C6047B0F13F205CAFA38303363877F7@azsmsx501.amr.corp.intel.com> (Chien Tin Tung's message of "Wed, 1 Apr 2009 12:24:57 -0700") References: <49CE7688.2020501@gmail.com> <60BEFF3FBD4C6047B0F13F205CAFA38303363875A3@azsmsx501.amr.corp.intel.com> <60BEFF3FBD4C6047B0F13F205CAFA38303363877F7@azsmsx501.amr.corp.intel.com> Message-ID: > This is from linux-2.6.29 tar file, nes_cm.c: OK... but the code has changed since 2.6.29. I guess it would be useful to make sure that the current code is still OK. - R. From chien.tin.tung at intel.com Wed Apr 1 12:37:40 2009 From: chien.tin.tung at intel.com (Tung, Chien Tin) Date: Wed, 1 Apr 2009 12:37:40 -0700 Subject: [ofa-general] Re: Dereferencing freed memory bugs In-Reply-To: References: <49CE7688.2020501@gmail.com> <60BEFF3FBD4C6047B0F13F205CAFA38303363875A3@azsmsx501.amr.corp.intel.com> <60BEFF3FBD4C6047B0F13F205CAFA38303363877F7@azsmsx501.amr.corp.intel.com> Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA3830336387844@azsmsx501.amr.corp.intel.com> >OK... but the code has changed since 2.6.29. I guess it would >be useful >to make sure that the current code is still OK. Ok, fingers crossed. 
:-)

Chien

From rdreier at cisco.com Wed Apr 1 13:56:56 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 01 Apr 2009 13:56:56 -0700
Subject: [ofa-general] Re: [PATCH] rdma_cm: Use rate from ipoib broadcast when joining ipoib multicast
In-Reply-To: <49D1141A.40700@Voltaire.COM> (Yossi Etigin's message of "Mon, 30 Mar 2009 21:48:58 +0300")
References: <49D1141A.40700@Voltaire.COM>
Message-ID:

thanks, applied

I know I said I was going to be super-strict about patch cleanliness,
but I did fix the following thing up by hand:

> --

if you just use two '-'s then git doesn't strip it automatically and the
"--" gets put in the kernel changelog (where it doesn't belong). Use
"---" (*three* '-'s) to separate the changelog from the patch and
anything else you don't want in the git changelog.

 - R.

From rdreier at cisco.com Wed Apr 1 13:59:35 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 01 Apr 2009 13:59:35 -0700
Subject: [ofa-general] [PATCH v2] rdma_cm: create cm id even when port is down
In-Reply-To: <49D10673.3000006@Voltaire.COM> (Yossi Etigin's message of "Mon, 30 Mar 2009 20:50:43 +0300")
References: <49D0E1A4.3070804@Voltaire.COM> <49D10673.3000006@Voltaire.COM>
Message-ID:

patch is corrupted -- not sure how you sent it, but eg:

Index: b/drivers/infiniband/core/cma.c
===================================================================
--- a/drivers/infiniband/core/cma.c 2009-03-30 18:27:36.000000000 +0300
+++ b/drivers/infiniband/core/cma.c 2009-03-30 19:01:30.000000000 +0300
@@ -297,21 +297,25 @@ static void cma_detach_from_dev(struct r
id_priv->cma_dev = NULL;
}

the lines of context are not starting with spaces...

Please resend a non-corrupted patch.

 - R.

From arkady.kanevsky at gmail.com Wed Apr 1 14:06:18 2009
From: arkady.kanevsky at gmail.com (arkady kanevsky)
Date: Wed, 1 Apr 2009 17:06:18 -0400
Subject: ***SPAM*** Re: [ofa-general] uDAPL DTO completion question.
In-Reply-To: <469958e00904010852ydaf5b07wccdde27dd02ca724@mail.gmail.com> References: <49D2BD00.5010002@cs.anu.edu.au> <469958e00903312040j7700d2ccr9104996c2fc29cd4@mail.gmail.com> <517c62fb0903312253w6344d62j1b8c072354b15ad2@mail.gmail.com> <49D30C7F.1050201@cs.anu.edu.au> <469958e00904010852ydaf5b07wccdde27dd02ca724@mail.gmail.com> Message-ID: <517c62fb0904011406x7f97c8edye606b0fa6eed4bd4@mail.gmail.com> Jie, in addition the only guarantee that transport standards provide are at completion reap time. The state of the user buffer prior to that is opaque. Some vendors may provide stronger semantic but it is outside the scope of both IB and iWARP transports. Cheers, Arkady On Wed, Apr 1, 2009 at 11:52 AM, Caitlin Bestler wrote: > On Tue, Mar 31, 2009 at 11:41 PM, Jie Cai wrote: > > Understood now. A further question is here again. > > > > To implement software level acknowledgment to inform initiator that data > > has been available for remoter, is that possible to use a busy loop at > > remote > > side to detect the last element of transferring has appear in the memory. > > > > Or remoter has to wait for the event of recv matching initiator's send, > then > > send a message back to initiator as a acknowledgment? > > > > There are two issues when spinning on a remote memory update. > > The first is that packets may be received and processed out of order, > especially for iWARP. Therefore the fact that the last byte has been > received and placed does not guarantee that the prior packets have > been received and placed. > > More importantly, the order in which updates become visible to a > specific software thread can make the order of updates unpredictable > to the application. > > When delivering a completion the Provider is responsible for dealing > with both of these problems. So when you reap a completion from the > CQ, the operation it represents (and all prior operations) are complete. 
> There are no gaps in received packets, nothing is still sitting on an
> Adapter buffer waiting to be placed in host memory.
>
> If your application does not want to block you can consider polling
> the cq whether than enabling notifications. But polling memory locations
> directly should only be done when you're willing to have bus/adapter
> specific dependencies. You working code might stop working when
> your network changes, or you install a new Adapter that has a different
> strategy for optimizing its writes over the PCIe bus.
>

--
Cheers,
Arkady Kanevsky

From guido.passet at clustervision.com Thu Apr 2 00:56:51 2009
From: guido.passet at clustervision.com (Guido Passet)
Date: Thu, 02 Apr 2009 09:56:51 +0200
Subject: [ofa-general] MVAPICH_GCC seems incompatible with (RedHat)gcc-4.1.2-44
Message-ID: <49D46FC3.4050704@clustervision.com>

Dear OpenFabrics list,

RedHat and spinoffs recently shipped v5.3, and many RedHat-like
distributions such as Scientific Linux picked up the new flood of packages.

One curiosity I am trying to debug is the fact that MVAPICH/MVAPICH2 do
not seem to compile anymore after the update from:

gcc-4.1.2-42.el5.x86_64 (RedHat 5.2)

to:

gcc-4.1.2-44.el5.x86_64 (RedHat 5.3)

On running ./install.pl --build32 --all

I get:

mvapich_gcc is not available on this platform
mvapich2_gcc is not available on this platform

and no rpms are being built, while all other compilers (PGI/Pathscale
and Intel) seem to work fine.

I tried OFED-1.4.1-20090401-0600, OFED-1.4.1-20090319-0600,
OFED-1.4-20090301-0600 and a couple more, but I seem to be in trouble on
all versions.

Any pointers on how to get these parts of the stack compiled against
GCC would be more than welcome.

Cheers,
Guido.

From eli at dev.mellanox.co.il Thu Apr 2 01:44:06 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Thu, 2 Apr 2009 11:44:06 +0300 Subject: [ofa-general] Re: [ewg] Re: [PATCH v3] mlx4_ib: Optimize hugetlab pages support In-Reply-To: <49D0EA11.2040409@Voltaire.COM> References: <20090129152725.GA26284@mtls03> <49D0EA11.2040409@Voltaire.COM> Message-ID: <20090402084406.GB21370@mtls03> On Mon, Mar 30, 2009 at 06:49:37PM +0300, Yossi Etigin wrote: Yossi, I wouldn't expect this to matter since handle_hugetlb_user_mr() only gets called when the memory is huge pages which means the number of PAGE_SIZE pages cover full HUGE_PAGES. You mention in Bugzilla an "mckey" test but I don't know this test. Can you send how to obatain the test and instructions how to build it and run it? > + > + if (cur_size) { > + arr[j++] = cur_addr; > + } > + http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg From ogerlitz at voltaire.com Thu Apr 2 01:47:26 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 02 Apr 2009 11:47:26 +0300 Subject: [ofa-general] Re: [ewg] Re: [PATCH v3] mlx4_ib: Optimize hugetlab pages support In-Reply-To: <20090402084406.GB21370@mtls03> References: <20090129152725.GA26284@mtls03> <49D0EA11.2040409@Voltaire.COM> <20090402084406.GB21370@mtls03> Message-ID: <49D47B9E.4090404@voltaire.com> Eli Cohen wrote: > You mention in Bugzilla an "mckey" test but I don't know this test. Can you send how to obatain the test and instructions how to build it and run it? > Eli, mckey is installed with librdmac-utils, has man page, etc. Its source is under the examples directory of the librdmacm src tree, you can clone Sean's librdmacm git from ofa to get it - OK? Or. 
From eli at dev.mellanox.co.il Thu Apr 2 02:04:10 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Thu, 2 Apr 2009 12:04:10 +0300 Subject: [ofa-general] Re: [ewg] Re: [PATCH v3] mlx4_ib: Optimize hugetlab pages support In-Reply-To: <49D47B9E.4090404@voltaire.com> References: <20090129152725.GA26284@mtls03> <49D0EA11.2040409@Voltaire.COM> <20090402084406.GB21370@mtls03> <49D47B9E.4090404@voltaire.com> Message-ID: <20090402090410.GC21370@mtls03> On Thu, Apr 02, 2009 at 11:47:26AM +0300, Or Gerlitz wrote: > mckey is installed with librdmac-utils, has man page, etc. Its source is > under the examples directory of the librdmacm src tree, you can clone > Sean's librdmacm git from ofa to get it - OK? > Thanks Or. I will check this. From ogerlitz at Voltaire.com Thu Apr 2 02:30:01 2009 From: ogerlitz at Voltaire.com (Or Gerlitz) Date: Thu, 02 Apr 2009 12:30:01 +0300 Subject: [ofa-general] [PATCH] doc/ipoib: document CM, offloads, interrupt moderation In-Reply-To: References: Message-ID: <49D48599.80308@Voltaire.com> Or Gerlitz wrote: > Update the documentation to include connected mode, stateless offloads > and interrupt moderation. Roland, will you be able to review/merge this for 2.6.30? Or. From ogerlitz at voltaire.com Thu Apr 2 02:40:33 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 02 Apr 2009 12:40:33 +0300 Subject: [ofa-general] [PATCH] IB/mlx4: Use pgprot_writecombine() for BlueFlame pages In-Reply-To: References: <15ddcffd0903291006g4b7549cfj1879dd67518f8bff@mail.gmail.com><200903301117.32355.jackm@dev.mellanox.co.il><49D1FD74.9040205@Voltaire.com> <49D31884.5020609@Voltaire.com> <49D31B4F.3000700@Voltaire.com> Message-ID: <49D48811.2080204@voltaire.com> Roland Dreier wrote: > > I don't understand: your results below don't seem to have good results > without X86_PAT? 
In any case I don't see how you can get the lowest > latency (from WC mapping to userspace) without X86_PAT, since > pgprot_writecombine() will fall back to pgprot_noncached() is X86_PAT=n. > Again, the facts are that on my latest round of testing I got a latency of 1.4us with and without X86_PAT which went down to 1.1us when applying your patch on top of a X86_PAT config. Even if I'm doing something wrong, I still want to see that we understand each other... so please let me know if you still don't understand me. Or. From vlad at lists.openfabrics.org Thu Apr 2 03:22:37 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 2 Apr 2009 03:22:37 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090402-0200 daily build status Message-ID: <20090402102237.F3BD5E60C56@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with 
linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From chien.tin.tung at intel.com Thu Apr 2 06:31:31 2009 From: chien.tin.tung at intel.com (Tung, Chien Tin) Date: Thu, 2 Apr 2009 06:31:31 -0700 Subject: [ofa-general] Re: Dereferencing freed memory bugs In-Reply-To: References: <49CE7688.2020501@gmail.com> <60BEFF3FBD4C6047B0F13F205CAFA38303363875A3@azsmsx501.amr.corp.intel.com> <60BEFF3FBD4C6047B0F13F205CAFA38303363877F7@azsmsx501.amr.corp.intel.com> <60BEFF3FBD4C6047B0F13F205CAFA3830336387844@azsmsx501.amr.corp.intel.com> Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA38303363E7D50@azsmsx501.amr.corp.intel.com> >I checked it with 2.6.29-git9. There were still a couple issues in >drivers/infiniband/hw/nes/nes_cm.c. Thank you for checking against the latest source. We will look into the issues. 
Chien From perkinjo at cse.ohio-state.edu Thu Apr 2 08:04:11 2009 From: perkinjo at cse.ohio-state.edu (Jonathan Perkins) Date: Thu, 2 Apr 2009 11:04:11 -0400 Subject: [ofa-general] MVAPICH_GCC seems incompatible with (RedHat)gcc-4.1.2-44 In-Reply-To: <49D46FC3.4050704@clustervision.com> References: <49D46FC3.4050704@clustervision.com> Message-ID: <20090402150411.GI3078@cse.ohio-state.edu> Guido: Thanks for reporting this issue. I'll take a look into the install process and see if I can find any logic that may be leading to this behavior. While I'm looking into this, can you also try installing our packages directly to see if this works. You can find our tarballs from the following urls: http://mvapich.cse.ohio-state.edu/download/mvapich/ http://mvapich.cse.ohio-state.edu/download/mvapich2/ On Thu, Apr 02, 2009 at 09:56:51AM +0200, Guido Passet wrote: > Dear OpenFabrics list, > > > RedHat and spinoffs recently shipped v5.3 and many RedHat alike > distributions like Scientific Linux picked up the new flood of packages. > > One curiosity I am trying to debug is the fact that MVAPICH/MVAPICH2 do > not seem to compile anymore after the update from: > > gcc-4.1.2-42.el5.x86_64 (RedHat 5.2) > > to: > > gcc-4.1.2-44.el5.x86_64 (RedHat 5.3) > > On running ./install.pl --build32 --all > > I get: > > mvapich_gcc is not available on this platform > mvapich2_gcc is not available on this platform > > and no rpms are being build.. > > While all other compilers (PGI/Pathscale and Intel) seem to work fine. > > I tried OFED-1.4.1-20090401-0600, OFED-1.4.1-20090319-0600, > OFED-1.4-20090301-0600 and a couple more but I seem to be in trouble on > all versions. > > Any pointers into a direction to get these stack parts compiled against > GCC would be more than welcome. > > Cheers, > Guido. 
> _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 197 bytes Desc: not available URL: From guido.passet at clustervision.com Thu Apr 2 08:22:29 2009 From: guido.passet at clustervision.com (Guido Passet) Date: Thu, 02 Apr 2009 17:22:29 +0200 Subject: [ofa-general] MVAPICH_GCC seems incompatible with (RedHat)gcc-4.1.2-44 In-Reply-To: <20090402150411.GI3078@cse.ohio-state.edu> References: <49D46FC3.4050704@clustervision.com> <20090402150411.GI3078@cse.ohio-state.edu> Message-ID: <49D4D835.2010908@clustervision.com> Hi Jonathan, many thanks for looking in to this. Both mvapich-1.1 and mvapich2-1.2p1 seem to build fine from source. My guess is that the logic in the OFED wrapper build scripts are somehow confused. Best regards, Guido Passet. Jonathan Perkins wrote: > Guido: > Thanks for reporting this issue. I'll take a look into the install > process and see if I can find any logic that may be leading to this > behavior. > > While I'm looking into this, can you also try installing our packages > directly to see if this works. You can find our tarballs from the > following urls: > http://mvapich.cse.ohio-state.edu/download/mvapich/ > http://mvapich.cse.ohio-state.edu/download/mvapich2/ > > On Thu, Apr 02, 2009 at 09:56:51AM +0200, Guido Passet wrote: >> Dear OpenFabrics list, >> >> >> RedHat and spinoffs recently shipped v5.3 and many RedHat alike >> distributions like Scientific Linux picked up the new flood of packages. 
>> >> One curiosity I am trying to debug is the fact that MVAPICH/MVAPICH2 do >> not seem to compile anymore after the update from: >> >> gcc-4.1.2-42.el5.x86_64 (RedHat 5.2) >> >> to: >> >> gcc-4.1.2-44.el5.x86_64 (RedHat 5.3) >> >> On running ./install.pl --build32 --all >> >> I get: >> >> mvapich_gcc is not available on this platform >> mvapich2_gcc is not available on this platform >> >> and no rpms are being build.. >> >> While all other compilers (PGI/Pathscale and Intel) seem to work fine. >> >> I tried OFED-1.4.1-20090401-0600, OFED-1.4.1-20090319-0600, >> OFED-1.4-20090301-0600 and a couple more but I seem to be in trouble on >> all versions. >> >> Any pointers into a direction to get these stack parts compiled against >> GCC would be more than welcome. >> >> Cheers, >> Guido. >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > -- Guido Passet Email: guido.passet at clustervision.com Engineering Manager ClusterVision BV Nieuw-Zeelandweg 15B Web: http://www.clustervision.com 1045 AL Amsterdam Tel: +31 20 407 7550 The Netherlands Fax: +31 84 759 8389 KvK Amsterdam 30184312 VAT/BTW NL8117.05.195.B01 From bs_lists at aakef.fastmail.fm Thu Apr 2 11:07:20 2009 From: bs_lists at aakef.fastmail.fm (Bernd Schubert) Date: Thu, 2 Apr 2009 20:07:20 +0200 Subject: [ofa-general] ib_mthca 0000:0d:00.0: Async event 16 for bogus QP 00da0407 Message-ID: <200904022007.20630.bs_lists@aakef.fastmail.fm> Hello, I'm fighting (as usual) with some Lustre problems and I think this time it is IB related. In the logs of some systems I see messages like these: ib_mthca 0000:0d:00.0: Async event 16 for bogus QP 00da0407 Anyone knows what is the meaning of that? The kernel modules are from OFED-1.3.1. 
Thanks, Bernd From root at voltaire.com Thu Apr 2 13:31:42 2009 From: root at voltaire.com (Yossi Etigin) Date: Thu, 2 Apr 2009 23:31:42 +0300 (IDT) Subject: [ofa-general] Re: [PATCH v3] mlx4_ib: Optimize hugetlab pages support In-Reply-To: <20090402084406.GB21370@mtls03> References: <20090129152725.GA26284@mtls03> <49D0EA11.2040409@Voltaire.COM> <20090402084406.GB21370@mtls03> Message-ID: Hi Eli, I've placed a printk in the new hugetlb function to print n and j. While running the mckey test (attached in bugzilla), I got j=0, n=1. Why do you say that the number of pages must cover HUGE_PAGES? In ib_umem_get, hugetlb is set to 0 only if any of the pages is not-hugetlb - otherwise it's 1. Am I missing something? --Yossi On Thu, 2 Apr 2009, Eli Cohen wrote: > On Mon, Mar 30, 2009 at 06:49:37PM +0300, Yossi Etigin wrote: > > Yossi, > > I wouldn't expect this to matter since handle_hugetlb_user_mr() only > gets called when the memory is huge pages which means the number of > PAGE_SIZE pages cover full HUGE_PAGES. You mention in Bugzilla an > "mckey" test but I don't know this test. Can you send how to obatain > the test and instructions how to build it and run it? > >> + >> + if (cur_size) { >> + arr[j++] = cur_addr; >> + } >> + > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg > From root at voltaire.com Thu Apr 2 13:37:14 2009 From: root at voltaire.com (Yossi Etigin) Date: Thu, 2 Apr 2009 23:37:14 +0300 (IDT) Subject: [ofa-general] Re: [PATCH v2] rdma_cm: create cm id even when port is down In-Reply-To: References: <49D0E1A4.3070804@Voltaire.COM> <49D10673.3000006@Voltaire.COM> Message-ID: Roland Dreier wrote: > patch is corrupted -- not sure how you sent it, but eg: > Sorry, about that, here is the patch. When doing rdma_resolve_addr() and relevant port is down, the function fails and rdma_cm id is not bound to the device. Therefore, application does not have device handle and cannot wait for the port to become active. 
The function fails because ipoib is not joined to the multicast group and therefore sa does not have a multicast record to take a qkey from. The proposed patch is to make lazy qkey resolution - cma_set_qkey will set id_priv->qkey if it was not set, and will be called just before the qkey is really required. Signed-off-by: Yossi Etigin --- drivers/infiniband/core/cma.c | 41 +++++++++++++++++++++++++++-------------- 1 file changed, 27 insertions(+), 14 deletions(-) Index: b/drivers/infiniband/core/cma.c =================================================================== --- a/drivers/infiniband/core/cma.c 2009-03-30 18:27:36.000000000 +0300 +++ b/drivers/infiniband/core/cma.c 2009-03-30 19:01:30.000000000 +0300 @@ -297,21 +297,25 @@ static void cma_detach_from_dev(struct r id_priv->cma_dev = NULL; } -static int cma_set_qkey(struct ib_device *device, u8 port_num, - enum rdma_port_space ps, - struct rdma_dev_addr *dev_addr, u32 *qkey) +static int cma_set_qkey(struct rdma_id_private *id_priv) { struct ib_sa_mcmember_rec rec; int ret = 0; - switch (ps) { + if (id_priv->qkey) + return 0; + + switch (id_priv->id.ps) { case RDMA_PS_UDP: - *qkey = RDMA_UDP_QKEY; + id_priv->qkey = RDMA_UDP_QKEY; break; case RDMA_PS_IPOIB: - ib_addr_get_mgid(dev_addr, &rec.mgid); - ret = ib_sa_get_mcmember_rec(device, port_num, &rec.mgid, &rec); - *qkey = be32_to_cpu(rec.qkey); + ib_addr_get_mgid(&id_priv->id.route.addr.dev_addr, &rec.mgid); + ret = ib_sa_get_mcmember_rec(id_priv->id.device, + id_priv->id.port_num, &rec.mgid, + &rec); + if (!ret) + id_priv->qkey = be32_to_cpu(rec.qkey); break; default: break; @@ -341,12 +345,7 @@ static int cma_acquire_dev(struct rdma_i ret = ib_find_cached_gid(cma_dev->device, &gid, &id_priv->id.port_num, NULL); if (!ret) { - ret = cma_set_qkey(cma_dev->device, - id_priv->id.port_num, - id_priv->id.ps, dev_addr, - &id_priv->qkey); - if (!ret) - cma_attach_to_dev(id_priv, cma_dev); + cma_attach_to_dev(id_priv, cma_dev); break; } } @@ -578,6 +577,10 @@ static 
int cma_ib_init_qp_attr(struct rd *qp_attr_mask = IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_PORT; if (cma_is_ud_ps(id_priv->id.ps)) { + ret = cma_set_qkey(id_priv); + if (ret) + return ret; + qp_attr->qkey = id_priv->qkey; *qp_attr_mask |= IB_QP_QKEY; } else { @@ -2201,6 +2204,12 @@ static int cma_sidr_rep_handler(struct i event.status = ib_event->param.sidr_rep_rcvd.status; break; } + ret = cma_set_qkey(id_priv); + if (ret) { + event.event = RDMA_CM_EVENT_ADDR_ERROR; + event.status = -EINVAL; + break; + } if (id_priv->qkey != rep->qkey) { event.event = RDMA_CM_EVENT_UNREACHABLE; event.status = -EINVAL; @@ -2480,10 +2489,14 @@ static int cma_send_sidr_rep(struct rdma const void *private_data, int private_data_len) { struct ib_cm_sidr_rep_param rep; + int ret; memset(&rep, 0, sizeof rep); rep.status = status; if (status == IB_SIDR_SUCCESS) { + ret = cma_set_qkey(id_priv); + if (ret) + return ret; rep.qp_num = id_priv->qp_num; rep.qkey = id_priv->qkey; } From faisal.latif at intel.com Thu Apr 2 14:36:14 2009 From: faisal.latif at intel.com (Latif, Faisal) Date: Thu, 2 Apr 2009 14:36:14 -0700 Subject: [ofa-general] Re: Dereferencing freed memory bugs In-Reply-To: References: <49CE7688.2020501@gmail.com> <60BEFF3FBD4C6047B0F13F205CAFA38303363875A3@azsmsx501.amr.corp.intel.com> <60BEFF3FBD4C6047B0F13F205CAFA38303363877F7@azsmsx501.amr.corp.intel.com> <60BEFF3FBD4C6047B0F13F205CAFA3830336387844@azsmsx501.amr.corp.intel.com> Message-ID: <588992150B702C48B3312184F1B810AD03E3E82EA2@azsmsx501.amr.corp.intel.com> Thanks Dan for the input. There were 3 issues (comments inline) that required changes while others were OK. I will be creating a patch for them as you pointed out. 
Faisal >-----Original Message----- >From: Dan Carpenter [mailto:error27 at gmail.com] >Sent: Thursday, April 02, 2009 1:38 AM >To: Tung, Chien Tin >Cc: Roland Dreier; Marcin Slusarz; general at lists.openfabrics.org >Subject: Re: [ofa-general] Re: Dereferencing freed memory bugs > >I checked it with 2.6.29-git9. There were still a couple issues in >drivers/infiniband/hw/nes/nes_cm.c. > > 428 if (cm_node->recv_entry) { > 429 WARN_ON(1); > 430 return -EINVAL; > 431 } > >missing kfree(new_send); Will be adding the kfree() before the WARN_ON(. > > 521 rem_ref_cm_node(cm_node->cm_core, cm_node); > 522 } > 523 if (cm_node->cm_id) > 524 cm_id->rem_ref(cm_id); > >dereferencing freed memory. rem_ref_cm_node can call kfree on cm_node. > > 662 >rem_ref_cm_node(cm_node->cm_core, > 663 cm_node); > 664 } > 665 } while (0); > 666 > 667 >spin_unlock_irqrestore(&cm_node->retrans_list_lock, flags); > >same. All the above were OK in the code. rem_ref_cm_node() will not be freeing up the cm_node memory as its ref_count will always be greater than 1 in above situations. The rem_ref_cm_node() will be decrementing the ref_count and returning. > > 1265 cm_node->freed = 1; > 1266 kfree(cm_node); > >You can't actually checked cm_node->freed if it's freed. Yes, this was leftover from previous debugging where we had commented out kfree() temporarily to find a bug. I removed the freed flag from the cm_node as it was not used anywhere else. > > 2007 loopbackremotenode = >make_cm_node(cm_core, nesvnic, > 2008 &loopback_cm_info, >loopbackremotelistener); > 2009 loopbackremotenode->loopbackpartner = >cm_node; > >make_cm_node() returns NULL in low memory situations. Yes I will fix the above. > >Don't forget to add the reported by sticker. 
:P >Reported-by: Dan Carpenter > >regards, >dan carpenter From sean.hefty at intel.com Thu Apr 2 15:10:01 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 2 Apr 2009 15:10:01 -0700 Subject: [ofa-general] RE: QoS setting and propagation In-Reply-To: <517c62fb0904021457r185627f6l400607bc2dc5a8d8@mail.gmail.com> References: <517c62fb0904021457r185627f6l400607bc2dc5a8d8@mail.gmail.com> Message-ID: responding on general list: >do we set QoS parameters in SM only? The SM must be configured with QoS. You'll need to look in the opensm QoS documentation to see how to setup QoS. (I don't know those details.) >I looked in cma.c and ib_cm and iw_cm and do not see any parameter passing for >QoS. >Am I missing something? IB specifies qos using the service ID and qos_class fields in the PR query. This is done during 'route resolution'. See cma_query_ib_route(). >Can we set it in transport independent way? See rdma_set_service_type(). This call is intended to be generic. - Sean From arkady.kanevsky at gmail.com Thu Apr 2 15:57:38 2009 From: arkady.kanevsky at gmail.com (arkady kanevsky) Date: Thu, 2 Apr 2009 18:57:38 -0400 Subject: [ofa-general] ***SPAM*** Re: QoS setting and propagation In-Reply-To: References: <517c62fb0904021457r185627f6l400607bc2dc5a8d8@mail.gmail.com> Message-ID: <517c62fb0904021557u584b253bm86b1b3b6d551da16@mail.gmail.com> Got it. Thanks. So the plan is to hook up service_type/tos to VLAN and skb->priority for iWARP? But since we do not setup socket connection explicitly we can not use SO_PRIORITY field to do it. Is this correct? Do we have a plan on how to hook up tos without socket? Thanks, Arkady On Thu, Apr 2, 2009 at 6:10 PM, Sean Hefty wrote: > responding on general list: > > >do we set QoS parameters in SM only? > > The SM must be configured with QoS. You'll need to look in the opensm QoS > documentation to see how to setup QoS. (I don't know those details.) 
> > >I looked in cma.c and ib_cm and iw_cm and do not see any parameter passing > for > >QoS. > >Am I missing something? > > IB specifies qos using the service ID and qos_class fields in the PR query. > This is done during 'route resolution'. See cma_query_ib_route(). > > >Can we set it in transport independent way? > > See rdma_set_service_type(). This call is intended to be generic. > > - Sean > > -- Cheers, Arkady Kanevsky -------------- next part -------------- An HTML attachment was scrubbed... URL: From rdreier at cisco.com Thu Apr 2 17:17:29 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 02 Apr 2009 17:17:29 -0700 Subject: [ofa-general] [PATCH] IB/mlx4: Use pgprot_writecombine() for BlueFlame pages In-Reply-To: <49D48811.2080204@voltaire.com> (Or Gerlitz's message of "Thu, 02 Apr 2009 12:40:33 +0300") References: <15ddcffd0903291006g4b7549cfj1879dd67518f8bff@mail.gmail.com> <200903301117.32355.jackm@dev.mellanox.co.il> <49D1FD74.9040205@Voltaire.com> <49D31884.5020609@Voltaire.com> <49D31B4F.3000700@Voltaire.com> <49D48811.2080204@voltaire.com> Message-ID: > Again, the facts are that on my latest round of testing I got a > latency of 1.4us with and without X86_PAT which went down to 1.1us > when applying your patch on top of a X86_PAT config. Even if I'm doing > something wrong, I still want to see that we understand each > other... so please let me know if you still don't understand me. Those results make sense. The only thing you said that I don't understand: > I noted that I have CONFIG_MTRR=y so maybe this can explain the nice latency > even without setting X86_PAT? But you are getting the worse latency without X86_PAT -- you need X86_PAT and a patch to use PAT in mlx4_ib to get better latency, which is as I expect. So I'm not sure why you talk about "nice" latency without X86_PAT. - R. 
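As a quick sanity check on the figures quoted above, the relative improvement from roughly 1.4us to roughly 1.1us can be worked out with a few lines of C. This is only an illustrative sketch: the helper name latency_reduction_pct is hypothetical, and the inputs are the rounded values mentioned in this thread, not new measurements.

```c
/*
 * Relative latency reduction, in percent, when going from
 * before_us to after_us (both in microseconds).
 * Hypothetical helper for illustration only; before_us must be non-zero.
 */
static double latency_reduction_pct(double before_us, double after_us)
{
    return (before_us - after_us) / before_us * 100.0;
}
```

With the rounded inputs 1.4 and 1.1 the helper returns about 21.4, which is in line with the roughly 20% reduction reported earlier in the thread for a PAT-enabled kernel with the write-combining patch applied.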
From sean.hefty at intel.com Thu Apr 2 17:28:42 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 2 Apr 2009 17:28:42 -0700 Subject: [ofa-general] RE: QoS setting and propagation In-Reply-To: <517c62fb0904021557u584b253bm86b1b3b6d551da16@mail.gmail.com> References: <517c62fb0904021457r185627f6l400607bc2dc5a8d8@mail.gmail.com> <517c62fb0904021557u584b253bm86b1b3b6d551da16@mail.gmail.com> Message-ID: <9B4A01365359432E8A2567E0B1085D9E@amr.corp.intel.com> >So the plan is to hook up service_type/tos to VLAN and skb->priority >for iWARP? But since we do not setup socket connection explicitly >we can not use SO_PRIORITY field to do it. Is this correct? >Do we have a plan on how to hook up tos without socket? I really don't have any plans at this time to link QoS and iWarp. Someone working closer with iWarp would need to provide any sort of implementation. - Sean From devel at morey-chaisemartin.com Thu Apr 2 21:46:18 2009 From: devel at morey-chaisemartin.com (Nicolas Morey-Chaisemartin) Date: Fri, 03 Apr 2009 06:46:18 +0200 Subject: [ofa-general] RE: QoS setting and propagation In-Reply-To: References: <517c62fb0904021457r185627f6l400607bc2dc5a8d8@mail.gmail.com> Message-ID: <49D5949A.7010600@morey-chaisemartin.com> Hi, Sean Hefty a écrit : > responding on general list: ... > See rdma_set_service_type(). This call is intended to be generic. Are there reserved service_type values? If I got it right, the regular ULPs from the OFED stack are already using the service level to get their QoS level. If so, is there a list somewhere, so that if we force the service type in an application we don't conflict with an existing one?
Nicolas From vlad at lists.openfabrics.org Fri Apr 3 03:22:38 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 3 Apr 2009 03:22:38 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090403-0200 daily build status Message-ID: <20090403102238.6A4E3E6106A@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on 
ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From hal.rosenstock at gmail.com Fri Apr 3 04:03:19 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 3 Apr 2009 07:03:19 -0400 Subject: [ofa-general] RE: QoS setting and propagation In-Reply-To: <49D5949A.7010600@morey-chaisemartin.com> References: <517c62fb0904021457r185627f6l400607bc2dc5a8d8@mail.gmail.com> <49D5949A.7010600@morey-chaisemartin.com> Message-ID: On Fri, Apr 3, 2009 at 12:46 AM, Nicolas Morey-Chaisemartin wrote: > Hi, > Sean Hefty a écrit : >> responding on general list: > ... >> See rdma_set_service_type().  This call is intended to be generic. > > Are there reservered service_type? > If I got it right, regular ULP from OFED stack are already using service level to get their QoS level. The SL used by the app should be based on what comes back in either a SA PathRecord or possibly MCMemberRecord response (for IPoIB UD). For RDMA CM based apps, the query is based on Service ID and either QoS class (added by QoS Annex) or traffic class depending on the address family used. > If yes, is there a list somewhere so if we force the service type in an application we don't conflict with an existing one. QoS class (QoS Annex) can be per DiffServ (see QoS Annex). I don't understand what you mean by service type not conflicting unless you mean service ID rather than type. Service IDs are standardized by IETF or IBTA. There's also provision for local OS or external organizations as well. See IBA 1.2.1 Annex A3 if that's what you mean. 
-- Hal > Nicolas From caitlin.bestler at gmail.com Fri Apr 3 09:13:17 2009 From: caitlin.bestler at gmail.com (Caitlin Bestler) Date: Fri, 3 Apr 2009 09:13:17 -0700 Subject: ***SPAM*** Re: [ofa-general] RE: QoS setting and propagation In-Reply-To: <9B4A01365359432E8A2567E0B1085D9E@amr.corp.intel.com> References: <517c62fb0904021457r185627f6l400607bc2dc5a8d8@mail.gmail.com> <517c62fb0904021557u584b253bm86b1b3b6d551da16@mail.gmail.com> <9B4A01365359432E8A2567E0B1085D9E@amr.corp.intel.com> Message-ID: <469958e00904030913j3e6282f1y200e0465d9ca7c16@mail.gmail.com> On Thu, Apr 2, 2009 at 5:28 PM, Sean Hefty wrote: >>So the plan is to hook up service_type/tos to VLAN and skb->priority >>for iWARP? But since we do not setup socket connection explicitly >>we can not use SO_PRIORITY field to do it. Is this correct? >>Do we have a plan on how to hook up tos without socket? > > I really don't have any plans at this time to link QoS and iWarp.  Someone > working closer with iWarp would need to provide any sort of implementation. > Implementing a parallel solution for VLAN/priority and other policy controls for iWARP connections would be terribly wasteful, both in development time and in the need for network administrators to use two different sets of tools to implement what should be an integrated set of policies. But that would require the netdev stack to acknowledge the existence of offloaded connections, and instead of just waiting for them to go away insist that they follow a minimal set of rules to provide co-ordinated management. In my opinion this won't happen because vendors push for it. Fabric users are going to have to make their need for uniform control of VLAN/priorities/etc. understood. 
From arkady.kanevsky at gmail.com Fri Apr 3 09:30:16 2009 From: arkady.kanevsky at gmail.com (arkady kanevsky) Date: Fri, 3 Apr 2009 12:30:16 -0400 Subject: ***SPAM*** Re: [ofa-general] RE: QoS setting and propagation In-Reply-To: <469958e00904030913j3e6282f1y200e0465d9ca7c16@mail.gmail.com> References: <517c62fb0904021457r185627f6l400607bc2dc5a8d8@mail.gmail.com> <517c62fb0904021557u584b253bm86b1b3b6d551da16@mail.gmail.com> <9B4A01365359432E8A2567E0B1085D9E@amr.corp.intel.com> <469958e00904030913j3e6282f1y200e0465d9ca7c16@mail.gmail.com> Message-ID: <517c62fb0904030930q1806a4cajc73093154d4fee31@mail.gmail.com> Caitlin, can you clarify what you are suggesting? Are you proposing that RDMA APIs for QoS setting is the not the way to go? Are you proposing to use netdev mechanisms for setting RFC2474 DS fields to be used for RDMA QoS setting? Something else? Thanks, Arkady On Fri, Apr 3, 2009 at 12:13 PM, Caitlin Bestler wrote: > On Thu, Apr 2, 2009 at 5:28 PM, Sean Hefty wrote: > >>So the plan is to hook up service_type/tos to VLAN and skb->priority > >>for iWARP? But since we do not setup socket connection explicitly > >>we can not use SO_PRIORITY field to do it. Is this correct? > >>Do we have a plan on how to hook up tos without socket? > > > > I really don't have any plans at this time to link QoS and iWarp. > Someone > > working closer with iWarp would need to provide any sort of > implementation. > > > Implementing a parallel solution for VLAN/priority and other policy > controls > for iWARP connections would be terribly wasteful, both in development time > and in the need for network administrators to use two different sets of > tools > to implement what should be an integrated set of policies. > > But that would require the netdev stack to acknowledge the existence of > offloaded connections, and instead of just waiting for them to go away > insist > that they follow a minimal set of rules to provide co-ordinated management. 
> > In my opinion this won't happen because vendors push for it. Fabric users > are going to have to make their need for uniform control of > VLAN/priorities/etc. > understood. > -- Cheers, Arkady Kanevsky -------------- next part -------------- An HTML attachment was scrubbed... URL: From YJia at tmriusa.com Fri Apr 3 09:44:14 2009 From: YJia at tmriusa.com (Yicheng Jia) Date: Fri, 3 Apr 2009 11:44:14 -0500 Subject: [ofa-general] change port's physical state without reset HCA Message-ID: Hi, I am using Mellanox MHES18 HCA connecting with Qlogic 9024 unmanaged switch. Sometime after system reboot I get 1x link width on two ports between HCA and the switch. The cable is good. It's supposed to be 4x, and usually after one more system reboot it will go back to 4x. My question is, can I simply change the port's physical state to polling and let it do link training again to get 4x link without reset the HCA? Is it possible? Thanks! Yicheng Jia _____________________________________________________________________________ Scanned by IBM Email Security Management Services powered by MessageLabs. For more information please visit http://www.ers.ibm.com _____________________________________________________________________________ -------------- next part -------------- An HTML attachment was scrubbed... URL: From devel at morey-chaisemartin.com Fri Apr 3 09:46:51 2009 From: devel at morey-chaisemartin.com (Nicolas Morey-Chaisemartin) Date: Fri, 03 Apr 2009 18:46:51 +0200 Subject: [ofa-general] change port's physical state without reset HCA In-Reply-To: References: Message-ID: <49D63D7B.9000408@morey-chaisemartin.com> You could reset the other end of the cable (on the switch) with ibportstate or simply unload/reload OFED drivers Nicolas Yicheng Jia a écrit : > > Hi, > > I am using Mellanox MHES18 HCA connecting with Qlogic 9024 unmanaged > switch. Sometime after system reboot I get 1x link width on two ports > between HCA and the switch. The cable is good. 
It's supposed to be 4x, > and usually after one more system reboot it will go back to 4x. My > question is, can I simply change the port's physical state to polling > and let it do link training again to get 4x link without reset the HCA? > Is it possible? > > Thanks! > Yicheng Jia From YJia at tmriusa.com Fri Apr 3 09:53:37 2009 From: YJia at tmriusa.com (Yicheng Jia) Date: Fri, 3 Apr 2009 11:53:37 -0500 Subject: [ofa-general] change port's physical state without reset HCA In-Reply-To: <49D63D7B.9000408@morey-chaisemartin.com> Message-ID: Hi Nicolas, Do I need to restart HCA driver if I just use ibportstate to reset the cable on the switch side? Thanks! Yicheng Jia Nicolas Morey-Chaisemartin 04/03/2009 11:47 AM Please respond to devel at morey-chaisemartin.com To Yicheng Jia cc general at lists.openfabrics.org Subject Re: [ofa-general] change port's physical state without reset HCA You could reset the other end of the cable (on the switch) with ibportstate or simply unload/reload OFED drivers Nicolas Yicheng Jia a écrit : > > Hi, > > I am using Mellanox MHES18 HCA connecting with Qlogic 9024 unmanaged > switch. Sometime after system reboot I get 1x link width on two ports > between HCA and the switch. The cable is good. It's supposed to be 4x, > and usually after one more system reboot it will go back to 4x.
My > question is, can I simply change the port's physical state to polling > and let it do link training again to get 4x link without reset the HCA? > Is it possible? > > Thanks! > Yicheng Jia From devel at morey-chaisemartin.com Fri Apr 3 10:04:50 2009 From: devel at morey-chaisemartin.com (Nicolas Morey-Chaisemartin) Date: Fri, 03 Apr 2009 19:04:50 +0200 Subject: [ofa-general] change port's physical state without reset HCA In-Reply-To: References: Message-ID: <49D641B2.5010104@morey-chaisemartin.com> No, you just need to do one or the other. It's not necessary to do both.
Nicolas Yicheng Jia a écrit : > > Hi Nicolas, > > Do I need to restart HCA driver if I just use ibportstate to reset the > cable on the switch side? > > Thanks! > > Yicheng Jia > > > > > *Nicolas Morey-Chaisemartin * > > 04/03/2009 11:47 AM > Please respond to > devel at morey-chaisemartin.com > > From Ted.Kim at Sun.COM Fri Apr 3 10:45:06 2009 From: Ted.Kim at Sun.COM (Ted H. Kim) Date: Fri, 03 Apr 2009 10:45:06 -0700 Subject: [ofa-general] checksum offload for IPonIB-CM Message-ID: <49D64B22.5020200@sun.com> Folks, When I looked at the OFED code, it did not look like IPonIB connected mode does checksum offload. Did I understand that right? Or is the story more complex than that? Maybe it does it one direction but not the other? (generate vs. check) Was there ever an idea that we could rely on the IB checksum to protect the payload instead? I vaguely remember some old presentation talking about checksum bypass or something similar. Thanks, -ted -- Ted H. Kim Sun Microsystems, Inc. ted.kim at sun.com 222 North Sepulveda Blvd., 10th Floor (310) 341-1116 El Segundo, CA 90245 (310) 341-1120 FAX From jgunthorpe at obsidianresearch.com Fri Apr 3 10:55:48 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Fri, 3 Apr 2009 11:55:48 -0600 Subject: [ofa-general] checksum offload for IPonIB-CM In-Reply-To: <49D64B22.5020200@sun.com> References: <49D64B22.5020200@sun.com> Message-ID: <20090403175548.GK6979@obsidianresearch.com> On Fri, Apr 03, 2009 at 10:45:06AM -0700, Ted H. Kim wrote: > Was there ever an idea that we could rely on > the IB checksum to protect the > payload instead? I vaguely remember some > old presentation talking about checksum > bypass or something similar. Some patches were made but I don't think they were ever completed. 
The difficulty was arranging things so that an absent checksum could be negotiated, and then ensuring that as the rx'd packet passed through the Linux stack it was properly flagged as "valid data, no checksum" so that the checksum is recomputed if the packet happens to be forwarded out another interface. Essentially within the Linux architecture it would become similar to how something like Xen works to provide checksum/segmentation offload to VMs (and probably using the code they wrote..) IIRC the interest was actually beyond just checksum but the whole bag of tricks including segmentation offload. Essentially just connect two Linux stacks together over RC with all the offload features turned on. Jason From donald.e.wood at intel.com Fri Apr 3 13:35:55 2009 From: donald.e.wood at intel.com (Don Wood) Date: Fri, 3 Apr 2009 15:35:55 -0500 Subject: [ofa-general] [PATCH] RDMA/nes: Physical memory registration is incorrect Message-ID: <20090403203555.GA2608@dewood-MOBL> The code incorrectly failed memory registration if the buffer was not page aligned. Also, the length field is mangled, causing the hardware to think the registration is much larger than it really is. The fix is to remove the page alignment restriction as well as the incorrect length adjustment.
Signed-off-by: Don Wood --- drivers/infiniband/hw/nes/nes_verbs.c | 13 +------------ 1 files changed, 1 insertions(+), 12 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c index 7e5b5ba..0374e7d 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.c +++ b/drivers/infiniband/hw/nes/nes_verbs.c @@ -2215,15 +2215,6 @@ static struct ib_mr *nes_reg_phys_mr(struct ib_pd *ib_pd, root_pbl_index++; cur_pbl_index = 0; } - if (buffer_list[i].addr & ~PAGE_MASK) { - /* TODO: Unwind allocated buffers */ - nes_free_resource(nesadapter, nesadapter->allocated_mrs, stag_index); - nes_debug(NES_DBG_MR, "Unaligned Memory Buffer: 0x%x\n", - (unsigned int) buffer_list[i].addr); - ibmr = ERR_PTR(-EINVAL); - kfree(nesmr); - goto reg_phys_err; - } if (!buffer_list[i].size) { nes_free_resource(nesadapter, nesadapter->allocated_mrs, stag_index); @@ -2238,7 +2229,7 @@ static struct ib_mr *nes_reg_phys_mr(struct ib_pd *ib_pd, if ((buffer_list[i-1].addr+PAGE_SIZE) != buffer_list[i].addr) single_page = 0; } - vpbl.pbl_vbase[cur_pbl_index].pa_low = cpu_to_le32((u32)buffer_list[i].addr); + vpbl.pbl_vbase[cur_pbl_index].pa_low = cpu_to_le32((u32)buffer_list[i].addr & PAGE_MASK); vpbl.pbl_vbase[cur_pbl_index++].pa_high = cpu_to_le32((u32)((((u64)buffer_list[i].addr) >> 32))); } @@ -2251,8 +2242,6 @@ static struct ib_mr *nes_reg_phys_mr(struct ib_pd *ib_pd, " length = 0x%016lX, index = 0x%08X\n", stag, (unsigned long)*iova_start, (unsigned long)region_length, stag_index); - region_length -= (*iova_start)&PAGE_MASK; - /* Make the leaf PBL the root if only one PBL */ if (root_pbl_index == 1) { root_vpbl.pbl_pbase = vpbl.pbl_pbase; -- 1.5.3.3 From donald.e.wood at intel.com Fri Apr 3 13:43:27 2009 From: donald.e.wood at intel.com (Don Wood) Date: Fri, 3 Apr 2009 15:43:27 -0500 Subject: [ofa-general] [PATCH] RDMA/nes: Incorrect casting for 32 bit driver Message-ID: <20090403204327.GA5556@dewood-MOBL> There were some incorrect casts that caused the driver to 
pass invalid addresses and lengths to the hardware. The problems were primarily with kernels with highmem configured but some could show up in non-highmem kernels, too. Signed-off-by: Don Wood --- drivers/infiniband/hw/nes/nes.h | 4 ++-- drivers/infiniband/hw/nes/nes_cm.c | 6 ++++-- 2 files changed, 6 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes.h b/drivers/infiniband/hw/nes/nes.h index 04b12ad..17621de 100644 --- a/drivers/infiniband/hw/nes/nes.h +++ b/drivers/infiniband/hw/nes/nes.h @@ -289,8 +289,8 @@ static inline __le32 get_crc_value(struct nes_v4_quad *nes_quad) static inline void set_wqe_64bit_value(__le32 *wqe_words, u32 index, u64 value) { - wqe_words[index] = cpu_to_le32((u32) ((unsigned long)value)); - wqe_words[index + 1] = cpu_to_le32((u32)(upper_32_bits((unsigned long)value))); + wqe_words[index] = cpu_to_le32((u32) value); + wqe_words[index + 1] = cpu_to_le32(upper_32_bits(value)); } static inline void diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c index 5242515..7c94247 100644 --- a/drivers/infiniband/hw/nes/nes_cm.c +++ b/drivers/infiniband/hw/nes/nes_cm.c @@ -2690,6 +2690,7 @@ int nes_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) struct ib_mr *ibmr = NULL; struct ib_phys_buf ibphysbuf; struct nes_pd *nespd; + u64 tagged_offset; @@ -2755,10 +2756,11 @@ int nes_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) ibphysbuf.addr = nesqp->ietf_frame_pbase; ibphysbuf.size = conn_param->private_data_len + sizeof(struct ietf_mpa_frame); + tagged_offset = (u64)(unsigned long)nesqp->ietf_frame; ibmr = nesibdev->ibdev.reg_phys_mr((struct ib_pd *)nespd, &ibphysbuf, 1, IB_ACCESS_LOCAL_WRITE, - (u64 *)&nesqp->ietf_frame); + &tagged_offset); if (!ibmr) { nes_debug(NES_DBG_CM, "Unable to register memory region" "for lSMM for cm_node = %p \n", @@ -2782,7 +2784,7 @@ int nes_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) sizeof(struct 
ietf_mpa_frame)); set_wqe_64bit_value(wqe->wqe_words, NES_IWARP_SQ_WQE_FRAG0_LOW_IDX, - (u64)nesqp->ietf_frame); + (u64)(unsigned long)nesqp->ietf_frame); wqe->wqe_words[NES_IWARP_SQ_WQE_LENGTH0_IDX] = cpu_to_le32(conn_param->private_data_len + sizeof(struct ietf_mpa_frame)); -- 1.5.3.3 From weiny2 at llnl.gov Fri Apr 3 15:42:44 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 3 Apr 2009 15:42:44 -0700 Subject: [ofa-general] [PATCH v3 0/3] Create a new library libibnetdisc and convert iblinkinfo and ibnetdiscover to that library. Message-ID: <20090403154244.a65227b5.weiny2@llnl.gov> Sasha, This new series uses the current master version of ibmad to decode the data. If you accept the mad_*printf functions then I can convert later. For now I want to get this library in! :-D Let me know what you think of this new series. I believe I have everything in it that we have discussed. Thanks, Ira From weiny2 at llnl.gov Fri Apr 3 15:42:54 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 3 Apr 2009 15:42:54 -0700 Subject: [ofa-general] [PATCH v3 2/3] Convert iblinkinfo.pl to C and use new ibnetdisc library. Message-ID: <20090403154254.5ab60589.weiny2@llnl.gov> From a677ae35fe7a5966f05b5859df8f00e9b18df864 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Fri, 3 Apr 2009 15:28:18 -0700 Subject: [PATCH] Convert iblinkinfo.pl to C and use new ibnetdisc library. 
Signed-off-by: Ira Weiny --- infiniband-diags/Makefile.am | 7 +- infiniband-diags/configure.in | 1 + infiniband-diags/scripts/iblinkinfo.pl | 327 ------------------------ infiniband-diags/scripts/iblinkinfo.pl.in | 40 +++ infiniband-diags/src/iblinkinfo.c | 386 +++++++++++++++++++++++++++++ 5 files changed, 432 insertions(+), 329 deletions(-) delete mode 100755 infiniband-diags/scripts/iblinkinfo.pl create mode 100755 infiniband-diags/scripts/iblinkinfo.pl.in create mode 100644 infiniband-diags/src/iblinkinfo.c diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am index 7b8523a..b480a4a 100644 --- a/infiniband-diags/Makefile.am +++ b/infiniband-diags/Makefile.am @@ -1,6 +1,7 @@ SUBDIRS = libibnetdisc -INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband +INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband \ + -I$(top_builddir)/libibnetdisc/include if DEBUG DBGFLAGS = -ggdb -D_DEBUG_ @@ -11,7 +12,7 @@ endif sbin_PROGRAMS = src/ibaddr src/ibnetdiscover src/ibping src/ibportstate \ src/ibroute src/ibstat src/ibsysstat src/ibtracert \ src/perfquery src/sminfo src/smpdump src/smpquery \ - src/saquery src/vendstat + src/saquery src/vendstat src/iblinkinfo if ENABLE_TEST_UTILS sbin_PROGRAMS += src/ibsendtrap src/mcm_rereg_test @@ -55,6 +56,8 @@ src_saquery_SOURCES = src/saquery.c src_ibsendtrap_SOURCES = src/ibsendtrap.c src_vendstat_SOURCES = src/vendstat.c src_mcm_rereg_test_SOURCES = src/mcm_rereg_test.c +src_iblinkinfo_SOURCES = src/iblinkinfo.c +src_iblinkinfo_LDADD = -libnetdisc man_MANS = man/ibaddr.8 man/ibcheckerrors.8 man/ibcheckerrs.8 \ man/ibchecknet.8 man/ibchecknode.8 man/ibcheckport.8 \ diff --git a/infiniband-diags/configure.in b/infiniband-diags/configure.in index 2b73167..4516dfa 100644 --- a/infiniband-diags/configure.in +++ b/infiniband-diags/configure.in @@ -166,6 +166,7 @@ AC_CONFIG_FILES([\ scripts/ibnodes \ scripts/ibswitches \ 
scripts/ibrouters \ + scripts/iblinkinfo.pl \ libibnetdisc/Makefile ]) AC_OUTPUT diff --git a/infiniband-diags/scripts/iblinkinfo.pl b/infiniband-diags/scripts/iblinkinfo.pl deleted file mode 100755 index b6b27ce..0000000 --- a/infiniband-diags/scripts/iblinkinfo.pl +++ /dev/null @@ -1,327 +0,0 @@ -#!/usr/bin/perl -# -# Copyright (c) 2006 The Regents of the University of California. -# Copyright (c) 2007-2008 Voltaire, Inc. All rights reserved. -# -# Produced at Lawrence Livermore National Laboratory. -# Written by Ira Weiny . -# -# This software is available to you under a choice of one of two -# licenses. You may choose to be licensed under the terms of the GNU -# General Public License (GPL) Version 2, available from the file -# COPYING in the main directory of this source tree, or the -# OpenIB.org BSD license below: -# -# Redistribution and use in source and binary forms, with or -# without modification, are permitted provided that the following -# conditions are met: -# -# - Redistributions of source code must retain the above -# copyright notice, this list of conditions and the following -# disclaimer. -# -# - Redistributions in binary form must reproduce the above -# copyright notice, this list of conditions and the following -# disclaimer in the documentation and/or other materials -# provided with the distribution. -# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, -# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF -# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND -# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS -# BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN -# ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN -# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. 
-# - -use strict; - -use Getopt::Std; -use IBswcountlimits; - -sub usage_and_exit -{ - my $prog = $_[0]; - print -"Usage: $prog [-Rhclp -S -D -C -P ]\n"; - print -" Report link speed and connection for each port of each switch which is active\n"; - print " -h This help message\n"; - print -" -R Recalculate ibnetdiscover information (Default is to reuse ibnetdiscover output)\n"; - print -" -D output only the switch specified by direct route path\n"; - print " -S output only the switch specified by (hex format)\n"; - print " -d print only down links\n"; - print - " -l (line mode) print all information for each link on each line\n"; - print -" -p print additional switch settings (PktLifeTime,HoqLife,VLStallCount)\n"; - print " -c print port capabilities (enabled/supported values)\n"; - print " -C use selected Channel Adaptor name for queries\n"; - print " -P use selected channel adaptor port for queries\n"; - print " -g print port guids instead of node guids\n"; - exit 2; -} - -my $argv0 = `basename $0`; -my $regenerate_map = undef; -my $single_switch = undef; -my $direct_route = undef; -my $line_mode = undef; -my $print_add_switch = undef; -my $print_extended_cap = undef; -my $only_down_links = undef; -my $ca_name = ""; -my $ca_port = ""; -my $print_port_guids = undef; -my $switch_found = "no"; -chomp $argv0; - -if (!getopts("hcpldRS:D:C:P:g")) { usage_and_exit $argv0; } -if (defined $Getopt::Std::opt_h) { usage_and_exit $argv0; } -if (defined $Getopt::Std::opt_D) { $direct_route = $Getopt::Std::opt_D; } -if (defined $Getopt::Std::opt_R) { $regenerate_map = $Getopt::Std::opt_R; } -if (defined $Getopt::Std::opt_S) { - $single_switch = format_guid($Getopt::Std::opt_S); -} -if (defined $Getopt::Std::opt_d) { $only_down_links = $Getopt::Std::opt_d; } -if (defined $Getopt::Std::opt_l) { $line_mode = $Getopt::Std::opt_l; } -if (defined $Getopt::Std::opt_p) { $print_add_switch = $Getopt::Std::opt_p; } -if (defined $Getopt::Std::opt_c) { $print_extended_cap = 
$Getopt::Std::opt_c; } -if (defined $Getopt::Std::opt_C) { $ca_name = $Getopt::Std::opt_C; } -if (defined $Getopt::Std::opt_P) { $ca_port = $Getopt::Std::opt_P; } -if (defined $Getopt::Std::opt_g) { $print_port_guids = $Getopt::Std::opt_g; } - -my $extra_smpquery_params = get_ca_name_port_param_string($ca_name, $ca_port); - -sub main -{ - get_link_ends($regenerate_map, $ca_name, $ca_port); - if (defined($direct_route)) { - # convert DR to guid, then use original single_switch option - $single_switch = convert_dr_to_guid($direct_route); - if (!defined($single_switch) || !is_switch($single_switch)) { - printf("The direct route (%s) does not map to a switch.\n", - $direct_route); - return; - } - } - foreach my $switch (sort (keys(%IBswcountlimits::link_ends))) { - if ($single_switch && $switch ne $single_switch) { - next; - } else { - $switch_found = "yes"; - } - my $switch_prompt = "no"; - my $num_ports = get_num_ports($switch, $ca_name, $ca_port); - if ($num_ports == 0) { - printf("ERROR: switch $switch has 0 ports???\n"); - } - my @output_lines = undef; - my $pkt_lifetime = ""; - my $pkt_life_prompt = ""; - my $port_timeouts = ""; - my $print_switch = "yes"; - if ($only_down_links) { $print_switch = "no"; } - if ($print_add_switch) { - my $data = `smpquery $extra_smpquery_params -G switchinfo $switch`; - if ($data eq "") { - printf("ERROR: failed to get switchinfo for $switch\n"); - } - my @lines = split("\n", $data); - foreach my $line (@lines) { - if ($line =~ /^LifeTime:\.+(.*)/) { $pkt_lifetime = $1; } - } - $pkt_life_prompt = sprintf(" (LT: %2s)", $pkt_lifetime); - } - foreach my $port (1 .. 
$num_ports) { - my $hr = $IBswcountlimits::link_ends{$switch}{$port}; - if ($switch_prompt eq "no" && !$line_mode) { - my $switch_name = ""; - my $tmp_port = $port; - while ($switch_name eq "" && $tmp_port <= $num_ports) { - # the first port is down find switch name with up port - my $hr = $IBswcountlimits::link_ends{$switch}{$tmp_port}; - $switch_name = $hr->{loc_desc}; - $tmp_port++; - } - if ($switch_name eq "") { - printf( - "WARNING: Switch Name not found for $switch\n"); - } - push( - @output_lines, - sprintf( - "Switch %18s %s%s:\n", - $switch, $switch_name, $pkt_life_prompt - ) - ); - $switch_prompt = "yes"; - } - my $data = - `smpquery $extra_smpquery_params -G portinfo $switch $port`; - if ($data eq "") { - printf( - "ERROR: failed to get portinfo for $switch port $port\n"); - } - my @lines = split("\n", $data); - my $speed = ""; - my $speed_sup = ""; - my $speed_enable = ""; - my $width = ""; - my $width_sup = ""; - my $width_enable = ""; - my $state = ""; - my $hoq_life = ""; - my $vl_stall = ""; - my $phy_link_state = ""; - - foreach my $line (@lines) { - if ($line =~ /^LinkSpeedActive:\.+(.*)/) { $speed = $1; } - if ($line =~ /^LinkSpeedEnabled:\.+(.*)/) { - $speed_enable = $1; - } - if ($line =~ /^LinkSpeedSupported:\.+(.*)/) { $speed_sup = $1; } - if ($line =~ /^LinkWidthActive:\.+(.*)/) { $width = $1; } - if ($line =~ /^LinkWidthEnabled:\.+(.*)/) { - $width_enable = $1; - } - if ($line =~ /^LinkWidthSupported:\.+(.*)/) { $width_sup = $1; } - if ($line =~ /^LinkState:\.+(.*)/) { $state = $1; } - if ($line =~ /^HoqLife:\.+(.*)/) { $hoq_life = $1; } - if ($line =~ /^VLStallCount:\.+(.*)/) { $vl_stall = $1; } - if ($line =~ /^PhysLinkState:\.+(.*)/) { $phy_link_state = $1; } - } - my $rem_port = $hr->{rem_port}; - my $rem_lid = $hr->{rem_lid}; - my $rem_speed_sup = ""; - my $rem_speed_enable = ""; - my $rem_width_sup = ""; - my $rem_width_enable = ""; - if ($rem_lid ne "" && $rem_port ne "") { - $data = - `smpquery $extra_smpquery_params portinfo 
$rem_lid $rem_port`; - if ($data eq "") { - printf( - "ERROR: failed to get portinfo for $switch port $port\n" - ); - } - my @lines = split("\n", $data); - foreach my $line (@lines) { - if ($line =~ /^LinkSpeedEnabled:\.+(.*)/) { - $rem_speed_enable = $1; - } - if ($line =~ /^LinkSpeedSupported:\.+(.*)/) { - $rem_speed_sup = $1; - } - if ($line =~ /^LinkWidthEnabled:\.+(.*)/) { - $rem_width_enable = $1; - } - if ($line =~ /^LinkWidthSupported:\.+(.*)/) { - $rem_width_sup = $1; - } - } - } - my $capabilities = ""; - if ($print_extended_cap) { - $capabilities = sprintf("(%3s %s %6s / %8s [%s/%s][%s/%s])", - $width, $speed, $state, $phy_link_state, $width_enable, - $width_sup, $speed_enable, $speed_sup); - } else { - $capabilities = sprintf("(%3s %s %6s / %8s)", - $width, $speed, $state, $phy_link_state); - } - if ($print_add_switch) { - $port_timeouts = - sprintf(" (HOQ:%s VL_Stall:%s)", $hoq_life, $vl_stall); - } - if (!$only_down_links || ($only_down_links && $state eq "Down")) { - my $width_msg = ""; - my $speed_msg = ""; - if ($rem_width_enable ne "" && $rem_width_sup ne "") { - if ( $width_enable =~ /12X/ - && $rem_width_enable =~ /12X/ - && $width !~ /12X/) - { - $width_msg = "Could be 12X"; - } else { - if ( $width_enable =~ /8X/ - && $rem_width_enable =~ /8X/ - && $width !~ /8X/) - { - $width_msg = "Could be 8X"; - } else { - if ( $width_enable =~ /4X/ - && $rem_width_enable =~ /4X/ - && $width !~ /4X/) - { - $width_msg = "Could be 4X"; - } - } - } - } - if ($rem_speed_enable ne "" && $rem_speed_sup ne "") { - if ( $speed_enable =~ /10\.0/ - && $rem_speed_enable =~ /10\.0/ - && $speed !~ /10\.0/) - { - $speed_msg = "Could be 10.0 Gbps"; - } else { - if ( $speed_enable =~ /5\.0/ - && $rem_speed_enable =~ /5\.0/ - && $speed !~ /5\.0/) - { - $speed_msg = "Could be 5.0 Gbps"; - } - } - } - - if ($line_mode) { - my $line_begin = sprintf("%18s \"%30s\"%s", - $switch, $hr->{loc_desc}, $pkt_life_prompt); - my $ext_guid = sprintf("%18s", $hr->{rem_guid}); - if 
($print_port_guids && $hr->{rem_port_guid} ne "") { - $ext_guid = sprintf("0x%016s", $hr->{rem_port_guid}); - } - push( - @output_lines, - sprintf( -"%s %6s %4s[%2s] ==%s%s==> %18s %6s %4s[%2s] \"%s\" ( %s %s)\n", - $line_begin, $hr->{loc_sw_lid}, - $port, $hr->{loc_ext_port}, - $capabilities, $port_timeouts, - $ext_guid, $hr->{rem_lid}, - $hr->{rem_port}, $hr->{rem_ext_port}, - $hr->{rem_desc}, $width_msg, - $speed_msg - ) - ); - } else { - push( - @output_lines, - sprintf( -" %6s %4s[%2s] ==%s%s==> %6s %4s[%2s] \"%s\" ( %s %s)\n", - $hr->{loc_sw_lid}, $port, - $hr->{loc_ext_port}, $capabilities, - $port_timeouts, $hr->{rem_lid}, - $hr->{rem_port}, $hr->{rem_ext_port}, - $hr->{rem_desc}, $width_msg, - $speed_msg - ) - ); - } - $print_switch = "yes"; - } - } - if ($print_switch eq "yes") { - foreach my $line (@output_lines) { print $line; } - } - } - if ($single_switch && $switch_found ne "yes") { - printf("Switch \"%s\" not found.\n", $single_switch); - } -} -main; - diff --git a/infiniband-diags/scripts/iblinkinfo.pl.in b/infiniband-diags/scripts/iblinkinfo.pl.in new file mode 100755 index 0000000..c81570d --- /dev/null +++ b/infiniband-diags/scripts/iblinkinfo.pl.in @@ -0,0 +1,40 @@ +#!/usr/bin/perl +# +# Copyright (c) 2009 Lawrence Livermore National Security +# +# Produced at Lawrence Livermore National Laboratory. +# Written by Ira Weiny . +# +# This software is available to you under a choice of one of two +# licenses. You may choose to be licensed under the terms of the GNU +# General Public License (GPL) Version 2, available from the file +# COPYING in the main directory of this source tree, or the +# OpenIB.org BSD license below: +# +# Redistribution and use in source and binary forms, with or +# without modification, are permitted provided that the following +# conditions are met: +# +# - Redistributions of source code must retain the above +# copyright notice, this list of conditions and the following +# disclaimer. 
+# +# - Redistributions in binary form must reproduce the above +# copyright notice, this list of conditions and the following +# disclaimer in the documentation and/or other materials +# provided with the distribution. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS +# BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN +# ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN +# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. +# + + +# this is now just a wrapper for the C based utility +$str = join " ",@ARGV; +exec "@IBSCRIPTPATH@/iblinkinfo $str"; diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c new file mode 100644 index 0000000..1e43788 --- /dev/null +++ b/infiniband-diags/src/iblinkinfo.c @@ -0,0 +1,386 @@ +/* + * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. + * Copyright (c) 2008 Lawrence Livermore National Lab. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +char *argv0 = "iblinkinfotest"; +static FILE *f; + +static char *node_name_map_file = NULL; +static nn_map_t *node_name_map = NULL; + +static int timeout_ms = 500; + +static int down_links_only = 0; +static int line_mode = 0; +static int add_sw_settings = 0; +static int print_port_guids = 0; + +static unsigned int +get_max(unsigned int num) +{ + unsigned int v = num; // 32-bit word to find the log base 2 of + unsigned r = 0; // r will be lg(v) + + while (v >>= 1) // unroll for more speed... + { + r++; + } + + return (1 << r); +} + +void +get_msg(char *width_msg, char *speed_msg, int msg_size, ibnd_port_t *port) +{ + char buf[64]; + uint32_t max_speed = 0; + + uint32_t max_width = get_max(mad_get_field(port->info, 0, + IB_PORT_LINK_WIDTH_SUPPORTED_F) + & mad_get_field(port->remoteport->info, 0, + IB_PORT_LINK_WIDTH_SUPPORTED_F)); + if ((max_width & mad_get_field(port->info, 0, + IB_PORT_LINK_WIDTH_ACTIVE_F)) == 0) { + // we are not at the max supported width + // print what we could be at. 
+ snprintf(width_msg, msg_size, "Could be %s", + mad_dump_val(IB_PORT_LINK_WIDTH_ACTIVE_F, + buf, 64, &max_width)); + } + + max_speed = get_max(mad_get_field(port->info, 0, + IB_PORT_LINK_SPEED_SUPPORTED_F) + & mad_get_field(port->remoteport->info, 0, + IB_PORT_LINK_SPEED_SUPPORTED_F)); + if ((max_speed & mad_get_field(port->info, 0, + IB_PORT_LINK_SPEED_ACTIVE_F)) == 0) { + // we are not at the max supported speed + // print what we could be at. + snprintf(speed_msg, msg_size, "Could be %s", + mad_dump_val(IB_PORT_LINK_SPEED_ACTIVE_F, + buf, 64, &max_speed)); + } +} + +void +print_port(ibnd_node_t *node, ibnd_port_t *port) +{ + char width[64], speed[64], state[64], physstate[64]; + char remote_guid_str[256]; + char remote_str[256]; + char link_str[256]; + char width_msg[256]; + char speed_msg[256]; + char ext_port_str[256]; + int iwidth = mad_get_field(port->info, 0, IB_PORT_LINK_WIDTH_ACTIVE_F); + int ispeed = mad_get_field(port->info, 0, IB_PORT_LINK_SPEED_ACTIVE_F); + int istate = mad_get_field(port->info, 0, IB_PORT_STATE_F); + int iphystate = mad_get_field(port->info, 0, IB_PORT_PHYS_STATE_F); + int n = 0; + + if (!port) + return; + + remote_guid_str[0] = '\0'; + remote_str[0] = '\0'; + link_str[0] = '\0'; + width_msg[0] = '\0'; + speed_msg[0] = '\0'; + + n = snprintf(link_str, 256, "(%3s %s %6s/%8s)", + mad_dump_val(IB_PORT_LINK_WIDTH_ACTIVE_F, width, 64, &iwidth), + mad_dump_val(IB_PORT_LINK_SPEED_ACTIVE_F, speed, 64, &ispeed), + mad_dump_val(IB_PORT_STATE_F, state, 64, &istate), + mad_dump_val(IB_PORT_PHYS_STATE_F, physstate, 64, &iphystate)); + + if (add_sw_settings) + snprintf(link_str+n, 256-n, + " (HOQ:%d VL_Stall:%d)", + mad_get_field(port->info, 0, IB_PORT_HOQ_LIFE_F), + mad_get_field(port->info, 0, IB_PORT_VL_STALL_COUNT_F)); + + if (port->remoteport) { + char *remap = remap_node_name(node_name_map, port->remoteport->node->guid, + port->remoteport->node->nodedesc); + + if (port->remoteport->ext_portnum) + snprintf(ext_port_str, 256, "%d", 
port->remoteport->ext_portnum); + else + ext_port_str[0] = '\0'; + + get_msg(width_msg, speed_msg, 256, port); + + if (line_mode) { + if (print_port_guids) + snprintf(remote_guid_str, 256, "0x%016"PRIx64" ", + port->remoteport->guid); + else + snprintf(remote_guid_str, 256, "0x%016"PRIx64" ", + port->remoteport->node->guid); + } + + snprintf(remote_str, 256, + "%s%6d %4d[%2s] \"%s\" (%s %s)\n", + remote_guid_str, + port->remoteport->base_lid ? port->remoteport->base_lid : + port->remoteport->node->smalid, + port->remoteport->portnum, + ext_port_str, + remap, + width_msg, + speed_msg); + free(remap); + } else + snprintf(remote_str, 256, " [ ] \"\" ( )\n"); + + if (port->ext_portnum) + snprintf(ext_port_str, 256, "%d", port->ext_portnum); + else + ext_port_str[0] = '\0'; + + if (line_mode) { + char *remap = remap_node_name(node_name_map, node->guid, + node->nodedesc); + printf("0x%016"PRIx64" \"%30s\" ", node->guid, remap); + free(remap); + } else + printf(" "); + + printf("%6d %4d[%2s] ==%s==> %s", + node->smalid, port->portnum, ext_port_str, link_str, remote_str); +} + +void +print_switch(ibnd_node_t *node, void *user_data) +{ + int i = 0; + + if (!line_mode) { + char *remap = remap_node_name(node_name_map, node->guid, + node->nodedesc); + printf("Switch 0x%016"PRIx64" %s:\n", node->guid, remap); + free(remap); + } + + for (i = 1; i <= node->numports; i++) { + ibnd_port_t *port = node->ports[i]; + if (!port) + continue; + if (!down_links_only || + mad_get_field(port->info, 0, IB_PORT_STATE_F) == IB_LINK_DOWN) { + print_port(node, port); + } + } +} + +void +usage(void) +{ + fprintf(stderr, + "Usage: %s [-hclp -S -D -C -P ]\n" + " Report link speed and connection for each port of each switch which is active\n" + " -h This help message\n" + " -S output only the node specified by guid\n" + " -D print only node specified by \n" + " -f specify node to start \"from\"\n" + " -n Number of hops to include away from specified node\n" + " -d print only down links\n" + " -l 
(line mode) print all information for each link on each line\n" + " -p print additional switch settings (PktLifeTime,HoqLife,VLStallCount)\n" + + + " -t timeout for any single fabric query\n" + " -s show progress during scan\n" + " --node-name-map use specified node name map\n" + + " -C use selected Channel Adaptor name for queries\n" + " -P use selected channel adaptor port for queries\n" + " -g print port guids instead of node guids\n" + " --debug print debug messages\n" + " -R (this option is obsolete and does nothing)\n" + , + argv0); + exit(-1); +} + +int +main(int argc, char **argv) +{ + char *ca = 0; + int ca_port = 0; + ibnd_fabric_t *fabric = NULL; + uint64_t guid = 0; + char *dr_path = NULL; + char *from = NULL; + int hops = 0; + ib_portid_t port_id; + + static char const str_opts[] = "S:D:n:C:P:t:sldgphuf:R"; + static const struct option long_opts[] = { + { "S", 1, 0, 'S'}, + { "D", 1, 0, 'D'}, + { "num-hops", 1, 0, 'n'}, + { "down-links-only", 0, 0, 'd'}, + { "line-mode", 0, 0, 'l'}, + { "ca-name", 1, 0, 'C'}, + { "ca-port", 1, 0, 'P'}, + { "timeout", 1, 0, 't'}, + { "show", 0, 0, 's'}, + { "print-port-guids", 0, 0, 'g'}, + { "print-additional", 0, 0, 'p'}, + { "help", 0, 0, 'h'}, + { "usage", 0, 0, 'u'}, + { "node-name-map", 1, 0, 1}, + { "debug", 0, 0, 2}, + { "compat", 0, 0, 3}, + { "from", 1, 0, 'f'}, + { "R", 0, 0, 'R'}, + { } + }; + + f = stdout; + + argv0 = argv[0]; + + while (1) { + int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); + if ( ch == -1 ) + break; + switch(ch) { + case 1: + node_name_map_file = strdup(optarg); + break; + case 2: + ibnd_debug(1); + break; + case 'f': + from = strdup(optarg); + break; + case 'C': + ca = strdup(optarg); + break; + case 'P': + ca_port = strtoul(optarg, 0, 0); + break; + case 'D': + dr_path = strdup(optarg); + break; + case 'n': + hops = (int)strtol(optarg, NULL, 0); + break; + case 'd': + down_links_only = 1; + break; + case 'l': + line_mode = 1; + break; + case 't': + timeout_ms = 
strtoul(optarg, 0, 0); + break; + case 's': + ibnd_show_progress(1); + break; + case 'g': + print_port_guids = 1; + break; + case 'S': + guid = (uint64_t)strtoull(optarg, 0, 0); + break; + case 'p': + add_sw_settings = 1; + break; + case 'R': + /* GNDN */ + break; + default: + usage(); + break; + } + } + argc -= optind; + argv += optind; + + if (argc && !(f = fopen(argv[0], "w"))) + fprintf(stderr, "can't open file %s for writing", argv[0]); + + node_name_map = open_node_name_map(node_name_map_file); + + if (from) { + /* only scan part of the fabric */ + str2drpath(&(port_id.drpath), from, 0, 0); + if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, &port_id, hops)) == NULL) { + fprintf(stderr, "discover failed\n"); + exit(1); + } + guid = 0; + } else { + if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, NULL, -1)) == NULL) { + fprintf(stderr, "discover failed\n"); + exit(1); + } + } + + if (guid) { + ibnd_node_t *sw = ibnd_find_node_guid(fabric, guid); + print_switch(sw, NULL); + } else if (dr_path) { + ibnd_node_t *sw = ibnd_find_node_dr(fabric, dr_path); + print_switch(sw, NULL); + } else { + ibnd_iter_nodes_type(fabric, print_switch, IB_NODE_SWITCH, NULL); + } + + ibnd_destroy_fabric(fabric); + + close_node_name_map(node_name_map); + exit(0); +} -- 1.5.4.5 From weiny2 at llnl.gov Fri Apr 3 15:42:51 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 3 Apr 2009 15:42:51 -0700 Subject: [ofa-general] [PATCH v3 1/3] Create a new library libibnetdisc Message-ID: <20090403154251.dec181f2.weiny2@llnl.gov> From e1c2c10678b0d1d90f7eb31eb1c1441b5ee43311 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Fri, 3 Apr 2009 15:28:08 -0700 Subject: [PATCH] Create a new library libibnetdisc This encompasses the functionality of ibnetdiscover in a C library. It returns a single "ibnd_fabric_t" object which represents the data found during the scan. 
The NodeInfo, PortInfo, and SwitchInfo are preserved from the queries made on the fabric to be used by the calling function as they see fit. This greatly benefits some diags like iblinkinfo.pl. This diag in particular was re-written using this library in C and has shown an 85% speed up on a ~1000 node cluster. Previous iblinkinfo.pl real 3m35.876s user 0m13.210s sys 1m1.046s New iblinkinfotest real 0m32.869s user 0m0.067s sys 0m0.140s Signed-off-by: Ira Weiny --- infiniband-diags/Makefile.am | 1 + infiniband-diags/configure.in | 18 +- infiniband-diags/libibnetdisc/Makefile.am | 47 ++ .../libibnetdisc/include/infiniband/ibnetdisc.h | 188 +++++ infiniband-diags/libibnetdisc/libibnetdisc.ver | 9 + infiniband-diags/libibnetdisc/man/ibnd_debug.3 | 2 + .../libibnetdisc/man/ibnd_destroy_fabric.3 | 2 + .../libibnetdisc/man/ibnd_discover_fabric.3 | 49 ++ .../libibnetdisc/man/ibnd_find_node_dr.3 | 2 + .../libibnetdisc/man/ibnd_find_node_guid.3 | 25 + .../libibnetdisc/man/ibnd_iter_nodes.3 | 24 + .../libibnetdisc/man/ibnd_iter_nodes_type.3 | 2 + .../libibnetdisc/man/ibnd_show_progress.3 | 2 + .../libibnetdisc/man/ibnd_update_node.3 | 21 + infiniband-diags/libibnetdisc/src/chassis.c | 830 ++++++++++++++++++++ infiniband-diags/libibnetdisc/src/chassis.h | 85 ++ infiniband-diags/libibnetdisc/src/ibnetdisc.c | 700 +++++++++++++++++ infiniband-diags/libibnetdisc/src/internal.h | 95 +++ infiniband-diags/libibnetdisc/src/libibnetdisc.map | 27 + infiniband-diags/libibnetdisc/test/testleaks.c | 179 +++++ 20 files changed, 2304 insertions(+), 4 deletions(-) create mode 100644 infiniband-diags/libibnetdisc/Makefile.am create mode 100644 infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h create mode 100644 infiniband-diags/libibnetdisc/libibnetdisc.ver create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_debug.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_destroy_fabric.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 create mode 
100644 infiniband-diags/libibnetdisc/man/ibnd_find_node_dr.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_find_node_guid.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_iter_nodes.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_iter_nodes_type.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_show_progress.3 create mode 100644 infiniband-diags/libibnetdisc/man/ibnd_update_node.3 create mode 100644 infiniband-diags/libibnetdisc/src/chassis.c create mode 100644 infiniband-diags/libibnetdisc/src/chassis.h create mode 100644 infiniband-diags/libibnetdisc/src/ibnetdisc.c create mode 100644 infiniband-diags/libibnetdisc/src/internal.h create mode 100644 infiniband-diags/libibnetdisc/src/libibnetdisc.map create mode 100644 infiniband-diags/libibnetdisc/test/testleaks.c diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am index f9cc5bd..7b8523a 100644 --- a/infiniband-diags/Makefile.am +++ b/infiniband-diags/Makefile.am @@ -1,3 +1,4 @@ +SUBDIRS = libibnetdisc INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband diff --git a/infiniband-diags/configure.in b/infiniband-diags/configure.in index 4495a53..2b73167 100644 --- a/infiniband-diags/configure.in +++ b/infiniband-diags/configure.in @@ -44,7 +44,7 @@ fi dnl Checks for header files. AC_HEADER_STDC -AC_CHECK_HEADERS([stdlib.h string.h unistd.h fcntl.h inttypes.h netinet/in.h sys/ioctl.h syslog.h]) +AC_CHECK_HEADERS([stdlib.h string.h unistd.h fcntl.h inttypes.h netinet/in.h sys/ioctl.h]) if test "$disable_libcheck" != "yes" then AC_CHECK_HEADER(infiniband/umad.h, [], @@ -58,7 +58,7 @@ fi dnl Checks for library functions AC_FUNC_ERROR_AT_LINE AC_FUNC_VPRINTF -AC_CHECK_FUNCS([strchr strrchr strtol strtoul memset]) +AC_CHECK_FUNCS([strchr strrchr strtol strtoul memset strtoull]) dnl Checks for typedefs, structures, and compiler characteristics. 
AC_C_CONST @@ -66,7 +66,7 @@ AC_C_CONST dnl Check if we should include test utilities AC_MSG_CHECKING(for --enable-test-utils) AC_ARG_ENABLE(test-utils, -[ --enable-test-utils build additional test utilities], +[ --enable-test-utils build additional test utilities (default=no)], [case "${enableval}" in yes) tutils=yes ;; no) tutils=no ;; @@ -136,6 +136,15 @@ IBSCRIPTPATH_TMP2="`echo $IBSCRIPTPATH_TMP1 | sed 's/^NONE/$ac_default_prefix/'` IBSCRIPTPATH="`eval echo $IBSCRIPTPATH_TMP2`" AC_SUBST(IBSCRIPTPATH) +dnl Begin libibnetdisc stuff +ibnetdisc_api_version=`grep LIBVERSION $srcdir/libibnetdisc/libibnetdisc.ver | sed 's/LIBVERSION=//'` +if test -z $ibnetdisc_api_version; then + echo "FAILED to find $srcdir/libibnetdisc/libibnetdisc.ver" + exit 1 +fi +AC_SUBST(ibnetdisc_api_version) +dnl End libibnetdisc stuff + AC_CONFIG_FILES([\ Makefile \ infiniband-diags.spec \ @@ -156,6 +165,7 @@ AC_CONFIG_FILES([\ scripts/ibhosts \ scripts/ibnodes \ scripts/ibswitches \ - scripts/ibrouters + scripts/ibrouters \ + libibnetdisc/Makefile ]) AC_OUTPUT diff --git a/infiniband-diags/libibnetdisc/Makefile.am b/infiniband-diags/libibnetdisc/Makefile.am new file mode 100644 index 0000000..e6e3620 --- /dev/null +++ b/infiniband-diags/libibnetdisc/Makefile.am @@ -0,0 +1,47 @@ + +#SUBDIRS = . 
+ +INCLUDES = -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband + +lib_LTLIBRARIES = libibnetdisc.la +sbin_PROGRAMS = + +if ENABLE_TEST_UTILS +sbin_PROGRAMS += test/testleaks +endif + +DBGFLAGS = -g + +if HAVE_LD_VERSION_SCRIPT +libibnetdisc_version_script = -Wl,--version-script=$(srcdir)/src/libibnetdisc.map +else +libibnetdisc_version_script = +endif + +libibnetdisc_la_SOURCES = src/ibnetdisc.c src/chassis.c src/chassis.h +libibnetdisc_la_CFLAGS = -Wall $(DBGFLAGS) +libibnetdisc_la_LDFLAGS = -version-info $(ibnetdisc_api_version) \ + -export-dynamic $(libibnetdisc_version_script) \ + -libmad +libibnetdisc_la_DEPENDENCIES = $(srcdir)/src/libibnetdisc.map + +libibnetdiscincludedir = $(includedir)/infiniband + +test_testleaks_SOURCES = test/testleaks.c +test_testleaks_CFLAGS = -Wall $(DBGFLAGS) +test_testleaks_LDFLAGS = -libnetdisc + +libibnetdiscinclude_HEADERS = $(srcdir)/include/infiniband/ibnetdisc.h + +man_MANS = man/ibnd_debug.3 \ + man/ibnd_destroy_fabric.3 \ + man/ibnd_discover_fabric.3 \ + man/ibnd_find_node_dr.3 \ + man/ibnd_find_node_guid.3 \ + man/ibnd_iter_nodes.3 \ + man/ibnd_iter_nodes_type.3 \ + man/ibnd_show_progress.3 \ + man/ibnd_update_node.3 + +EXTRA_DIST = $(srcdir)/src/libibnetdisc.map libibnetdisc.ver + diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h new file mode 100644 index 0000000..a882994 --- /dev/null +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h @@ -0,0 +1,188 @@ +/* + * Copyright (c) 2008 Lawrence Livermore National Lab. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef _IBNETDISC_H_ +#define _IBNETDISC_H_ + +#include +#include +#include + +struct ib_fabric; /* forward declare */ +struct chassis; /* forward declare */ +struct port; /* forward declare */ + +/** ========================================================================= + * Node + */ +typedef struct node { + struct node *next; /* all node list in fabric */ + struct ib_fabric *fabric; /* the fabric node belongs to */ + + ib_portid_t path_portid; /* path from "from_node" */ + int dist; /* num of hops from "from_node" */ + int smalid; + int smalmc; + + /* quick cache of switchinfo below */ + int smaenhsp0; + /* use libibmad decoder functions for switchinfo */ + //WHY does this not work??? 
+ //uint8_t switchinfo[sizeof (ib_switch_info_t)]; + uint8_t switchinfo[64]; + + /* quick cache of info below */ + uint64_t guid; + int type; + int numports; + /* use libibmad decoder functions for info */ + uint8_t info[sizeof(ib_node_info_t)]; + + char nodedesc[IB_NODE_DESCRIPTION_SIZE]; + + struct port **ports; /* in order array of port pointers */ + /* the size of this array is info.numports + 1 */ + /* items MAY BE NULL! (ie 0 == switches only) */ + + /* chassis info */ + struct node *next_chassis_node; /* next node in ibnd_chassis_t->nodes */ + struct chassis *chassis; /* if != NULL the chassis this node belongs to */ + unsigned char ch_type; + unsigned char ch_anafanum; + unsigned char ch_slotnum; + unsigned char ch_slot; +} ibnd_node_t; + +/** ========================================================================= + * Port + */ +typedef struct port { + uint64_t guid; + int portnum; + int ext_portnum; /* optional if != 0 external port num */ + ibnd_node_t *node; /* node this port belongs to */ + struct port *remoteport; /* null if SMA, or does not exist */ + /* quick cache of info below */ + uint16_t base_lid; + uint8_t lmc; + /* use libibmad decoder functions for info */ + uint8_t info[sizeof(ib_port_info_t)]; +} ibnd_port_t; + + +/** ========================================================================= + * Chassis + */ +typedef struct chassis { + struct chassis *next; + uint64_t chassisguid; + int chassisnum; + + /* generic grouping by SystemImageGUID */ + int nodecount; + ibnd_node_t *nodes; + + /* specific to voltaire type nodes */ +#define SPINES_MAX_NUM 12 +#define LINES_MAX_NUM 36 + ibnd_node_t *spinenode[SPINES_MAX_NUM + 1]; + ibnd_node_t *linenode[LINES_MAX_NUM + 1]; +} ibnd_chassis_t; + +/** ========================================================================= + * Fabric + * Main fabric object which is returned and represents the data discovered + */ +typedef struct ib_fabric { + /* the node the discover was initiated from + * "from" 
parameter in ibnd_discover_fabric + * or by default the node you are running on + */ + ibnd_node_t *from_node; + /* NULL terminated list of all nodes in the fabric */ + ibnd_node_t *nodes; + /* NULL terminated list of all chassis found in the fabric */ + ibnd_chassis_t *chassis; + int maxhops_discovered; +} ibnd_fabric_t; + + +/** ========================================================================= + * Initialization (fabric operations) + */ +void ibnd_debug(int i); +void ibnd_show_progress(int i); + +ibnd_fabric_t *ibnd_discover_fabric(char *dev_name, int dev_port, + int timeout_ms, ib_portid_t *from, int hops); + /** + * dev_name: (required) local device name to use to access the fabric + * dev_port: (required) local device port to use to access the fabric + * timeout_ms: (required) gives the timeout for a _SINGLE_ query on + * the fabric. So if there are multiple nodes not + * responding this may result in a lengthy delay. + * from: (optional) specify the node to start scanning from. + * If NULL start from the node we are running on. + * hops: (optional) Specify how much of the fabric to traverse.
+ * negative value == scan entire fabric + */ +void ibnd_destroy_fabric(ibnd_fabric_t *fabric); + +/** ========================================================================= + * Node operations + */ +ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t *fabric, uint64_t guid); +ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t *fabric, char *dr_str); +ibnd_node_t *ibnd_update_node(ibnd_node_t *node); + +typedef void (*ibnd_iter_node_func_t)(ibnd_node_t *node, void *user_data); +void ibnd_iter_nodes(ibnd_fabric_t *fabric, + ibnd_iter_node_func_t func, + void *user_data); +void ibnd_iter_nodes_type(ibnd_fabric_t *fabric, + ibnd_iter_node_func_t func, + int node_type, + void *user_data); + +/** ========================================================================= + * Chassis queries + */ +uint64_t ibnd_get_chassis_guid(ibnd_fabric_t *fabric, unsigned char chassisnum); +char *ibnd_get_chassis_type(ibnd_node_t *node); +char *ibnd_get_chassis_slot_str(ibnd_node_t *node, char *str, size_t size); + +int ibnd_is_xsigo_guid(uint64_t guid); +int ibnd_is_xsigo_tca(uint64_t guid); +int ibnd_is_xsigo_hca(uint64_t guid); + +#endif /* _IBNETDISC_H_ */ diff --git a/infiniband-diags/libibnetdisc/libibnetdisc.ver b/infiniband-diags/libibnetdisc/libibnetdisc.ver new file mode 100644 index 0000000..a0a5f3c --- /dev/null +++ b/infiniband-diags/libibnetdisc/libibnetdisc.ver @@ -0,0 +1,9 @@ +# In this file we track the current API version +# of the IB net discover interface (and libraries) +# The version is built of the following +# three numbers: +# API_REV:RUNNING_REV:AGE +# API_REV - advance on any added API +# RUNNING_REV - advance on any change to the vendor files +# AGE - number of backward versions the API still supports +LIBVERSION=1:0:0 diff --git a/infiniband-diags/libibnetdisc/man/ibnd_debug.3 b/infiniband-diags/libibnetdisc/man/ibnd_debug.3 new file mode 100644 index 0000000..a4076fc --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_debug.3 @@ -0,0 +1,2 @@ +.\".TH IBND_DEBUG 
3 "Aug 04, 2008" "OpenIB" "OpenIB Programmer's Manual" +.so man3/ibnd_discover_fabric.3 diff --git a/infiniband-diags/libibnetdisc/man/ibnd_destroy_fabric.3 b/infiniband-diags/libibnetdisc/man/ibnd_destroy_fabric.3 new file mode 100644 index 0000000..8fe20ae --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_destroy_fabric.3 @@ -0,0 +1,2 @@ +.\".TH IBND_DESTROY_FABRIC 3 "Aug 04, 2008" "OpenIB" "OpenIB Programmer's Manual" +.so man3/ibnd_discover_fabric.3 diff --git a/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 b/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 new file mode 100644 index 0000000..44d8c65 --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 @@ -0,0 +1,49 @@ +.TH IBND_DISCOVER_FABRIC 3 "July 25, 2008" "OpenIB" "OpenIB Programmer's Manual" +.SH "NAME" +ibnd_discover_fabric, ibnd_destroy_fabric, ibnd_debug, ibnd_show_progress \- initialize ibnetdiscover library. +.SH "SYNOPSIS" +.nf +.B #include +.sp +.BI "ibnd_fabric_t *ibnd_discover_fabric(char *dev_name, int dev_port, int timeout_ms, ib_portid_t *from, int hops)" +.BI "void ibnd_destroy_fabric(ibnd_fabric_t *fabric)" +.BI "void ibnd_debug(int i)" +.BI "void ibnd_show_progress(int i)" + + +.SH "DESCRIPTION" +.B ibnd_discover_fabric() +Discover the fabric connected to the port specified by dev_name and dev_port, using the specified timeout. The "from" and "hops" parameters are optional and allow one to scan part of a fabric by specifying a node "from" and a number of hops away from that node to scan, "hops". This gives the user a "sub-fabric" which is "centered" anywhere they choose. + +.B ibnd_destroy_fabric() +Free all memory and resources associated with the fabric. + +.B ibnd_debug() +Set the debug level to be printed as library operations take place. + +.B ibnd_show_progress() +Indicate that the library should print debug output which shows its progress +through the fabric. 
+ +.SH "RETURN VALUE" +.B ibnd_discover_fabric() +return NULL on failure, otherwise a valid ibnd_fabric_t object. + +.B ibnd_destroy_fabric(), ibnd_debug(), ibnd_show_progress() +NONE + +.SH "EXAMPLES" + +.B Discover the entire fabric connected to device "mthca0", port 1. + + ibnd_discover_fabric("mthca0", 1, 100, NULL, 0); + +.B Discover only a single node and those nodes connected to it. + + str2drpath(&(port_id.drpath), from, 0, 0); + + ibnd_discover_fabric("mthca0", 1, 100, &port_id, 1); + +.SH "AUTHORS" +.TP +Ira Weiny diff --git a/infiniband-diags/libibnetdisc/man/ibnd_find_node_dr.3 b/infiniband-diags/libibnetdisc/man/ibnd_find_node_dr.3 new file mode 100644 index 0000000..612e501 --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_find_node_dr.3 @@ -0,0 +1,2 @@ +.\".TH IBND_FIND_NODE_DR 3 "Aug 04, 2008" "OpenIB" "OpenIB Programmer's Manual" +.so man3/ibnd_find_node_guid.3 diff --git a/infiniband-diags/libibnetdisc/man/ibnd_find_node_guid.3 b/infiniband-diags/libibnetdisc/man/ibnd_find_node_guid.3 new file mode 100644 index 0000000..676b528 --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_find_node_guid.3 @@ -0,0 +1,25 @@ +.TH IBND_FIND_NODE_GUID 3 "July 25, 2008" "OpenIB" "OpenIB Programmer's Manual" +.SH "NAME" +ibnd_find_node_guid, ibnd_find_node_dr \- given a fabric object find the node object within it which matches the guid or directed route specified. + +.SH "SYNOPSIS" +.nf +.B #include +.sp +.BI "ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t *fabric, uint64_t guid)" +.BI "ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t *fabric, char *dr_str)" + +.SH "DESCRIPTION" +.B ibnd_find_node_guid() +Given a fabric object and a guid, return the ibnd_node_t object with that node guid. +.B ibnd_find_node_dr() +Given a fabric object and a directed route, return the ibnd_node_t object with +that directed route. + +.SH "RETURN VALUE" +.B ibnd_find_node_guid(), ibnd_find_node_dr() +return NULL on failure, otherwise a valid ibnd_node_t object. 
+ +.SH "AUTHORS" +.TP +Ira Weiny diff --git a/infiniband-diags/libibnetdisc/man/ibnd_iter_nodes.3 b/infiniband-diags/libibnetdisc/man/ibnd_iter_nodes.3 new file mode 100644 index 0000000..7199dfb --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_iter_nodes.3 @@ -0,0 +1,24 @@ +.TH IBND_ITER_NODES 3 "July 25, 2008" "OpenIB" "OpenIB Programmer's Manual" +.SH "NAME" +ibnd_iter_nodes, ibnd_iter_nodes_type \- given a fabric object and a function, iterate over the nodes in the fabric. + +.SH "SYNOPSIS" +.nf +.B #include +.sp +.BI "void ibnd_iter_nodes(ibnd_fabric_t *fabric, ibnd_iter_node_func_t func, void *user_data)" +.BI "void ibnd_iter_nodes_type(ibnd_fabric_t *fabric, ibnd_iter_node_func_t func, int node_type, void *user_data)" + +.SH "DESCRIPTION" +.B ibnd_iter_nodes() +Iterate through all the nodes in the fabric and call "func" on them. +.B ibnd_iter_nodes_type() +The same as ibnd_iter_nodes except to limit the iteration to the nodes with the specified type. + +.SH "RETURN VALUE" +.B ibnd_iter_nodes(), ibnd_iter_nodes_type() +NONE + +.SH "AUTHORS" +.TP +Ira Weiny diff --git a/infiniband-diags/libibnetdisc/man/ibnd_iter_nodes_type.3 b/infiniband-diags/libibnetdisc/man/ibnd_iter_nodes_type.3 new file mode 100644 index 0000000..dc3ac8f --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_iter_nodes_type.3 @@ -0,0 +1,2 @@ +.\".TH IBND_ITER_NODES_TYPE 3 "Aug 04, 2008" "OpenIB" "OpenIB Programmer's Manual" +.so man3/ibnd_iter_nodes.3 diff --git a/infiniband-diags/libibnetdisc/man/ibnd_show_progress.3 b/infiniband-diags/libibnetdisc/man/ibnd_show_progress.3 new file mode 100644 index 0000000..280af31 --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_show_progress.3 @@ -0,0 +1,2 @@ +.\".TH IBND_SHOW_PROGRESS 3 "Nov 26, 2008" "OpenIB" "OpenIB Programmer's Manual" +.so man3/ibnd_discover_fabric.3 diff --git a/infiniband-diags/libibnetdisc/man/ibnd_update_node.3 b/infiniband-diags/libibnetdisc/man/ibnd_update_node.3 new file mode 100644 index 
0000000..d3aa206 --- /dev/null +++ b/infiniband-diags/libibnetdisc/man/ibnd_update_node.3 @@ -0,0 +1,21 @@ +.TH IBND_UPDATE_NODE 3 "July 25, 2008" "OpenIB" "OpenIB Programmer's Manual" +.SH "NAME" +ibnd_update_node \- Update the node specified with new data from the fabric. + +.SH "SYNOPSIS" +.nf +.B #include +.sp +.BI "ibnd_node_t *ibnd_update_node(ibnd_node_t *node)" + +.SH "DESCRIPTION" +.B ibnd_update_node() +Update the node info, port info, and node description of the node specified. + +.SH "RETURN VALUE" +.B ibnd_update_node() +Return NULL on failure, otherwise a valid ibnd_node_t object which is part of the fabric object. + +.SH "AUTHORS" +.TP +Ira Weiny diff --git a/infiniband-diags/libibnetdisc/src/chassis.c b/infiniband-diags/libibnetdisc/src/chassis.c new file mode 100644 index 0000000..a25d710 --- /dev/null +++ b/infiniband-diags/libibnetdisc/src/chassis.c @@ -0,0 +1,830 @@ +/* + * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. + * Copyright (c) 2008 Lawrence Livermore National Lab. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +/*========================================================*/ +/* FABRIC SCANNER SPECIFIC DATA */ +/*========================================================*/ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#include +#include + +#include + +#include "internal.h" +#include "chassis.h" + +static char *ChassisTypeStr[5] = { "", "ISR9288", "ISR9096", "ISR2012", "ISR2004" }; +static char *ChassisSlotTypeStr[4] = { "", "Line", "Spine", "SRBD" }; + +char *ibnd_get_chassis_type(ibnd_node_t *node) +{ + /* Currently, only if Voltaire chassis */ + if (mad_get_field(node->info, 0, IB_NODE_VENDORID_F) != VTR_VENDOR_ID) + return (NULL); + if (!node->chassis) + return (NULL); + if (node->ch_type == UNRESOLVED_CT + || node->ch_type > ISR2004_CT) + return (NULL); + return ChassisTypeStr[node->ch_type]; +} + +char *ibnd_get_chassis_slot_str(ibnd_node_t *node, char *str, size_t size) +{ + /* Currently, only if Voltaire chassis */ + if (mad_get_field(node->info, 0, IB_NODE_VENDORID_F) != VTR_VENDOR_ID) + return (NULL); + if (!node->chassis) + return (NULL); + if (node->ch_slot == UNRESOLVED_CS + || node->ch_slot > SRBD_CS) + return (NULL); + if (!str) + return (NULL); + snprintf(str, size, "%s %d Chip %d", + ChassisSlotTypeStr[node->ch_slot], + node->ch_slotnum, + node->ch_anafanum); + return (str); +} + +static ibnd_chassis_t *find_chassisnum(struct ibnd_fabric *fabric, unsigned char chassisnum) +{ + ibnd_chassis_t *current; + + for (current = fabric->first_chassis; 
current; current = current->next) { + if (current->chassisnum == chassisnum) + return current; + } + + return NULL; +} + +static uint64_t topspin_chassisguid(uint64_t guid) +{ + /* Byte 3 in system image GUID is chassis type, and */ + /* Byte 4 is location ID (slot) so just mask off byte 4 */ + return guid & 0xffffffff00ffffffULL; +} + +int ibnd_is_xsigo_guid(uint64_t guid) +{ + if ((guid & 0xffffff0000000000ULL) == 0x0013970000000000ULL) + return 1; + else + return 0; +} + +static int is_xsigo_leafone(uint64_t guid) +{ + if ((guid & 0xffffffffff000000ULL) == 0x0013970102000000ULL) + return 1; + else + return 0; +} + +int ibnd_is_xsigo_hca(uint64_t guid) +{ + /* NodeType 2 is HCA */ + if ((guid & 0xffffffff00000000ULL) == 0x0013970200000000ULL) + return 1; + else + return 0; +} + +int ibnd_is_xsigo_tca(uint64_t guid) +{ + /* NodeType 3 is TCA */ + if ((guid & 0xffffffff00000000ULL) == 0x0013970300000000ULL) + return 1; + else + return 0; +} + +static int is_xsigo_ca(uint64_t guid) +{ + if (ibnd_is_xsigo_hca(guid) || ibnd_is_xsigo_tca(guid)) + return 1; + else + return 0; +} + +static int is_xsigo_switch(uint64_t guid) +{ + if ((guid & 0xffffffff00000000ULL) == 0x0013970100000000ULL) + return 1; + else + return 0; +} + +static uint64_t xsigo_chassisguid(ibnd_node_t *node) +{ + uint64_t sysimgguid = mad_get_field64(node->info, 0, IB_NODE_SYSTEM_GUID_F); + if (!is_xsigo_ca(sysimgguid)) { + /* Byte 3 is NodeType and byte 4 is PortType */ + /* If NodeType is 1 (switch), PortType is masked */ + if (is_xsigo_switch(sysimgguid)) + return sysimgguid & 0xffffffff00ffffffULL; + else + return sysimgguid; + } else { + if (!node->ports || !node->ports[1]) + return (0); + + /* Is there a peer port ? 
*/ + if (!node->ports[1]->remoteport) + return sysimgguid; + + /* If peer port is Leaf 1, use its chassis GUID */ + uint64_t remote_sysimgguid = mad_get_field64( + node->ports[1]->remoteport->node->info, + 0, IB_NODE_SYSTEM_GUID_F); + if (is_xsigo_leafone(remote_sysimgguid)) + return remote_sysimgguid & 0xffffffff00ffffffULL; + else + return sysimgguid; + } +} + +static uint64_t get_chassisguid(ibnd_node_t *node) +{ + uint32_t vendid = mad_get_field(node->info, 0, IB_NODE_VENDORID_F); + uint64_t sysimgguid = mad_get_field64(node->info, 0, IB_NODE_SYSTEM_GUID_F); + + if (vendid == TS_VENDOR_ID || vendid == SS_VENDOR_ID) + return topspin_chassisguid(sysimgguid); + else if (vendid == XS_VENDOR_ID || ibnd_is_xsigo_guid(sysimgguid)) + return xsigo_chassisguid(node); + else + return sysimgguid; +} + +static ibnd_chassis_t *find_chassisguid(ibnd_node_t *node) +{ + struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(node->fabric); + ibnd_chassis_t *current; + uint64_t chguid; + + chguid = get_chassisguid(node); + for (current = f->first_chassis; current; current = current->next) { + if (current->chassisguid == chguid) + return current; + } + + return NULL; +} + +uint64_t ibnd_get_chassis_guid(ibnd_fabric_t *fabric, unsigned char chassisnum) +{ + struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); + ibnd_chassis_t *chassis; + + chassis = find_chassisnum(f, chassisnum); + if (chassis) + return chassis->chassisguid; + else + return 0; +} + +static int is_router(struct ibnd_node *n) +{ + uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); + return (devid == VTR_DEVID_IB_FC_ROUTER || + devid == VTR_DEVID_IB_IP_ROUTER); +} + +static int is_spine_9096(struct ibnd_node *n) +{ + uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); + return (devid == VTR_DEVID_SFB4 || + devid == VTR_DEVID_SFB4_DDR); +} + +static int is_spine_9288(struct ibnd_node *n) +{ + uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); + return (devid == VTR_DEVID_SFB12 || + 
devid == VTR_DEVID_SFB12_DDR); +} + +static int is_spine_2004(struct ibnd_node *n) +{ + uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); + return (devid == VTR_DEVID_SFB2004); +} + +static int is_spine_2012(struct ibnd_node *n) +{ + uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); + return (devid == VTR_DEVID_SFB2012); +} + +static int is_spine(struct ibnd_node *n) +{ + return (is_spine_9096(n) || is_spine_9288(n) || + is_spine_2004(n) || is_spine_2012(n)); +} + +static int is_line_24(struct ibnd_node *n) +{ + uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); + return (devid == VTR_DEVID_SLB24 || + devid == VTR_DEVID_SLB24_DDR || + devid == VTR_DEVID_SRB2004); +} + +static int is_line_8(struct ibnd_node *n) +{ + uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); + return (devid == VTR_DEVID_SLB8); +} + +static int is_line_2024(struct ibnd_node *n) +{ + uint32_t devid = mad_get_field(n->node.info, 0, IB_NODE_DEVID_F); + return (devid == VTR_DEVID_SLB2024); +} + +static int is_line(struct ibnd_node *n) +{ + return (is_line_24(n) || is_line_8(n) || is_line_2024(n)); +} + +int is_chassis_switch(struct ibnd_node *n) +{ + return (is_spine(n) || is_line(n)); +} + +/* these structs help find Line (Anafa) slot number while using spine portnum */ +int line_slot_2_sfb4[25] = { 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 }; +int anafa_line_slot_2_sfb4[25] = { 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2 }; +int line_slot_2_sfb12[25] = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10, 10, 11, 11, 12, 12 }; +int anafa_line_slot_2_sfb12[25] = { 0, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 }; + +/* IPR FCR modules connectivity while using sFB4 port as reference */ +int ipr_slot_2_sfb4_port[25] = { 0, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1 }; + +/* these structs help find Spine (Anafa) 
slot number while using spine portnum */ +int spine12_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; +int anafa_spine12_slot_2_slb[25]= { 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; +int spine4_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; +int anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; +/* reference { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }; */ + +static void get_sfb_slot(struct ibnd_node *node, ibnd_port_t *lineport) +{ + ibnd_node_t *n = (ibnd_node_t *)node; + + n->ch_slot = SPINE_CS; + if (is_spine_9096(node)) { + n->ch_type = ISR9096_CT; + n->ch_slotnum = spine4_slot_2_slb[lineport->portnum]; + n->ch_anafanum = anafa_spine4_slot_2_slb[lineport->portnum]; + } else if (is_spine_9288(node)) { + n->ch_type = ISR9288_CT; + n->ch_slotnum = spine12_slot_2_slb[lineport->portnum]; + n->ch_anafanum = anafa_spine12_slot_2_slb[lineport->portnum]; + } else if (is_spine_2012(node)) { + n->ch_type = ISR2012_CT; + n->ch_slotnum = spine12_slot_2_slb[lineport->portnum]; + n->ch_anafanum = anafa_spine12_slot_2_slb[lineport->portnum]; + } else if (is_spine_2004(node)) { + n->ch_type = ISR2004_CT; + n->ch_slotnum = spine4_slot_2_slb[lineport->portnum]; + n->ch_anafanum = anafa_spine4_slot_2_slb[lineport->portnum]; + } else { + IBPANIC("Unexpected node found: guid 0x%016" PRIx64, + node->node.guid); + } +} + +static void get_router_slot(struct ibnd_node *node, ibnd_port_t *spineport) +{ + ibnd_node_t *n = (ibnd_node_t *)node; + int guessnum = 0; + + node->ch_found = 1; + + n->ch_slot = SRBD_CS; + if (is_spine_9096(CONV_NODE_INTERNAL(spineport->node))) { + n->ch_type = ISR9096_CT; + n->ch_slotnum = line_slot_2_sfb4[spineport->portnum]; + n->ch_anafanum = ipr_slot_2_sfb4_port[spineport->portnum]; + } else if 
(is_spine_9288(CONV_NODE_INTERNAL(spineport->node))) { + n->ch_type = ISR9288_CT; + n->ch_slotnum = line_slot_2_sfb12[spineport->portnum]; + /* this is a smart guess based on nodeguids order on sFB-12 module */ + guessnum = spineport->node->guid % 4; + /* module 1 <--> remote anafa 3 */ + /* module 2 <--> remote anafa 2 */ + /* module 3 <--> remote anafa 1 */ + n->ch_anafanum = (guessnum == 3 ? 1 : (guessnum == 1 ? 3 : 2)); + } else if (is_spine_2012(CONV_NODE_INTERNAL(spineport->node))) { + n->ch_type = ISR2012_CT; + n->ch_slotnum = line_slot_2_sfb12[spineport->portnum]; + /* this is a smart guess based on nodeguids order on sFB-12 module */ + guessnum = spineport->node->guid % 4; + // module 1 <--> remote anafa 3 + // module 2 <--> remote anafa 2 + // module 3 <--> remote anafa 1 + n->ch_anafanum = (guessnum == 3? 1 : (guessnum == 1 ? 3 : 2)); + } else if (is_spine_2004(CONV_NODE_INTERNAL(spineport->node))) { + n->ch_type = ISR2004_CT; + n->ch_slotnum = line_slot_2_sfb4[spineport->portnum]; + n->ch_anafanum = ipr_slot_2_sfb4_port[spineport->portnum]; + } else { + IBPANIC("Unexpected node found: guid 0x%016" PRIx64, + spineport->node->guid); + } +} + +static void get_slb_slot(ibnd_node_t *n, ibnd_port_t *spineport) +{ + n->ch_slot = LINE_CS; + if (is_spine_9096(CONV_NODE_INTERNAL(spineport->node))) { + n->ch_type = ISR9096_CT; + n->ch_slotnum = line_slot_2_sfb4[spineport->portnum]; + n->ch_anafanum = anafa_line_slot_2_sfb4[spineport->portnum]; + } else if (is_spine_9288(CONV_NODE_INTERNAL(spineport->node))) { + n->ch_type = ISR9288_CT; + n->ch_slotnum = line_slot_2_sfb12[spineport->portnum]; + n->ch_anafanum = anafa_line_slot_2_sfb12[spineport->portnum]; + } else if (is_spine_2012(CONV_NODE_INTERNAL(spineport->node))) { + n->ch_type = ISR2012_CT; + n->ch_slotnum = line_slot_2_sfb12[spineport->portnum]; + n->ch_anafanum = anafa_line_slot_2_sfb12[spineport->portnum]; + } else if (is_spine_2004(CONV_NODE_INTERNAL(spineport->node))) { + n->ch_type = ISR2004_CT; + 
n->ch_slotnum = line_slot_2_sfb4[spineport->portnum]; + n->ch_anafanum = anafa_line_slot_2_sfb4[spineport->portnum]; + } else { + IBPANIC("Unexpected node found: guid 0x%016" PRIx64, + spineport->node->guid); + } +} + +/* forward declare this */ +static void voltaire_portmap(ibnd_port_t *port); +/* + This function called for every Voltaire node in fabric + It could be optimized so, but time overhead is very small + and its only diag.util +*/ +static void fill_voltaire_chassis_record(struct ibnd_node *node) +{ + ibnd_node_t *n = (ibnd_node_t *)node; + int p = 0; + ibnd_port_t *port; + struct ibnd_node *remnode = 0; + + if (node->ch_found) /* somehow this node has already been passed */ + return; + node->ch_found = 1; + + /* node is router only in case of using unique lid */ + /* (which is lid of chassis router port) */ + /* in such case node->ports is actually a requested port... */ + if (is_router(node)) { + /* find the remote node */ + for (p = 1; p <= node->node.numports; p++) { + port = node->node.ports[p]; + if (port && is_spine(CONV_NODE_INTERNAL(port->remoteport->node))) + get_router_slot(node, port->remoteport); + } + } else if (is_spine(node)) { + for (p = 1; p <= node->node.numports; p++) { + port = node->node.ports[p]; + if (!port || !port->remoteport) + continue; + remnode = CONV_NODE_INTERNAL(port->remoteport->node); + if (remnode->node.type != IB_NODE_SWITCH) { + if (!remnode->ch_found) + get_router_slot(remnode, port); + continue; + } + if (!n->ch_type) + /* we assume here that remoteport belongs to line */ + get_sfb_slot(node, port->remoteport); + + /* we could break here, but need to find if more routers connected */ + } + + } else if (is_line(node)) { + for (p = 1; p <= node->node.numports; p++) { + port = node->node.ports[p]; + if (!port || port->portnum > 12 || !port->remoteport) + continue; + /* we assume here that remoteport belongs to spine */ + get_slb_slot(n, port->remoteport); + break; + } + } + + /* for each port of this node, map external 
ports */ + for (p = 1; p <= node->node.numports; p++) { + port = node->node.ports[p]; + if (!port) + continue; + voltaire_portmap(port); + } + + return; +} + +static int get_line_index(ibnd_node_t *node) +{ + int retval = 3 * (node->ch_slotnum - 1) + node->ch_anafanum; + + if (retval > LINES_MAX_NUM || retval < 1) + IBPANIC("Internal error"); + return retval; +} + +static int get_spine_index(ibnd_node_t *node) +{ + int retval; + + if (is_spine_9288(CONV_NODE_INTERNAL(node)) || is_spine_2012(CONV_NODE_INTERNAL(node))) + retval = 3 * (node->ch_slotnum - 1) + node->ch_anafanum; + else + retval = node->ch_slotnum; + + if (retval > SPINES_MAX_NUM || retval < 1) + IBPANIC("Internal error"); + return retval; +} + +static void insert_line_router(ibnd_node_t *node, ibnd_chassis_t *chassis) +{ + int i = get_line_index(node); + + if (chassis->linenode[i]) + return; /* already filled slot */ + + chassis->linenode[i] = node; + node->chassis = chassis; +} + +static void insert_spine(ibnd_node_t *node, ibnd_chassis_t *chassis) +{ + int i = get_spine_index(node); + + if (chassis->spinenode[i]) + return; /* already filled slot */ + + chassis->spinenode[i] = node; + node->chassis = chassis; +} + +static void pass_on_lines_catch_spines(ibnd_chassis_t *chassis) +{ + ibnd_node_t *node, *remnode; + ibnd_port_t *port; + int i, p; + + for (i = 1; i <= LINES_MAX_NUM; i++) { + node = chassis->linenode[i]; + + if (!(node && is_line(CONV_NODE_INTERNAL(node)))) + continue; /* empty slot or router */ + + for (p = 1; p <= node->numports; p++) { + port = node->ports[p]; + if (!port || port->portnum > 12 || !port->remoteport) + continue; + + remnode = port->remoteport->node; + + if (!CONV_NODE_INTERNAL(remnode)->ch_found) + continue; /* some error - spine not initialized ? 
FIXME */ + insert_spine(remnode, chassis); + } + } +} + +static void pass_on_spines_catch_lines(ibnd_chassis_t *chassis) +{ + ibnd_node_t *node, *remnode; + ibnd_port_t *port; + int i, p; + + for (i = 1; i <= SPINES_MAX_NUM; i++) { + node = chassis->spinenode[i]; + if (!node) + continue; /* empty slot */ + for (p = 1; p <= node->numports; p++) { + port = node->ports[p]; + if (!port || !port->remoteport) + continue; + remnode = port->remoteport->node; + + if (!CONV_NODE_INTERNAL(remnode)->ch_found) + continue; /* some error - line/router not initialized ? FIXME */ + insert_line_router(remnode, chassis); + } + } +} + +/* + Stupid interpolation algorithm... + But nothing to do - have to be compliant with VoltaireSM/NMS +*/ +static void pass_on_spines_interpolate_chguid(ibnd_chassis_t *chassis) +{ + ibnd_node_t *node; + int i; + + for (i = 1; i <= SPINES_MAX_NUM; i++) { + node = chassis->spinenode[i]; + if (!node) + continue; /* skip the empty slots */ + + /* take first guid minus one to be consistent with SM */ + chassis->chassisguid = node->guid - 1; + break; + } +} + +/* + This function fills chassis structure with all nodes + in that chassis + chassis structure = structure of one standalone chassis +*/ +static void build_chassis(struct ibnd_node *node, ibnd_chassis_t *chassis) +{ + int p = 0; + struct ibnd_node *remnode = 0; + ibnd_port_t *port = 0; + + /* we get here with node = chassis_spine */ + insert_spine((ibnd_node_t *)node, chassis); + + /* loop: pass on all ports of node */ + for (p = 1; p <= node->node.numports; p++ ) { + port = node->node.ports[p]; + if (!port || !port->remoteport) + continue; + remnode = CONV_NODE_INTERNAL(port->remoteport->node); + + if (!remnode->ch_found) + continue; /* some error - line or router not initialized ? 
FIXME */ + + insert_line_router(&(remnode->node), chassis); + } + + pass_on_lines_catch_spines(chassis); + /* this pass needed for to catch routers, since routers connected only */ + /* to spines in slot 1 or 4 and we could miss them first time */ + pass_on_spines_catch_lines(chassis); + + /* additional 2 passes needed for to overcome a problem of pure "in-chassis" */ + /* connectivity - extra pass to ensure that all related chips/modules */ + /* inserted into the chassis */ + pass_on_lines_catch_spines(chassis); + pass_on_spines_catch_lines(chassis); + pass_on_spines_interpolate_chguid(chassis); +} + +/*========================================================*/ +/* INTERNAL TO EXTERNAL PORT MAPPING */ +/*========================================================*/ + +/* +Description : On ISR9288/9096 external ports indexing + is not matching the internal ( anafa ) port + indexes. Use this MAP to translate the data you get from + the OpenIB diagnostics (smpquery, ibroute, ibtracert, etc.) + + +Module : sLB-24 + anafa 1 anafa 2 +ext port | 13 14 15 16 17 18 | 19 20 21 22 23 24 +int port | 22 23 24 18 17 16 | 22 23 24 18 17 16 +ext port | 1 2 3 4 5 6 | 7 8 9 10 11 12 +int port | 19 20 21 15 14 13 | 19 20 21 15 14 13 +------------------------------------------------ + +Module : sLB-8 + anafa 1 anafa 2 +ext port | 13 14 15 16 17 18 | 19 20 21 22 23 24 +int port | 24 23 22 18 17 16 | 24 23 22 18 17 16 +ext port | 1 2 3 4 5 6 | 7 8 9 10 11 12 +int port | 21 20 19 15 14 13 | 21 20 19 15 14 13 + +-----------> + anafa 1 anafa 2 +ext port | - - 5 - - 6 | - - 7 - - 8 +int port | 24 23 22 18 17 16 | 24 23 22 18 17 16 +ext port | - - 1 - - 2 | - - 3 - - 4 +int port | 21 20 19 15 14 13 | 21 20 19 15 14 13 +------------------------------------------------ + +Module : sLB-2024 + +ext port | 13 14 15 16 17 18 19 20 21 22 23 24 +A1 int port| 13 14 15 16 17 18 19 20 21 22 23 24 +ext port | 1 2 3 4 5 6 7 8 9 10 11 12 +A2 int port| 13 14 15 16 17 18 19 20 21 22 23 24 
+--------------------------------------------------- + +*/ + +int int2ext_map_slb24[2][25] = { + { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 5, 4, 18, 17, 16, 1, 2, 3, 13, 14, 15 }, + { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 11, 10, 24, 23, 22, 7, 8, 9, 19, 20, 21 } + }; +int int2ext_map_slb8[2][25] = { + { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 6, 6, 6, 1, 1, 1, 5, 5, 5 }, + { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 8, 8, 8, 3, 3, 3, 7, 7, 7 } + }; +int int2ext_map_slb2024[2][25] = { + { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }, + { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 } + }; +/* reference { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }; */ + +/* map internal ports to external ports if appropriate */ +static void +voltaire_portmap(ibnd_port_t *port) +{ + struct ibnd_node *n = CONV_NODE_INTERNAL(port->node); + int portnum = port->portnum; + int chipnum = 0; + ibnd_node_t *node = port->node; + + if (!n->ch_found || !is_line(CONV_NODE_INTERNAL(node)) || (portnum < 13 || portnum > 24)) { + port->ext_portnum = 0; + return; + } + + if (port->node->ch_anafanum < 1 || port->node->ch_anafanum > 2) { + port->ext_portnum = 0; + return; + } + + chipnum = port->node->ch_anafanum - 1; + + if (is_line_24(CONV_NODE_INTERNAL(node))) + port->ext_portnum = int2ext_map_slb24[chipnum][portnum]; + else if (is_line_2024(CONV_NODE_INTERNAL(node))) + port->ext_portnum = int2ext_map_slb2024[chipnum][portnum]; + else + port->ext_portnum = int2ext_map_slb8[chipnum][portnum]; +} + +static void add_chassis(struct ibnd_fabric *fabric) +{ + if (!(fabric->current_chassis = calloc(1, sizeof(ibnd_chassis_t)))) + IBPANIC("out of mem"); + + if (fabric->first_chassis == NULL) { + fabric->first_chassis = fabric->current_chassis; + fabric->last_chassis = fabric->current_chassis; + } else { + fabric->last_chassis->next = fabric->current_chassis; + 
fabric->last_chassis = fabric->current_chassis; + } +} + +static void +add_node_to_chassis(ibnd_chassis_t *chassis, ibnd_node_t *node) +{ + node->chassis = chassis; + node->next_chassis_node = chassis->nodes; + chassis->nodes = node; +} + +/* + Main grouping function + Algorithm: + 1. pass on every Voltaire node + 2. catch spine chip for every Voltaire node + 2.1 build/interpolate chassis around this chip + 2.2 go to 1. + 3. pass on non Voltaire nodes (SystemImageGUID based grouping) + 4. now group non Voltaire nodes by SystemImageGUID + Returns: + Pointer to the first chassis in a NULL terminated list of chassis in + the fabric specified. +*/ +ibnd_chassis_t *group_nodes(struct ibnd_fabric *fabric) +{ + struct ibnd_node *node; + int dist; + int chassisnum = 0; + ibnd_chassis_t *chassis; + + fabric->first_chassis = NULL; + fabric->current_chassis = NULL; + + /* first pass on switches and build for every Voltaire node */ + /* an appropriate chassis record (slotnum and position) */ + /* according to internal connectivity */ + /* not very efficient but clear code so... 
*/ + for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) { + for (node = fabric->nodesdist[dist]; node; node = node->dnext) { + if (mad_get_field(node->node.info, 0, IB_NODE_VENDORID_F) == VTR_VENDOR_ID) + fill_voltaire_chassis_record(node); + } + } + + /* separate every Voltaire chassis from each other and build linked list of them */ + /* algorithm: catch spine and find all surrounding nodes */ + for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) { + for (node = fabric->nodesdist[dist]; node; node = node->dnext) { + if (mad_get_field(node->node.info, 0, IB_NODE_VENDORID_F) != VTR_VENDOR_ID) + continue; + //if (!node->node.chrecord || node->node.chrecord->chassisnum || !is_spine(node)) + if (!node->ch_found + || (node->node.chassis && node->node.chassis->chassisnum) + || !is_spine(node)) + continue; + add_chassis(fabric); + fabric->current_chassis->chassisnum = ++chassisnum; + build_chassis(node, fabric->current_chassis); + } + } + + /* now make pass on nodes for chassis which are not Voltaire */ + /* grouped by common SystemImageGUID */ + for (dist = 0; dist <= fabric->fabric.maxhops_discovered; dist++) { + for (node = fabric->nodesdist[dist]; node; node = node->dnext) { + if (mad_get_field(node->node.info, 0, IB_NODE_VENDORID_F) == VTR_VENDOR_ID) + continue; + if (mad_get_field64(node->node.info, 0, IB_NODE_SYSTEM_GUID_F)) { + chassis = find_chassisguid((ibnd_node_t *)node); + if (chassis) + chassis->nodecount++; + else { + /* Possible new chassis */ + add_chassis(fabric); + fabric->current_chassis->chassisguid = + get_chassisguid((ibnd_node_t *)node); + fabric->current_chassis->nodecount = 1; + } + } + } + } + + /* now, make another pass to see which nodes are part of chassis */ + /* (defined as chassis->nodecount > 1) */ + for (dist = 0; dist <= MAXHOPS; ) { + for (node = fabric->nodesdist[dist]; node; node = node->dnext) { + if (mad_get_field(node->node.info, 0, IB_NODE_VENDORID_F) == VTR_VENDOR_ID) + continue; + if 
(mad_get_field64(node->node.info, 0, IB_NODE_SYSTEM_GUID_F)) { + chassis = find_chassisguid((ibnd_node_t *)node); + if (chassis && chassis->nodecount > 1) { + if (!chassis->chassisnum) + chassis->chassisnum = ++chassisnum; + if (!node->ch_found) { + node->ch_found = 1; + add_node_to_chassis(chassis, (ibnd_node_t *)node); + } + } + } + } + if (dist == fabric->fabric.maxhops_discovered) + dist = MAXHOPS; /* skip to CAs */ + else + dist++; + } + + return (fabric->first_chassis); +} diff --git a/infiniband-diags/libibnetdisc/src/chassis.h b/infiniband-diags/libibnetdisc/src/chassis.h new file mode 100644 index 0000000..16dad49 --- /dev/null +++ b/infiniband-diags/libibnetdisc/src/chassis.h @@ -0,0 +1,85 @@ +/* + * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef _CHASSIS_H_ +#define _CHASSIS_H_ + +#include + +#include "internal.h" + +/*========================================================*/ +/* CHASSIS RECOGNITION SPECIFIC DATA */ +/*========================================================*/ + +/* Device IDs */ +#define VTR_DEVID_IB_FC_ROUTER 0x5a00 +#define VTR_DEVID_IB_IP_ROUTER 0x5a01 +#define VTR_DEVID_ISR9600_SPINE 0x5a02 +#define VTR_DEVID_ISR9600_LEAF 0x5a03 +#define VTR_DEVID_HCA1 0x5a04 +#define VTR_DEVID_HCA2 0x5a44 +#define VTR_DEVID_HCA3 0x6278 +#define VTR_DEVID_SW_6IB4 0x5a05 +#define VTR_DEVID_ISR9024 0x5a06 +#define VTR_DEVID_ISR9288 0x5a07 +#define VTR_DEVID_SLB24 0x5a09 +#define VTR_DEVID_SFB12 0x5a08 +#define VTR_DEVID_SFB4 0x5a0b +#define VTR_DEVID_ISR9024_12 0x5a0c +#define VTR_DEVID_SLB8 0x5a0d +#define VTR_DEVID_RLX_SWITCH_BLADE 0x5a20 +#define VTR_DEVID_ISR9024_DDR 0x5a31 +#define VTR_DEVID_SFB12_DDR 0x5a32 +#define VTR_DEVID_SFB4_DDR 0x5a33 +#define VTR_DEVID_SLB24_DDR 0x5a34 +#define VTR_DEVID_SFB2012 0x5a37 +#define VTR_DEVID_SLB2024 0x5a38 +#define VTR_DEVID_ISR2012 0x5a39 +#define VTR_DEVID_SFB2004 0x5a40 +#define VTR_DEVID_ISR2004 0x5a41 +#define VTR_DEVID_SRB2004 0x5a42 + +/* Vendor IDs (for chassis based systems) */ +#define VTR_VENDOR_ID 0x8f1 /* Voltaire */ +#define TS_VENDOR_ID 0x5ad /* Cisco */ +#define SS_VENDOR_ID 0x66a /* InfiniCon */ +#define XS_VENDOR_ID 0x1397 /* Xsigo */ + +enum ibnd_chassis_type { UNRESOLVED_CT, ISR9288_CT, ISR9096_CT, ISR2012_CT, ISR2004_CT }; +enum ibnd_chassis_slot_type { UNRESOLVED_CS, LINE_CS, SPINE_CS, SRBD_CS }; + +ibnd_chassis_t *group_nodes(struct ibnd_fabric *fabric); + +#endif /* _CHASSIS_H_ */ diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c 
b/infiniband-diags/libibnetdisc/src/ibnetdisc.c new file mode 100644 index 0000000..bf7c2a7 --- /dev/null +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -0,0 +1,700 @@ +/* + * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. + * Copyright (c) 2008 Lawrence Livermore National Laboratory + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include +#include + +#include "internal.h" +#include "chassis.h" + +static int timeout_ms = 2000; +static int show_progress = 0; + +void +decode_port_info(ibnd_port_t *port) +{ + port->base_lid = mad_get_field(port->info, 0, IB_PORT_LID_F); + port->lmc = mad_get_field(port->info, 0, IB_PORT_LMC_F); +} + +static int +get_port_info(struct ibnd_fabric *fabric, struct ibnd_port *port, + int portnum, ib_portid_t *portid) +{ + char width[64], speed[64]; + int iwidth, ispeed; + + port->port.portnum = portnum; + if (!smp_query_via(port->port.info, portid, IB_ATTR_PORT_INFO, portnum, timeout_ms, + fabric->ibmad_port)) + return -1; + + decode_port_info(&(port->port)); + iwidth = mad_get_field(port->port.info, 0, + IB_PORT_LINK_WIDTH_ACTIVE_F); + ispeed = mad_get_field(port->port.info, 0, + IB_PORT_LINK_SPEED_ACTIVE_F); + IBND_DEBUG("portid %s portnum %d: base lid %d state %d physstate %d %s %s\n", + portid2str(portid), portnum, port->port.base_lid, + mad_get_field(port->port.info, 0, IB_PORT_STATE_F), + mad_get_field(port->port.info, 0, IB_PORT_PHYS_STATE_F), + mad_dump_val(IB_PORT_LINK_WIDTH_ACTIVE_F, width, 64, &iwidth), + mad_dump_val(IB_PORT_LINK_SPEED_ACTIVE_F, speed, 64, &ispeed)); + return 0; +} + +/* + * Returns -1 if error. + */ +static int +query_node_info(struct ibnd_fabric *fabric, struct ibnd_node *node, ib_portid_t *portid) +{ + if (!smp_query_via(&(node->node.info), portid, IB_ATTR_NODE_INFO, 0, timeout_ms, + fabric->ibmad_port)) + return -1; + + /* decode just a couple of fields for quicker reference. 
*/ + mad_decode_field(node->node.info, IB_NODE_GUID_F, &(node->node.guid)); + mad_decode_field(node->node.info, IB_NODE_TYPE_F, &(node->node.type)); + mad_decode_field(node->node.info, IB_NODE_NPORTS_F, + &(node->node.numports)); + + return (0); +} + +/* + * Returns 0 if non switch node is found, 1 if switch is found, -1 if error. + */ +static int +query_node(struct ibnd_fabric *fabric, struct ibnd_node *inode, + struct ibnd_port *iport, ib_portid_t *portid) +{ + ibnd_node_t *node = &(inode->node); + ibnd_port_t *port = &(iport->port); + void *nd = inode->node.nodedesc; + + if (query_node_info(fabric, inode, portid)) + return -1; + + port->portnum = mad_get_field(node->info, 0, IB_NODE_LOCAL_PORT_F); + port->guid = mad_get_field64(node->info, 0, IB_NODE_PORT_GUID_F); + + if (!smp_query_via(nd, portid, IB_ATTR_NODE_DESC, 0, timeout_ms, + fabric->ibmad_port)) + return -1; + + if (!smp_query_via(port->info, portid, IB_ATTR_PORT_INFO, 0, timeout_ms, + fabric->ibmad_port)) + return -1; + decode_port_info(port); + + if (node->type != IB_NODE_SWITCH) + return 0; + + node->smalid = port->base_lid; + node->smalmc = port->lmc; + + /* after we have the sma information find out the real PortInfo for this port */ + if (!smp_query_via(port->info, portid, IB_ATTR_PORT_INFO, port->portnum, timeout_ms, + fabric->ibmad_port)) + return -1; + decode_port_info(port); + + if (!smp_query_via(node->switchinfo, portid, IB_ATTR_SWITCH_INFO, 0, timeout_ms, + fabric->ibmad_port)) + node->smaenhsp0 = 0; /* assume base SP0 */ + else + mad_decode_field(node->switchinfo, IB_SW_ENHANCED_PORT0_F, &node->smaenhsp0); + + IBND_DEBUG("portid %s: got switch node %" PRIx64 " '%s'\n", + portid2str(portid), node->guid, node->nodedesc); + return 0; +} + +static int +add_port_to_dpath(ib_dr_path_t *path, int nextport) +{ + if (path->cnt+2 >= sizeof(path->p)) + return -1; + ++path->cnt; + path->p[path->cnt] = nextport; + return path->cnt; +} + +static int +extend_dpath(struct ibnd_fabric *f, ib_dr_path_t 
*path, int nextport) +{ + int rc = add_port_to_dpath(path, nextport); + if ((rc != -1) && (path->cnt > f->fabric.maxhops_discovered)) + f->fabric.maxhops_discovered = path->cnt; + return (rc); +} + +static void +dump_endnode(ib_portid_t *path, char *prompt, + struct ibnd_node *node, struct ibnd_port *port) +{ + char type[64]; + if (!show_progress) + return; + + mad_dump_node_type(type, 64, &(node->node.type), sizeof(int)); + + printf("%s -> %s %s {%016" PRIx64 "} portnum %d base lid %d-%d\"%s\"\n", + portid2str(path), prompt, type, + node->node.guid, + node->node.type == IB_NODE_SWITCH ? 0 : port->port.portnum, + port->port.base_lid, port->port.base_lid + (1 << port->port.lmc) - 1, + node->node.nodedesc); +} + +static struct ibnd_node * +find_existing_node(struct ibnd_fabric *fabric, struct ibnd_node *new) +{ + int hash = HASHGUID(new->node.guid) % HTSZ; + struct ibnd_node *node; + + for (node = fabric->nodestbl[hash]; node; node = node->htnext) + if (node->node.guid == new->node.guid) + return node; + + return NULL; +} + +ibnd_node_t * +ibnd_find_node_guid(ibnd_fabric_t *fabric, uint64_t guid) +{ + struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); + int hash = HASHGUID(guid) % HTSZ; + struct ibnd_node *node; + + for (node = f->nodestbl[hash]; node; node = node->htnext) + if (node->node.guid == guid) + return (ibnd_node_t *)node; + + return NULL; +} + +ibnd_node_t * +ibnd_update_node(ibnd_node_t *node) +{ + char portinfo_port0[sizeof (ib_port_info_t)]; + void *nd = node->nodedesc; + int p = 0; + struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(node->fabric); + struct ibnd_node *n = CONV_NODE_INTERNAL(node); + + if (query_node_info(f, n, &(n->node.path_portid))) + return (NULL); + + if (!smp_query_via(nd, &(n->node.path_portid), IB_ATTR_NODE_DESC, 0, timeout_ms, + f->ibmad_port)) + return (NULL); + + /* update all the port info's */ + for (p = 1; p <= n->node.numports; p++) + if (n->node.ports && n->node.ports[p]) + get_port_info(f, CONV_PORT_INTERNAL(n->node.ports[p]), p, &(n->node.path_portid)); +
+ if (n->node.type != IB_NODE_SWITCH) + goto done; + + if (!smp_query_via(portinfo_port0, &(n->node.path_portid), IB_ATTR_PORT_INFO, 0, timeout_ms, + f->ibmad_port)) + return (NULL); + + n->node.smalid = mad_get_field(portinfo_port0, 0, IB_PORT_LID_F); + n->node.smalmc = mad_get_field(portinfo_port0, 0, IB_PORT_LMC_F); + + if (!smp_query_via(node->switchinfo, &(n->node.path_portid), IB_ATTR_SWITCH_INFO, 0, timeout_ms, + f->ibmad_port)) + node->smaenhsp0 = 0; /* assume base SP0 */ + else + mad_decode_field(node->switchinfo, IB_SW_ENHANCED_PORT0_F, &n->node.smaenhsp0); + +done: + return (node); +} + +ibnd_node_t * +ibnd_find_node_dr(ibnd_fabric_t *fabric, char *dr_str) +{ + struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); + int i = 0; + ibnd_node_t *rc = f->fabric.from_node; + ib_dr_path_t path; + + if (str2drpath(&path, dr_str, 0, 0) == -1) { + return (NULL); + } + + for (i = 0; i <= path.cnt; i++) { + ibnd_port_t *remote_port = NULL; + if (path.p[i] == 0) + continue; + if (!rc->ports) + return (NULL); + + remote_port = rc->ports[path.p[i]]->remoteport; + if (!remote_port) + return (NULL); + + rc = remote_port->node; + } + + return (rc); +} + +static void +add_to_nodeguid_hash(struct ibnd_node *node, struct ibnd_node *hash[]) +{ + int hash_idx = HASHGUID(node->node.guid) % HTSZ; + + node->htnext = hash[hash_idx]; + hash[hash_idx] = node; +} + +static void +add_to_portguid_hash(struct ibnd_port *port, struct ibnd_port *hash[]) +{ + int hash_idx = HASHGUID(port->port.guid) % HTSZ; + + port->htnext = hash[hash_idx]; + hash[hash_idx] = port; +} + +static void +add_to_type_list(struct ibnd_node*node, struct ibnd_fabric *fabric) +{ + switch (node->node.type) { + case IB_NODE_CA: + node->type_next = fabric->ch_adapters; + fabric->ch_adapters = node; + break; + case IB_NODE_SWITCH: + node->type_next = fabric->switches; + fabric->switches = node; + break; + case IB_NODE_ROUTER: + node->type_next = fabric->routers; + fabric->routers = node; + break; + } +} + +static void 
+add_to_nodedist(struct ibnd_node *node, struct ibnd_fabric *fabric) +{ + int dist = node->node.dist; + if (node->node.type != IB_NODE_SWITCH) + dist = MAXHOPS; /* special Ca list */ + + node->dnext = fabric->nodesdist[dist]; + fabric->nodesdist[dist] = node; +} + + +static struct ibnd_node * +create_node(struct ibnd_fabric *fabric, struct ibnd_node *temp, ib_portid_t *path, int dist) +{ + struct ibnd_node *node; + + node = malloc(sizeof(*node)); + if (!node) { + IBPANIC("OOM: node creation failed\n"); + return NULL; + } + + memcpy(node, temp, sizeof(*node)); + node->node.dist = dist; + node->node.path_portid = *path; + node->node.fabric = (ibnd_fabric_t *)fabric; + + add_to_nodeguid_hash(node, fabric->nodestbl); + + /* add this to the all nodes list */ + node->node.next = fabric->fabric.nodes; + fabric->fabric.nodes = (ibnd_node_t *)node; + + add_to_type_list(node, fabric); + add_to_nodedist(node, fabric); + + return node; +} + +static struct ibnd_port * +find_existing_port_node(struct ibnd_node *node, struct ibnd_port *port) +{ + if (port->port.portnum > node->node.numports || node->node.ports == NULL ) + return (NULL); + + return (CONV_PORT_INTERNAL(node->node.ports[port->port.portnum])); +} + +static struct ibnd_port * +add_port_to_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_port *temp) +{ + struct ibnd_port *port; + + port = malloc(sizeof(*port)); + if (!port) + return NULL; + + memcpy(port, temp, sizeof(*port)); + port->port.node = (ibnd_node_t *)node; + port->port.ext_portnum = 0; + + if (node->node.ports == NULL) { + node->node.ports = calloc(sizeof(*node->node.ports), node->node.numports + 1); + if (!node->node.ports) { + IBND_ERROR("Failed to allocate the ports array\n"); + return (NULL); + } + } + + node->node.ports[temp->port.portnum] = (ibnd_port_t *)port; + + add_to_portguid_hash(port, fabric->portstbl); + return port; +} + +static void +link_ports(struct ibnd_node *node, struct ibnd_port *port, + struct ibnd_node *remotenode, 
struct ibnd_port *remoteport) +{ + IBND_DEBUG("linking: 0x%" PRIx64 " %p->%p:%u and 0x%" PRIx64 " %p->%p:%u\n", + node->node.guid, node, port, port->port.portnum, + remotenode->node.guid, remotenode, + remoteport, remoteport->port.portnum); + if (port->port.remoteport) + port->port.remoteport->remoteport = NULL; + if (remoteport->port.remoteport) + remoteport->port.remoteport->remoteport = NULL; + port->port.remoteport = (ibnd_port_t *)remoteport; + remoteport->port.remoteport = (ibnd_port_t *)port; +} + +static int +get_remote_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_port *port, ib_portid_t *path, + int portnum, int dist) +{ + struct ibnd_node node_buf; + struct ibnd_port port_buf; + struct ibnd_node *remotenode, *oldnode; + struct ibnd_port *remoteport, *oldport; + + memset(&node_buf, 0, sizeof(node_buf)); + memset(&port_buf, 0, sizeof(port_buf)); + + IBND_DEBUG("handle node %p port %p:%d dist %d\n", node, port, portnum, dist); + + if (mad_get_field(port->port.info, 0, IB_PORT_PHYS_STATE_F) + != IB_PORT_PHYS_STATE_LINKUP) + return -1; + + if (extend_dpath(fabric, &path->drpath, portnum) < 0) + return -1; + + if (query_node(fabric, &node_buf, &port_buf, path)) { + IBWARN("NodeInfo on %s failed, skipping port", + portid2str(path)); + path->drpath.cnt--; /* restore path */ + return -1; + } + + oldnode = find_existing_node(fabric, &node_buf); + if (oldnode) + remotenode = oldnode; + else if (!(remotenode = create_node(fabric, &node_buf, path, dist + 1))) + IBPANIC("no memory"); + + oldport = find_existing_port_node(remotenode, &port_buf); + if (oldport) { + remoteport = oldport; + } else if (!(remoteport = add_port_to_node(fabric, remotenode, &port_buf))) + IBPANIC("no memory"); + + dump_endnode(path, oldnode ? 
"known remote" : "new remote", + remotenode, remoteport); + + link_ports(node, port, remotenode, remoteport); + + path->drpath.cnt--; /* restore path */ + return 0; +} + +static void * +ibnd_init_port(char *dev_name, int dev_port) +{ + int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; + + /* Crank up the mad lib */ + return (mad_rpc_open_port(dev_name, dev_port, mgmt_classes, 2)); +} + +ibnd_fabric_t * +ibnd_discover_fabric(char *dev_name, int dev_port, int timeout_ms, + ib_portid_t *from, int hops) +{ + struct ibnd_fabric *fabric = NULL; + ib_portid_t my_portid = {0}; + struct ibnd_node node_buf; + struct ibnd_port port_buf; + struct ibnd_node *node; + struct ibnd_port *port; + int i; + int dist = 0; + ib_portid_t *path; + int max_hops = MAXHOPS-1; /* default find everything */ + + /* if not everything how much? */ + if (hops >= 0) { + max_hops = hops; + } + + /* If not specified start from "my" port */ + if (!from) { + from = &my_portid; + } + + fabric = malloc(sizeof(*fabric)); + + if (!fabric) { + IBPANIC("OOM: failed to malloc ibnd_fabric_t\n"); + return (NULL); + } + + memset(fabric, 0, sizeof(*fabric)); + + fabric->ibmad_port = ibnd_init_port(dev_name, dev_port); + if (!fabric->ibmad_port) { + IBPANIC("OOM: failed to open \"%s\" port %d\n", + dev_name, dev_port); + goto error; + } + + IBND_DEBUG("from %s\n", portid2str(from)); + + memset(&node_buf, 0, sizeof(node_buf)); + memset(&port_buf, 0, sizeof(port_buf)); + + if (query_node(fabric, &node_buf, &port_buf, from)) { + IBWARN("can't reach node %s\n", portid2str(from)); + goto error; + } + + node = create_node(fabric, &node_buf, from, 0); + if (!node) + goto error; + + fabric->fabric.from_node = (ibnd_node_t *)node; + + port = add_port_to_node(fabric, node, &port_buf); + if (!port) + IBPANIC("out of memory"); + + if (node->node.type != IB_NODE_SWITCH && + get_remote_node(fabric, node, port, from, + mad_get_field(node->node.info, 0, IB_NODE_LOCAL_PORT_F), + 0) < 0) + return ((ibnd_fabric_t *)fabric); 
+ + for (dist = 0; dist <= max_hops; dist++) { + + for (node = fabric->nodesdist[dist]; node; node = node->dnext) { + + path = &node->node.path_portid; + + IBND_DEBUG("dist %d node %p\n", dist, node); + dump_endnode(path, "processing", node, port); + + for (i = 1; i <= node->node.numports; i++) { + if (i == mad_get_field(node->node.info, 0, + IB_NODE_LOCAL_PORT_F)) + continue; + + if (get_port_info(fabric, &port_buf, i, path)) { + IBWARN("can't reach node %s port %d", portid2str(path), i); + continue; + } + + port = find_existing_port_node(node, &port_buf); + if (port) + continue; + + port = add_port_to_node(fabric, node, &port_buf); + if (!port) + IBPANIC("out of memory"); + + /* If switch, set port GUID to node port GUID */ + if (node->node.type == IB_NODE_SWITCH) { + port->port.guid = mad_get_field64(node->node.info, + 0, IB_NODE_PORT_GUID_F); + } + + get_remote_node(fabric, node, port, path, i, dist); + } + } + } + + fabric->fabric.chassis = group_nodes(fabric); + + return ((ibnd_fabric_t *)fabric); +error: + free(fabric); + return (NULL); +} + +static void +destroy_node(struct ibnd_node *node) +{ + int p = 0; + + for (p = 0; p <= node->node.numports; p++) { + free(node->node.ports[p]); + } + free(node->node.ports); + free(node); +} + +void +ibnd_destroy_fabric(ibnd_fabric_t *fabric) +{ + struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); + int dist = 0; + struct ibnd_node *node = NULL; + struct ibnd_node *next = NULL; + ibnd_chassis_t *ch, *ch_next; + + ch = f->first_chassis; + while (ch) { + ch_next = ch->next; + free(ch); + ch = ch_next; + } + for (dist = 0; dist <= MAXHOPS; dist++) { + node = f->nodesdist[dist]; + while (node) { + next = node->dnext; + destroy_node(node); + node = next; + } + } + if (f->ibmad_port) + mad_rpc_close_port(f->ibmad_port); + free(f); +} + +void +ibnd_debug(int i) +{ + if (i) { + ibdebug++; + madrpc_show_errors(1); + umad_debug(i); + } else { + ibdebug = 0; + madrpc_show_errors(0); + umad_debug(0); + } +} + +void 
+ibnd_show_progress(int i) +{ + show_progress = i; +} + +void +ibnd_iter_nodes(ibnd_fabric_t *fabric, + ibnd_iter_node_func_t func, + void *user_data) +{ + ibnd_node_t *cur = NULL; + + for (cur = fabric->nodes; cur; cur = cur->next) { + func(cur, user_data); + } +} + + +void +ibnd_iter_nodes_type(ibnd_fabric_t *fabric, + ibnd_iter_node_func_t func, + int node_type, + void *user_data) +{ + struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(fabric); + struct ibnd_node *list = NULL; + struct ibnd_node *cur = NULL; + + switch (node_type) { + case IB_NODE_SWITCH: + list = f->switches; + break; + case IB_NODE_CA: + list = f->ch_adapters; + break; + case IB_NODE_ROUTER: + list = f->routers; + break; + default: + IBND_DEBUG("Invalid node_type specified %d\n", node_type); + break; + } + + for (cur = list; cur; cur = cur->type_next) { + func((ibnd_node_t *)cur, user_data); + } +} + diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h new file mode 100644 index 0000000..afed25e --- /dev/null +++ b/infiniband-diags/libibnetdisc/src/internal.h @@ -0,0 +1,95 @@ +/* + * Copyright (c) 2008 Lawrence Livermore National Laboratory + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +/** ========================================================================= + * Define the internal data structures. + */ + +#ifndef _INTERNAL_H_ +#define _INTERNAL_H_ + +#include + +#define MAXHOPS 63 + +#define IBND_DEBUG(fmt, ...) \ + if (ibdebug) { \ + printf("%s:%u; " fmt, __FILE__, __LINE__, ## __VA_ARGS__); \ + } +#define IBND_ERROR(fmt, ...) 
\ + fprintf(stderr, "%s:%u; " fmt, __FILE__, __LINE__, ## __VA_ARGS__) + +struct ibnd_node { + /* This member MUST BE FIRST */ + ibnd_node_t node; + + /* internal use only */ + unsigned char ch_found; + struct ibnd_node *htnext; /* hash table list */ + struct ibnd_node *dnext; /* nodesdist next */ + struct ibnd_node *type_next; /* next based on type */ +}; +#define CONV_NODE_INTERNAL(node) ((struct ibnd_node *)node) + +struct ibnd_port { + /* This member MUST BE FIRST */ + ibnd_port_t port; + + /* internal use only */ + struct ibnd_port *htnext; +}; +#define CONV_PORT_INTERNAL(port) ((struct ibnd_port *)port) + +/* HASH table defines */ +#define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103))) +#define HTSZ 137 + +struct ibnd_fabric { + /* This member MUST BE FIRST */ + ibnd_fabric_t fabric; + + /* internal use only */ + void *ibmad_port; + struct ibnd_node *nodestbl[HTSZ]; + struct ibnd_port *portstbl[HTSZ]; + struct ibnd_node *nodesdist[MAXHOPS+1]; + ibnd_chassis_t *first_chassis; + ibnd_chassis_t *current_chassis; + ibnd_chassis_t *last_chassis; + struct ibnd_node *switches; + struct ibnd_node *ch_adapters; + struct ibnd_node *routers; +}; +#define CONV_FABRIC_INTERNAL(fabric) ((struct ibnd_fabric *)fabric) + +#endif /* _INTERNAL_H_ */ diff --git a/infiniband-diags/libibnetdisc/src/libibnetdisc.map b/infiniband-diags/libibnetdisc/src/libibnetdisc.map new file mode 100644 index 0000000..5e8c315 --- /dev/null +++ b/infiniband-diags/libibnetdisc/src/libibnetdisc.map @@ -0,0 +1,27 @@ +IBNETDISC_1.0 { + global: + ibnd_debug; + ibnd_show_progress; + ibnd_discover_fabric; + ibnd_cache_fabric; + ibnd_read_fabric; + ibnd_destroy_fabric; + ibnd_find_node_guid; + ibnd_update_node; + ibnd_find_node_dr; + ibnd_linkwidth_str; + ibnd_linkspeed_str; + ibnd_node_type_str; + ibnd_node_type_str_short; + ibnd_is_xsigo_guid; + ibnd_is_xsigo_tca; + ibnd_is_xsigo_hca; + ibnd_get_chassis_guid; + ibnd_get_chassis_type; + ibnd_get_chassis_slot_str; 
+ ibnd_linkstate_str; + ibnd_physstate_str; + ibnd_iter_nodes; + ibnd_iter_nodes_type; + local: *; +}; diff --git a/infiniband-diags/libibnetdisc/test/testleaks.c b/infiniband-diags/libibnetdisc/test/testleaks.c new file mode 100644 index 0000000..1fabaac --- /dev/null +++ b/infiniband-diags/libibnetdisc/test/testleaks.c @@ -0,0 +1,179 @@ +/* + * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. + * Copyright (c) 2008 Lawrence Livermore National Lab. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +char *argv0 = "iblinkinfotest"; +static FILE *f; + +static int timeout_ms = 500; + +void +usage(void) +{ + fprintf(stderr, + "Usage: %s [-hclp -S -D -C -P ]\n" + " Report link speed and connection for each port of each switch which is active\n" + " -h This help message\n" + " -i Number of iterations to run (default -1 == infinite)\n" + + " -S output only the node specified by guid\n" + " -D print only node specified by \n" + " -f specify node to start \"from\"\n" + " -n Number of hops to include away from specified node\n" + + " -t timeout for any single fabric query\n" + " -s show errors\n" + + " -C use selected Channel Adaptor name for queries\n" + " -P use selected channel adaptor port for queries\n" + " --debug print debug messages\n" + , + argv0); + exit(-1); +} + +int +main(int argc, char **argv) +{ + char *ca = 0; + int ca_port = 0; + ibnd_fabric_t *fabric = NULL; + uint64_t guid = 0; + char *dr_path = NULL; + char *from = NULL; + int hops = 0; + ib_portid_t port_id; + int iters = -1; + + static char const str_opts[] = "S:D:n:C:P:t:shuf:i:"; + static const struct option long_opts[] = { + { "S", 1, 0, 'S'}, + { "D", 1, 0, 'D'}, + { "num-hops", 1, 0, 'n'}, + { "ca-name", 1, 0, 'C'}, + { "ca-port", 1, 0, 'P'}, + { "timeout", 1, 0, 't'}, + { "show", 0, 0, 's'}, + { "help", 0, 0, 'h'}, + { "usage", 0, 0, 'u'}, + { "debug", 0, 0, 2}, + { "from", 1, 0, 'f'}, + { "iters", 1, 0, 'i'}, + { } + }; + + f = stdout; + + argv0 = argv[0]; + + while (1) { + int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); + if ( ch == -1 ) + break; + switch(ch) { + case 2: + ibnd_debug(1); + break; + case 'f': + from = strdup(optarg); + break; + case 'C': + ca = strdup(optarg); + break; + case 'P': + ca_port = strtoul(optarg, 0, 0); + break; + case 'D': + dr_path = 
strdup(optarg); + break; + case 'n': + hops = (int)strtol(optarg, NULL, 0); + break; + case 'i': + iters = (int)strtol(optarg, NULL, 0); + break; + case 't': + timeout_ms = strtoul(optarg, 0, 0); + break; + case 'S': + guid = (uint64_t)strtoull(optarg, 0, 0); + break; + default: + usage(); + break; + } + } + argc -= optind; + argv += optind; + + while (iters == -1 || iters-- > 0) { + if (from) { + /* only scan part of the fabric */ + str2drpath(&(port_id.drpath), from, 0, 0); + if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, + &port_id, hops)) == NULL) { + fprintf(stderr, "discover failed\n"); + exit(1); + } + guid = 0; + } else { + if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, NULL, -1)) == NULL) { + fprintf(stderr, "discover failed\n"); + exit(1); + } + } + + ibnd_destroy_fabric(fabric); + } + + exit(0); +} -- 1.5.4.5 From weiny2 at llnl.gov Fri Apr 3 15:43:01 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 3 Apr 2009 15:43:01 -0700 Subject: [ofa-general] [PATCH v3 3/3] Convert ibnetdiscover to use new ibnetdisc library. Message-ID: <20090403154301.f656e7a4.weiny2@llnl.gov> >From e506ac4d6accefb49b89811cc9dd77775ad481f7 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Fri, 3 Apr 2009 15:28:29 -0700 Subject: [PATCH] Convert ibnetdiscover to use new ibnetdisc library. 
All other functionality is preserved Signed-off-by: Ira Weiny --- infiniband-diags/Makefile.am | 5 +- infiniband-diags/include/grouping.h | 113 --- infiniband-diags/libibnetdisc/src/chassis.c | 20 +- infiniband-diags/libibnetdisc/src/ibnetdisc.c | 5 +- infiniband-diags/man/ibnetdiscover.8 | 10 +- infiniband-diags/src/grouping.c | 785 -------------------- infiniband-diags/src/ibnetdiscover.c | 974 +++++++++---------------- 7 files changed, 345 insertions(+), 1567 deletions(-) delete mode 100644 infiniband-diags/include/grouping.h delete mode 100644 infiniband-diags/src/grouping.c diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am index b480a4a..19b992c 100644 --- a/infiniband-diags/Makefile.am +++ b/infiniband-diags/Makefile.am @@ -41,7 +41,8 @@ LDADD = libcommon.a libcommon_a_SOURCES = src/ibdiag_common.c src_ibaddr_SOURCES = src/ibaddr.c -src_ibnetdiscover_SOURCES = src/ibnetdiscover.c src/grouping.c +src_ibnetdiscover_SOURCES = src/ibnetdiscover.c +src_ibnetdiscover_LDFLAGS = -L$(top_srcdir)/libibnetdisc -libnetdisc src_ibping_SOURCES = src/ibping.c src_ibportstate_SOURCES = src/ibportstate.c src_ibroute_SOURCES = src/ibroute.c @@ -57,7 +58,7 @@ src_ibsendtrap_SOURCES = src/ibsendtrap.c src_vendstat_SOURCES = src/vendstat.c src_mcm_rereg_test_SOURCES = src/mcm_rereg_test.c src_iblinkinfo_SOURCES = src/iblinkinfo.c -src_iblinkinfo_LDADD = -libnetdisc +src_iblinkinfo_LDFLAGS = -L$(top_srcdir)/libibnetdisc -libnetdisc man_MANS = man/ibaddr.8 man/ibcheckerrors.8 man/ibcheckerrs.8 \ man/ibchecknet.8 man/ibchecknode.8 man/ibcheckport.8 \ diff --git a/infiniband-diags/include/grouping.h b/infiniband-diags/include/grouping.h deleted file mode 100644 index 811e372..0000000 --- a/infiniband-diags/include/grouping.h +++ /dev/null @@ -1,113 +0,0 @@ -/* - * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. - * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. 
- * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. 
- * - */ - -#ifndef _GROUPING_H_ -#define _GROUPING_H_ - -/*========================================================*/ -/* FABRIC SCANNER SPECIFIC DATA */ -/*========================================================*/ - -#define SPINES_MAX_NUM 12 -#define LINES_MAX_NUM 36 - -typedef struct ChassisList ChassisList; -typedef struct AllChassisList AllChassisList; - -struct ChassisList { - ChassisList *next; - uint64_t chassisguid; - unsigned char chassisnum; - unsigned char chassistype; - unsigned int nodecount; /* used for grouping by SystemImageGUID */ - Node *spinenode[SPINES_MAX_NUM + 1]; - Node *linenode[LINES_MAX_NUM + 1]; -}; - -struct AllChassisList { - ChassisList *first; - ChassisList *current; - ChassisList *last; -}; - -/*========================================================*/ -/* CHASSIS RECOGNITION SPECIFIC DATA */ -/*========================================================*/ - -/* Device IDs */ -#define VTR_DEVID_IB_FC_ROUTER 0x5a00 -#define VTR_DEVID_IB_IP_ROUTER 0x5a01 -#define VTR_DEVID_ISR9600_SPINE 0x5a02 -#define VTR_DEVID_ISR9600_LEAF 0x5a03 -#define VTR_DEVID_HCA1 0x5a04 -#define VTR_DEVID_HCA2 0x5a44 -#define VTR_DEVID_HCA3 0x6278 -#define VTR_DEVID_SW_6IB4 0x5a05 -#define VTR_DEVID_ISR9024 0x5a06 -#define VTR_DEVID_ISR9288 0x5a07 -#define VTR_DEVID_SLB24 0x5a09 -#define VTR_DEVID_SFB12 0x5a08 -#define VTR_DEVID_SFB4 0x5a0b -#define VTR_DEVID_ISR9024_12 0x5a0c -#define VTR_DEVID_SLB8 0x5a0d -#define VTR_DEVID_RLX_SWITCH_BLADE 0x5a20 -#define VTR_DEVID_ISR9024_DDR 0x5a31 -#define VTR_DEVID_SFB12_DDR 0x5a32 -#define VTR_DEVID_SFB4_DDR 0x5a33 -#define VTR_DEVID_SLB24_DDR 0x5a34 -#define VTR_DEVID_SFB2012 0x5a37 -#define VTR_DEVID_SLB2024 0x5a38 -#define VTR_DEVID_ISR2012 0x5a39 -#define VTR_DEVID_SFB2004 0x5a40 -#define VTR_DEVID_ISR2004 0x5a41 -#define VTR_DEVID_SRB2004 0x5a42 - -enum ChassisType { UNRESOLVED_CT, ISR9288_CT, ISR9096_CT, ISR2012_CT, ISR2004_CT }; -enum ChassisSlot { UNRESOLVED_CS, LINE_CS, SPINE_CS, SRBD_CS }; - 
-/*========================================================*/ -/* External interface */ -/*========================================================*/ - -ChassisList *group_nodes(); -char *portmapstring(Port *port); -char *get_chassis_type(unsigned char chassistype); -char *get_chassis_slot(unsigned char chassisslot); -uint64_t get_chassis_guid(unsigned char chassisnum); - -int is_xsigo_guid(uint64_t guid); -int is_xsigo_tca(uint64_t guid); -int is_xsigo_hca(uint64_t guid); - -#endif /* _GROUPING_H_ */ diff --git a/infiniband-diags/libibnetdisc/src/chassis.c b/infiniband-diags/libibnetdisc/src/chassis.c index a25d710..6b4930e 100644 --- a/infiniband-diags/libibnetdisc/src/chassis.c +++ b/infiniband-diags/libibnetdisc/src/chassis.c @@ -292,19 +292,19 @@ int is_chassis_switch(struct ibnd_node *n) } /* these structs help find Line (Anafa) slot number while using spine portnum */ -int line_slot_2_sfb4[25] = { 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 }; -int anafa_line_slot_2_sfb4[25] = { 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2 }; -int line_slot_2_sfb12[25] = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10, 10, 11, 11, 12, 12 }; -int anafa_line_slot_2_sfb12[25] = { 0, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 }; +char line_slot_2_sfb4[25] = { 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 }; +char anafa_line_slot_2_sfb4[25] = { 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2 }; +char line_slot_2_sfb12[25] = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10, 10, 11, 11, 12, 12 }; +char anafa_line_slot_2_sfb12[25] = { 0, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 }; /* IPR FCR modules connectivity while using sFB4 port as reference */ -int ipr_slot_2_sfb4_port[25] = { 0, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1 }; +char ipr_slot_2_sfb4_port[25] = 
{ 0, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1 }; /* these structs help find Spine (Anafa) slot number while using spine portnum */ -int spine12_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -int anafa_spine12_slot_2_slb[25]= { 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -int spine4_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -int anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; +char spine12_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; +char anafa_spine12_slot_2_slb[25]= { 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; +char spine4_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; +char anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; /* reference { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }; */ static void get_sfb_slot(struct ibnd_node *node, ibnd_port_t *lineport) @@ -337,7 +337,7 @@ static void get_sfb_slot(struct ibnd_node *node, ibnd_port_t *lineport) static void get_router_slot(struct ibnd_node *node, ibnd_port_t *spineport) { ibnd_node_t *n = (ibnd_node_t *)node; - int guessnum = 0; + uint64_t guessnum = 0; node->ch_found = 1; diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index bf7c2a7..479bae7 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -150,6 +150,9 @@ query_node(struct ibnd_fabric *fabric, struct ibnd_node *inode, return -1; decode_port_info(port); + port->base_lid = node->smalid; /* LID is still defined by port 0 */ + port->lmc = node->smalmc; + if 
(!smp_query_via(node->switchinfo, portid, IB_ATTR_SWITCH_INFO, 0, timeout_ms, fabric->ibmad_port)) node->smaenhsp0 = 0; /* assume base SP0 */ @@ -167,7 +170,7 @@ add_port_to_dpath(ib_dr_path_t *path, int nextport) if (path->cnt+2 >= sizeof(path->p)) return -1; ++path->cnt; - path->p[path->cnt] = nextport; + path->p[path->cnt] = (uint8_t) nextport; return path->cnt; } diff --git a/infiniband-diags/man/ibnetdiscover.8 b/infiniband-diags/man/ibnetdiscover.8 index 958efa9..768d392 100644 --- a/infiniband-diags/man/ibnetdiscover.8 +++ b/infiniband-diags/man/ibnetdiscover.8 @@ -5,7 +5,7 @@ ibnetdiscover \- discover InfiniBand topology .SH SYNOPSIS .B ibnetdiscover -[\-d(ebug)] [\-e(rr_show)] [\-v(erbose)] [\-s(how)] [\-l(ist)] [\-g(rouping)] [\-H(ca_list)] [\-S(witch_list)] [\-R(outer_list)] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms] [\-V(ersion)] [\--node-name-map ] [\-p(orts)] [\-h(elp)] [] +[\-d(ebug)] [\-s(how)] [\-l(ist)] [\-g(rouping)] [\-H(ca_list)] [\-S(witch_list)] [\-R(outer_list)] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms] [\-V(ersion)] [\--node-name-map ] [\-p(orts)] [\-h(elp)] [] .SH DESCRIPTION .PP @@ -37,7 +37,7 @@ List of connected switches List of connected routers .TP \fB\-s\fR, \fB\-\-show\fR -Show more information +Show progress information during discovery. .TP \fB\-\-node\-name\-map\fR Specify a node name map. The node name map file maps GUIDs to more user friendly @@ -57,15 +57,9 @@ using the util_name -h syntax. # Debugging flags .PP \-d raise the IB debugging level. - May be used several times (-ddd or -d -d -d). -.PP -\-e show send and receive errors (timeouts and others) .PP \-h show the usage message .PP -\-v increase the application verbosity level. - May be used several times (-vv or -v -v -v) -.PP \-V show the version info. 
# Other common flags: diff --git a/infiniband-diags/src/grouping.c b/infiniband-diags/src/grouping.c deleted file mode 100644 index 0c30726..0000000 --- a/infiniband-diags/src/grouping.c +++ /dev/null @@ -1,785 +0,0 @@ -/* - * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. - * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. 
- * - */ - -/*========================================================*/ -/* FABRIC SCANNER SPECIFIC DATA */ -/*========================================================*/ - -#if HAVE_CONFIG_H -# include -#endif /* HAVE_CONFIG_H */ - -#include -#include - -#include - -#include "ibnetdiscover.h" -#include "grouping.h" - -#define OUT_BUFFER_SIZE 16 - - -extern Node *nodesdist[MAXHOPS+1]; /* last is CA list */ -extern Node *mynode; -extern Port *myport; -extern int maxhops_discovered; - -AllChassisList mylist; - -char *ChassisTypeStr[5] = { "", "ISR9288", "ISR9096", "ISR2012", "ISR2004" }; -char *ChassisSlotStr[4] = { "", "Line", "Spine", "SRBD" }; - - -char *get_chassis_type(unsigned char chassistype) -{ - if (chassistype == UNRESOLVED_CT || chassistype > ISR2004_CT) - return NULL; - return ChassisTypeStr[chassistype]; -} - -char *get_chassis_slot(unsigned char chassisslot) -{ - if (chassisslot == UNRESOLVED_CS || chassisslot > SRBD_CS) - return NULL; - return ChassisSlotStr[chassisslot]; -} - -static struct ChassisList *find_chassisnum(unsigned char chassisnum) -{ - ChassisList *current; - - for (current = mylist.first; current; current = current->next) { - if (current->chassisnum == chassisnum) - return current; - } - - return NULL; -} - -static uint64_t topspin_chassisguid(uint64_t guid) -{ - /* Byte 3 in system image GUID is chassis type, and */ - /* Byte 4 is location ID (slot) so just mask off byte 4 */ - return guid & 0xffffffff00ffffffULL; -} - -int is_xsigo_guid(uint64_t guid) -{ - if ((guid & 0xffffff0000000000ULL) == 0x0013970000000000ULL) - return 1; - else - return 0; -} - -static int is_xsigo_leafone(uint64_t guid) -{ - if ((guid & 0xffffffffff000000ULL) == 0x0013970102000000ULL) - return 1; - else - return 0; -} - -int is_xsigo_hca(uint64_t guid) -{ - /* NodeType 2 is HCA */ - if ((guid & 0xffffffff00000000ULL) == 0x0013970200000000ULL) - return 1; - else - return 0; -} - -int is_xsigo_tca(uint64_t guid) -{ - /* NodeType 3 is TCA */ - if ((guid & 
0xffffffff00000000ULL) == 0x0013970300000000ULL) - return 1; - else - return 0; -} - -static int is_xsigo_ca(uint64_t guid) -{ - if (is_xsigo_hca(guid) || is_xsigo_tca(guid)) - return 1; - else - return 0; -} - -static int is_xsigo_switch(uint64_t guid) -{ - if ((guid & 0xffffffff00000000ULL) == 0x0013970100000000ULL) - return 1; - else - return 0; -} - -static uint64_t xsigo_chassisguid(Node *node) -{ - if (!is_xsigo_ca(node->sysimgguid)) { - /* Byte 3 is NodeType and byte 4 is PortType */ - /* If NodeType is 1 (switch), PortType is masked */ - if (is_xsigo_switch(node->sysimgguid)) - return node->sysimgguid & 0xffffffff00ffffffULL; - else - return node->sysimgguid; - } else { - /* Is there a peer port ? */ - if (!node->ports->remoteport) - return node->sysimgguid; - - /* If peer port is Leaf 1, use its chassis GUID */ - if (is_xsigo_leafone(node->ports->remoteport->node->sysimgguid)) - return node->ports->remoteport->node->sysimgguid & - 0xffffffff00ffffffULL; - else - return node->sysimgguid; - } -} - -static uint64_t get_chassisguid(Node *node) -{ - if (node->vendid == TS_VENDOR_ID || node->vendid == SS_VENDOR_ID) - return topspin_chassisguid(node->sysimgguid); - else if (node->vendid == XS_VENDOR_ID || is_xsigo_guid(node->sysimgguid)) - return xsigo_chassisguid(node); - else - return node->sysimgguid; -} - -static struct ChassisList *find_chassisguid(Node *node) -{ - ChassisList *current; - uint64_t chguid; - - chguid = get_chassisguid(node); - for (current = mylist.first; current; current = current->next) { - if (current->chassisguid == chguid) - return current; - } - - return NULL; -} - -uint64_t get_chassis_guid(unsigned char chassisnum) -{ - ChassisList *chassis; - - chassis = find_chassisnum(chassisnum); - if (chassis) - return chassis->chassisguid; - else - return 0; -} - -static int is_router(Node *node) -{ - return (node->devid == VTR_DEVID_IB_FC_ROUTER || - node->devid == VTR_DEVID_IB_IP_ROUTER); -} - -static int is_spine_9096(Node *node) -{ - return 
(node->devid == VTR_DEVID_SFB4 || - node->devid == VTR_DEVID_SFB4_DDR); -} - -static int is_spine_9288(Node *node) -{ - return (node->devid == VTR_DEVID_SFB12 || - node->devid == VTR_DEVID_SFB12_DDR); -} - -static int is_spine_2004(Node *node) -{ - return (node->devid == VTR_DEVID_SFB2004); -} - -static int is_spine_2012(Node *node) -{ - return (node->devid == VTR_DEVID_SFB2012); -} - -static int is_spine(Node *node) -{ - return (is_spine_9096(node) || is_spine_9288(node) || - is_spine_2004(node) || is_spine_2012(node)); -} - -static int is_line_24(Node *node) -{ - return (node->devid == VTR_DEVID_SLB24 || - node->devid == VTR_DEVID_SLB24_DDR || - node->devid == VTR_DEVID_SRB2004); -} - -static int is_line_8(Node *node) -{ - return (node->devid == VTR_DEVID_SLB8); -} - -static int is_line_2024(Node *node) -{ - return (node->devid == VTR_DEVID_SLB2024); -} - -static int is_line(Node *node) -{ - return (is_line_24(node) || is_line_8(node) || is_line_2024(node)); -} - -int is_chassis_switch(Node *node) -{ - return (is_spine(node) || is_line(node)); -} - -/* these structs help find Line (Anafa) slot number while using spine portnum */ -char line_slot_2_sfb4[25] = { 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 }; -char anafa_line_slot_2_sfb4[25] = { 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2 }; -char line_slot_2_sfb12[25] = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10, 10, 11, 11, 12, 12 }; -char anafa_line_slot_2_sfb12[25] = { 0, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 }; - -/* IPR FCR modules connectivity while using sFB4 port as reference */ -char ipr_slot_2_sfb4_port[25] = { 0, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1 }; - -/* these structs help find Spine (Anafa) slot number while using spine portnum */ -char spine12_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -char 
anafa_spine12_slot_2_slb[25]= { 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -char spine4_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -char anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -/* reference { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24 }; */ - -static void get_sfb_slot(Node *node, Port *lineport) -{ - ChassisRecord *ch = node->chrecord; - - ch->chassisslot = SPINE_CS; - if (is_spine_9096(node)) { - ch->chassistype = ISR9096_CT; - ch->slotnum = spine4_slot_2_slb[lineport->portnum]; - ch->anafanum = anafa_spine4_slot_2_slb[lineport->portnum]; - } else if (is_spine_9288(node)) { - ch->chassistype = ISR9288_CT; - ch->slotnum = spine12_slot_2_slb[lineport->portnum]; - ch->anafanum = anafa_spine12_slot_2_slb[lineport->portnum]; - } else if (is_spine_2012(node)) { - ch->chassistype = ISR2012_CT; - ch->slotnum = spine12_slot_2_slb[lineport->portnum]; - ch->anafanum = anafa_spine12_slot_2_slb[lineport->portnum]; - } else if (is_spine_2004(node)) { - ch->chassistype = ISR2004_CT; - ch->slotnum = spine4_slot_2_slb[lineport->portnum]; - ch->anafanum = anafa_spine4_slot_2_slb[lineport->portnum]; - } else { - IBPANIC("Unexpected node found: guid 0x%016" PRIx64, node->nodeguid); - } -} - -static void get_router_slot(Node *node, Port *spineport) -{ - ChassisRecord *ch = node->chrecord; - uint64_t guessnum = 0; - - if (!ch) { - if (!(node->chrecord = calloc(1, sizeof(ChassisRecord)))) - IBPANIC("out of mem"); - ch = node->chrecord; - } - - ch->chassisslot = SRBD_CS; - if (is_spine_9096(spineport->node)) { - ch->chassistype = ISR9096_CT; - ch->slotnum = line_slot_2_sfb4[spineport->portnum]; - ch->anafanum = ipr_slot_2_sfb4_port[spineport->portnum]; - } else if (is_spine_9288(spineport->node)) { - ch->chassistype = ISR9288_CT; - ch->slotnum = line_slot_2_sfb12[spineport->portnum]; - /* this is a 
smart guess based on nodeguids order on sFB-12 module */ - guessnum = spineport->node->nodeguid % 4; - /* module 1 <--> remote anafa 3 */ - /* module 2 <--> remote anafa 2 */ - /* module 3 <--> remote anafa 1 */ - ch->anafanum = (guessnum == 3 ? 1 : (guessnum == 1 ? 3 : 2)); - } else if (is_spine_2012(spineport->node)) { - ch->chassistype = ISR2012_CT; - ch->slotnum = line_slot_2_sfb12[spineport->portnum]; - /* this is a smart guess based on nodeguids order on sFB-12 module */ - guessnum = spineport->node->nodeguid % 4; - /* module 1 <--> remote anafa 3 */ - /* module 2 <--> remote anafa 2 */ - /* module 3 <--> remote anafa 1 */ - ch->anafanum = (guessnum == 3? 1 : (guessnum == 1 ? 3 : 2)); - } else if (is_spine_2004(spineport->node)) { - ch->chassistype = ISR2004_CT; - ch->slotnum = line_slot_2_sfb4[spineport->portnum]; - ch->anafanum = ipr_slot_2_sfb4_port[spineport->portnum]; - } else { - IBPANIC("Unexpected node found: guid 0x%016" PRIx64, spineport->node->nodeguid); - } -} - -static void get_slb_slot(ChassisRecord *ch, Port *spineport) -{ - ch->chassisslot = LINE_CS; - if (is_spine_9096(spineport->node)) { - ch->chassistype = ISR9096_CT; - ch->slotnum = line_slot_2_sfb4[spineport->portnum]; - ch->anafanum = anafa_line_slot_2_sfb4[spineport->portnum]; - } else if (is_spine_9288(spineport->node)) { - ch->chassistype = ISR9288_CT; - ch->slotnum = line_slot_2_sfb12[spineport->portnum]; - ch->anafanum = anafa_line_slot_2_sfb12[spineport->portnum]; - } else if (is_spine_2012(spineport->node)) { - ch->chassistype = ISR2012_CT; - ch->slotnum = line_slot_2_sfb12[spineport->portnum]; - ch->anafanum = anafa_line_slot_2_sfb12[spineport->portnum]; - } else if (is_spine_2004(spineport->node)) { - ch->chassistype = ISR2004_CT; - ch->slotnum = line_slot_2_sfb4[spineport->portnum]; - ch->anafanum = anafa_line_slot_2_sfb4[spineport->portnum]; - } else { - IBPANIC("Unexpected node found: guid 0x%016" PRIx64, spineport->node->nodeguid); - } -} - -/* - This function called for 
every Voltaire node in fabric - It could be optimized so, but time overhead is very small - and its only diag.util -*/ -static void fill_chassis_record(Node *node) -{ - Port *port; - Node *remnode = 0; - ChassisRecord *ch = 0; - - if (node->chrecord) /* somehow this node has already been passed */ - return; - - if (!(node->chrecord = calloc(1, sizeof(ChassisRecord)))) - IBPANIC("out of mem"); - - ch = node->chrecord; - - /* node is router only in case of using unique lid */ - /* (which is lid of chassis router port) */ - /* in such case node->ports is actually a requested port... */ - if (is_router(node) && is_spine(node->ports->remoteport->node)) - get_router_slot(node, node->ports->remoteport); - else if (is_spine(node)) { - for (port = node->ports; port; port = port->next) { - if (!port->remoteport) - continue; - remnode = port->remoteport->node; - if (remnode->type != SWITCH_NODE) { - if (!remnode->chrecord) - get_router_slot(remnode, port); - continue; - } - if (!ch->chassistype) - /* we assume here that remoteport belongs to line */ - get_sfb_slot(node, port->remoteport); - - /* we could break here, but need to find if more routers connected */ - } - - } else if (is_line(node)) { - for (port = node->ports; port; port = port->next) { - if (port->portnum > 12) - continue; - if (!port->remoteport) - continue; - /* we assume here that remoteport belongs to spine */ - get_slb_slot(ch, port->remoteport); - break; - } - } - - return; -} - -static int get_line_index(Node *node) -{ - int retval = 3 * (node->chrecord->slotnum - 1) + node->chrecord->anafanum; - - if (retval > LINES_MAX_NUM || retval < 1) - IBPANIC("Internal error"); - return retval; -} - -static int get_spine_index(Node *node) -{ - int retval; - - if (is_spine_9288(node) || is_spine_2012(node)) - retval = 3 * (node->chrecord->slotnum - 1) + node->chrecord->anafanum; - else - retval = node->chrecord->slotnum; - - if (retval > SPINES_MAX_NUM || retval < 1) - IBPANIC("Internal error"); - return retval; -} 
-
-static void insert_line_router(Node *node, ChassisList *chassislist)
-{
-	int i = get_line_index(node);
-
-	if (chassislist->linenode[i])
-		return;		/* already filled slot */
-
-	chassislist->linenode[i] = node;
-	node->chrecord->chassisnum = chassislist->chassisnum;
-}
-
-static void insert_spine(Node *node, ChassisList *chassislist)
-{
-	int i = get_spine_index(node);
-
-	if (chassislist->spinenode[i])
-		return;		/* already filled slot */
-
-	chassislist->spinenode[i] = node;
-	node->chrecord->chassisnum = chassislist->chassisnum;
-}
-
-static void pass_on_lines_catch_spines(ChassisList *chassislist)
-{
-	Node *node, *remnode;
-	Port *port;
-	int i;
-
-	for (i = 1; i <= LINES_MAX_NUM; i++) {
-		node = chassislist->linenode[i];
-
-		if (!(node && is_line(node)))
-			continue;	/* empty slot or router */
-
-		for (port = node->ports; port; port = port->next) {
-			if (port->portnum > 12)
-				continue;
-
-			if (!port->remoteport)
-				continue;
-			remnode = port->remoteport->node;
-
-			if (!remnode->chrecord)
-				continue;	/* some error - spine not initialized ? FIXME */
-			insert_spine(remnode, chassislist);
-		}
-	}
-}
-
-static void pass_on_spines_catch_lines(ChassisList *chassislist)
-{
-	Node *node, *remnode;
-	Port *port;
-	int i;
-
-	for (i = 1; i <= SPINES_MAX_NUM; i++) {
-		node = chassislist->spinenode[i];
-		if (!node)
-			continue;	/* empty slot */
-		for (port = node->ports; port; port = port->next) {
-			if (!port->remoteport)
-				continue;
-			remnode = port->remoteport->node;
-
-			if (!remnode->chrecord)
-				continue;	/* some error - line/router not initialized ? FIXME */
-			insert_line_router(remnode, chassislist);
-		}
-	}
-}
-
-/*
-	Stupid interpolation algorithm...
-	But nothing to do - have to be compliant with VoltaireSM/NMS
-*/
-static void pass_on_spines_interpolate_chguid(ChassisList *chassislist)
-{
-	Node *node;
-	int i;
-
-	for (i = 1; i <= SPINES_MAX_NUM; i++) {
-		node = chassislist->spinenode[i];
-		if (!node)
-			continue;	/* skip the empty slots */
-
-		/* take first guid minus one to be consistent with SM */
-		chassislist->chassisguid = node->nodeguid - 1;
-		break;
-	}
-}
-
-/*
-	This function fills chassislist structure with all nodes
-	in that chassis
-	chassislist structure = structure of one standalone chassis
-*/
-static void build_chassis(Node *node, ChassisList *chassislist)
-{
-	Node *remnode = 0;
-	Port *port = 0;
-
-	/* we get here with node = chassis_spine */
-	chassislist->chassistype = node->chrecord->chassistype;
-	insert_spine(node, chassislist);
-
-	/* loop: pass on all ports of node */
-	for (port = node->ports; port; port = port->next) {
-		if (!port->remoteport)
-			continue;
-		remnode = port->remoteport->node;
-
-		if (!remnode->chrecord)
-			continue; /* some error - line or router not initialized ? FIXME */
-
-		insert_line_router(remnode, chassislist);
-	}
-
-	pass_on_lines_catch_spines(chassislist);
-	/* this pass needed for to catch routers, since routers connected only */
-	/* to spines in slot 1 or 4 and we could miss them first time */
-	pass_on_spines_catch_lines(chassislist);
-
-	/* additional 2 passes needed for to overcome a problem of pure "in-chassis" */
-	/* connectivity - extra pass to ensure that all related chips/modules */
-	/* inserted into the chassislist */
-	pass_on_lines_catch_spines(chassislist);
-	pass_on_spines_catch_lines(chassislist);
-	pass_on_spines_interpolate_chguid(chassislist);
-}
-
-/*========================================================*/
-/* INTERNAL TO EXTERNAL PORT MAPPING */
-/*========================================================*/
-
-/*
-Description : On ISR9288/9096 external ports indexing
-              is not matching the internal ( anafa ) port
-              indexes.
Use this MAP to translate the data you get from
-              the OpenIB diagnostics (smpquery, ibroute, ibtracert, etc.)
-
-
-Module : sLB-24
-                anafa 1             anafa 2
-ext port | 13 14 15 16 17 18 | 19 20 21 22 23 24
-int port | 22 23 24 18 17 16 | 22 23 24 18 17 16
-ext port |  1  2  3  4  5  6 |  7  8  9 10 11 12
-int port | 19 20 21 15 14 13 | 19 20 21 15 14 13
-------------------------------------------------
-
-Module : sLB-8
-                anafa 1             anafa 2
-ext port | 13 14 15 16 17 18 | 19 20 21 22 23 24
-int port | 24 23 22 18 17 16 | 24 23 22 18 17 16
-ext port |  1  2  3  4  5  6 |  7  8  9 10 11 12
-int port | 21 20 19 15 14 13 | 21 20 19 15 14 13
-
------------->
-                anafa 1             anafa 2
-ext port |  -  -  5  -  -  6 |  -  -  7  -  -  8
-int port | 24 23 22 18 17 16 | 24 23 22 18 17 16
-ext port |  -  -  1  -  -  2 |  -  -  3  -  -  4
-int port | 21 20 19 15 14 13 | 21 20 19 15 14 13
-------------------------------------------------
-
-Module : sLB-2024
-
-ext port   | 13 14 15 16 17 18 19 20 21 22 23 24
-A1 int port| 13 14 15 16 17 18 19 20 21 22 23 24
-ext port   |  1  2  3  4  5  6  7  8  9 10 11 12
-A2 int port| 13 14 15 16 17 18 19 20 21 22 23 24
---------------------------------------------------
-
-*/
-
-int int2ext_map_slb24[2][25] = {
-	{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 5, 4, 18, 17, 16, 1, 2, 3, 13, 14, 15 },
-	{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 11, 10, 24, 23, 22, 7, 8, 9, 19, 20, 21 }
-	};
-int int2ext_map_slb8[2][25] = {
-	{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 6, 6, 6, 1, 1, 1, 5, 5, 5 },
-	{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 8, 8, 8, 3, 3, 3, 7, 7, 7 }
-	};
-int int2ext_map_slb2024[2][25] = {
-	{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 },
-	{ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 }
-	};
-/* reference { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }; */
-
-/*
-	This function relevant only for line modules/chips
-	Returns string with external port index
-*/
-char *portmapstring(Port *port)
-{
-	static char mapping[OUT_BUFFER_SIZE];
-	ChassisRecord *ch = port->node->chrecord;
-	int portnum = port->portnum;
-	int chipnum = 0;
-	int pindex = 0;
-	Node *node = port->node;
-
-	if (!ch || !is_line(node) || (portnum < 13 || portnum > 24))
-		return NULL;
-
-	if (ch->anafanum < 1 || ch->anafanum > 2)
-		return NULL;
-
-	memset(mapping, 0, sizeof(mapping));
-
-	chipnum = ch->anafanum - 1;
-
-	if (is_line_24(node))
-		pindex = int2ext_map_slb24[chipnum][portnum];
-	else if (is_line_2024(node))
-		pindex = int2ext_map_slb2024[chipnum][portnum];
-	else
-		pindex = int2ext_map_slb8[chipnum][portnum];
-
-	sprintf(mapping, "[ext %d]", pindex);
-
-	return mapping;
-}
-
-static void add_chassislist()
-{
-	if (!(mylist.current = calloc(1, sizeof(ChassisList))))
-		IBPANIC("out of mem");
-
-	if (mylist.first == NULL) {
-		mylist.first = mylist.current;
-		mylist.last = mylist.current;
-	} else {
-		mylist.last->next = mylist.current;
-		mylist.current->next = NULL;
-		mylist.last = mylist.current;
-	}
-}
-
-/*
-	Main grouping function
-	Algorithm:
-	1. pass on every Voltaire node
-	2. catch spine chip for every Voltaire node
-		2.1 build/interpolate chassis around this chip
-		2.2 go to 1.
-	3. pass on non Voltaire nodes (SystemImageGUID based grouping)
-	4. now group non Voltaire nodes by SystemImageGUID
-*/
-ChassisList *group_nodes()
-{
-	Node *node;
-	int dist;
-	int chassisnum = 0;
-	struct ChassisList *chassis;
-
-	mylist.first = NULL;
-	mylist.current = NULL;
-	mylist.last = NULL;
-
-	/* first pass on switches and build for every Voltaire node */
-	/* an appropriate chassis record (slotnum and position) */
-	/* according to internal connectivity */
-	/* not very efficient but clear code so...
*/
-	for (dist = 0; dist <= maxhops_discovered; dist++) {
-		for (node = nodesdist[dist]; node; node = node->dnext) {
-			if (node->vendid == VTR_VENDOR_ID)
-				fill_chassis_record(node);
-		}
-	}
-
-	/* separate every Voltaire chassis from each other and build linked list of them */
-	/* algorithm: catch spine and find all surrounding nodes */
-	for (dist = 0; dist <= maxhops_discovered; dist++) {
-		for (node = nodesdist[dist]; node; node = node->dnext) {
-			if (node->vendid != VTR_VENDOR_ID)
-				continue;
-			if (!node->chrecord || node->chrecord->chassisnum || !is_spine(node))
-				continue;
-			add_chassislist();
-			mylist.current->chassisnum = ++chassisnum;
-			build_chassis(node, mylist.current);
-		}
-	}
-
-	/* now make pass on nodes for chassis which are not Voltaire */
-	/* grouped by common SystemImageGUID */
-	for (dist = 0; dist <= maxhops_discovered; dist++) {
-		for (node = nodesdist[dist]; node; node = node->dnext) {
-			if (node->vendid == VTR_VENDOR_ID)
-				continue;
-			if (node->sysimgguid) {
-				chassis = find_chassisguid(node);
-				if (chassis)
-					chassis->nodecount++;
-				else {
-					/* Possible new chassis */
-					add_chassislist();
-					mylist.current->chassisguid = get_chassisguid(node);
-					mylist.current->nodecount = 1;
-				}
-			}
-		}
-	}
-
-	/* now, make another pass to see which nodes are part of chassis */
-	/* (defined as chassis->nodecount > 1) */
-	for (dist = 0; dist <= MAXHOPS; ) {
-		for (node = nodesdist[dist]; node; node = node->dnext) {
-			if (node->vendid == VTR_VENDOR_ID)
-				continue;
-			if (node->sysimgguid) {
-				chassis = find_chassisguid(node);
-				if (chassis && chassis->nodecount > 1) {
-					if (!chassis->chassisnum)
-						chassis->chassisnum = ++chassisnum;
-					if (!node->chrecord) {
-						if (!(node->chrecord = calloc(1, sizeof(ChassisRecord))))
-							IBPANIC("out of mem");
-						node->chrecord->chassisnum = chassis->chassisnum;
-					}
-				}
-			}
-		}
-		if (dist == maxhops_discovered)
-			dist = MAXHOPS;	/* skip to CAs */
-		else
-			dist++;
-	}
-
-	return (mylist.first);
-}
diff --git
a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index 25c1f7f..99750f0 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -1,6 +1,7 @@ /* * Copyright (c) 2004-2008 Voltaire Inc. All rights reserved. * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. + * Copyright (c) 2008 Lawrence Livermore National Lab. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -48,445 +49,108 @@ #include #include #include +#include #include "ibnetdiscover.h" -#include "grouping.h" #include "ibdiag_common.h" struct ibmad_port *srcport; -static char *node_type_str[] = { - "???", - "ca", - "switch", - "router", - "iwarp rnic" -}; - -static char *linkwidth_str[] = { - "??", - "1x", - "4x", - "??", - "8x", - "??", - "??", - "??", - "12x" -}; - -static char *linkspeed_str[] = { - "???", - "SDR", - "DDR", - "???", - "QDR" -}; - static int timeout = 2000; /* ms */ -static int dumplevel = 0; static FILE *f; static char *node_name_map_file = NULL; static nn_map_t *node_name_map = NULL; -Node *nodesdist[MAXHOPS+1]; /* last is Ca list */ -Node *mynode; -int maxhops_discovered = 0; - -struct ChassisList *chassis = NULL; - -static char * -get_linkwidth_str(int linkwidth) +/** + * Define our own conversion functions to maintain compatibility with the old + * ibnetdiscover which did not use the ibmad conversion functions. 
+ */ +char *dump_linkspeed_compat(uint32_t speed) { - if (linkwidth > 8) - return linkwidth_str[0]; - else - return linkwidth_str[linkwidth]; + switch (speed) { + case 1: + return ("SDR"); + break; + case 2: + return ("DDR"); + break; + case 4: + return ("QDR"); + break; + } + return ("???"); } -static char * -get_linkspeed_str(int linkspeed) +char *dump_linkwidth_compat(uint32_t width) { - if (linkspeed > 4) - return linkspeed_str[0]; - else - return linkspeed_str[linkspeed]; + switch (width) { + case 1: + return ("1x"); + break; + case 2: + return ("4x"); + break; + case 4: + return ("8x"); + break; + case 8: + return ("12x"); + break; + } + return ("??"); } static inline const char* -node_type_str2(Node *node) +ports_nt_str_compat(ibnd_node_t *node) { switch(node->type) { - case SWITCH_NODE: return "SW"; - case CA_NODE: return "CA"; - case ROUTER_NODE: return "RT"; + case IB_NODE_SWITCH: return "SW"; + case IB_NODE_CA: return "CA"; + case IB_NODE_ROUTER: return "RT"; } return "??"; } -void -decode_port_info(void *pi, Port *port) -{ - mad_decode_field(pi, IB_PORT_LID_F, &port->lid); - mad_decode_field(pi, IB_PORT_LMC_F, &port->lmc); - mad_decode_field(pi, IB_PORT_STATE_F, &port->state); - mad_decode_field(pi, IB_PORT_PHYS_STATE_F, &port->physstate); - mad_decode_field(pi, IB_PORT_LINK_WIDTH_ACTIVE_F, &port->linkwidth); - mad_decode_field(pi, IB_PORT_LINK_SPEED_ACTIVE_F, &port->linkspeed); -} - - -int -get_port(Port *port, int portnum, ib_portid_t *portid) -{ - char portinfo[64]; - void *pi = portinfo; - - port->portnum = portnum; - - if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout, - srcport)) - return -1; - decode_port_info(pi, port); - - DEBUG("portid %s portnum %d: lid %d state %d physstate %d %s %s", - portid2str(portid), portnum, port->lid, port->state, port->physstate, get_linkwidth_str(port->linkwidth), get_linkspeed_str(port->linkspeed)); - return 1; -} -/* - * Returns 0 if non switch node is found, 1 if switch is found, -1 if error. 
- */ -int -get_node(Node *node, Port *port, ib_portid_t *portid) -{ - char portinfo[64]; - char switchinfo[64]; - void *pi = portinfo, *ni = node->nodeinfo, *nd = node->nodedesc; - void *si = switchinfo; - - if (!smp_query_via(ni, portid, IB_ATTR_NODE_INFO, 0, timeout, srcport)) - return -1; - - mad_decode_field(ni, IB_NODE_GUID_F, &node->nodeguid); - mad_decode_field(ni, IB_NODE_TYPE_F, &node->type); - mad_decode_field(ni, IB_NODE_NPORTS_F, &node->numports); - mad_decode_field(ni, IB_NODE_DEVID_F, &node->devid); - mad_decode_field(ni, IB_NODE_VENDORID_F, &node->vendid); - mad_decode_field(ni, IB_NODE_SYSTEM_GUID_F, &node->sysimgguid); - mad_decode_field(ni, IB_NODE_PORT_GUID_F, &node->portguid); - mad_decode_field(ni, IB_NODE_LOCAL_PORT_F, &node->localport); - port->portnum = node->localport; - port->portguid = node->portguid; - - if (!smp_query_via(nd, portid, IB_ATTR_NODE_DESC, 0, timeout, srcport)) - return -1; - - if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, 0, timeout, srcport)) - return -1; - decode_port_info(pi, port); - - if (node->type != SWITCH_NODE) - return 0; - - node->smalid = port->lid; - node->smalmc = port->lmc; - - /* after we have the sma information find out the real PortInfo for this port */ - if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, node->localport, - timeout, srcport)) - return -1; - decode_port_info(pi, port); - - port->lid = node->smalid; /* LID is still defined by port 0 */ - port->lmc = node->smalmc; - - if (!smp_query_via(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout, srcport)) - node->smaenhsp0 = 0; /* assume base SP0 */ - else - mad_decode_field(si, IB_SW_ENHANCED_PORT0_F, &node->smaenhsp0); - - DEBUG("portid %s: got switch node %" PRIx64 " '%s'", - portid2str(portid), node->nodeguid, node->nodedesc); - return 1; -} - -static int -extend_dpath(ib_dr_path_t *path, int nextport) -{ - if (path->cnt+2 >= sizeof(path->p)) - return -1; - ++path->cnt; - if (path->cnt > maxhops_discovered) - maxhops_discovered = path->cnt; - 
path->p[path->cnt] = (uint8_t) nextport; - return path->cnt; -} - -static void -dump_endnode(ib_portid_t *path, char *prompt, Node *node, Port *port) -{ - if (!dumplevel) - return; - - fprintf(f, "%s -> %s %s {%016" PRIx64 "} portnum %d lid %d-%d\"%s\"\n", - portid2str(path), prompt, - (node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"), - node->nodeguid, node->type == SWITCH_NODE ? 0 : port->portnum, - port->lid, port->lid + (1 << port->lmc) - 1, - clean_nodedesc(node->nodedesc)); -} - -#define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103))) -#define HTSZ 137 - -static Node *nodestbl[HTSZ]; - -static Node * -find_node(Node *new) -{ - int hash = HASHGUID(new->nodeguid) % HTSZ; - Node *node; - - for (node = nodestbl[hash]; node; node = node->htnext) - if (node->nodeguid == new->nodeguid) - return node; - - return NULL; -} - -static Node * -create_node(Node *temp, ib_portid_t *path, int dist) -{ - Node *node; - int hash = HASHGUID(temp->nodeguid) % HTSZ; - - node = malloc(sizeof(*node)); - if (!node) - return NULL; - - memcpy(node, temp, sizeof(*node)); - node->dist = dist; - node->path = *path; - - node->htnext = nodestbl[hash]; - nodestbl[hash] = node; - - if (node->type != SWITCH_NODE) - dist = MAXHOPS; /* special Ca list */ - - node->dnext = nodesdist[dist]; - nodesdist[dist] = node; - - return node; -} - -static Port * -find_port(Node *node, Port *port) -{ - Port *old; - - for (old = node->ports; old; old = old->next) - if (old->portnum == port->portnum) - return old; - - return NULL; -} - -static Port * -create_port(Node *node, Port *temp) -{ - Port *port; - - port = malloc(sizeof(*port)); - if (!port) - return NULL; - - memcpy(port, temp, sizeof(*port)); - port->node = node; - port->next = node->ports; - node->ports = port; - - return port; -} - -static void -link_ports(Node *node, Port *port, Node *remotenode, Port *remoteport) -{ - DEBUG("linking: 0x%" PRIx64 " %p->%p:%u and 0x%" PRIx64 " %p->%p:%u", - 
node->nodeguid, node, port, port->portnum, - remotenode->nodeguid, remotenode, remoteport, remoteport->portnum); - if (port->remoteport) - port->remoteport->remoteport = NULL; - if (remoteport->remoteport) - remoteport->remoteport->remoteport = NULL; - port->remoteport = remoteport; - remoteport->remoteport = port; -} - -static int -handle_port(Node *node, Port *port, ib_portid_t *path, int portnum, int dist) -{ - Node node_buf; - Port port_buf; - Node *remotenode, *oldnode; - Port *remoteport, *oldport; - - memset(&node_buf, 0, sizeof(node_buf)); - memset(&port_buf, 0, sizeof(port_buf)); - - DEBUG("handle node %p port %p:%d dist %d", node, port, portnum, dist); - if (port->physstate != 5) /* LinkUp */ - return -1; - - if (extend_dpath(&path->drpath, portnum) < 0) - return -1; - - if (get_node(&node_buf, &port_buf, path) < 0) { - IBWARN("NodeInfo on %s failed, skipping port", - portid2str(path)); - path->drpath.cnt--; /* restore path */ - return -1; - } - - oldnode = find_node(&node_buf); - if (oldnode) - remotenode = oldnode; - else if (!(remotenode = create_node(&node_buf, path, dist + 1))) - IBERROR("no memory"); - - oldport = find_port(remotenode, &port_buf); - if (oldport) { - remoteport = oldport; - if (node != remotenode || port != remoteport) - IBWARN("port moving..."); - } else if (!(remoteport = create_port(remotenode, &port_buf))) - IBERROR("no memory"); - - dump_endnode(path, oldnode ? "known remote" : "new remote", - remotenode, remoteport); - - link_ports(node, port, remotenode, remoteport); - - path->drpath.cnt--; /* restore path */ - return 0; -} - -/* - * Return 1 if found, 0 if not, -1 on errors. 
- */ -static int -discover(ib_portid_t *from) -{ - Node node_buf; - Port port_buf; - Node *node; - Port *port; - int i; - int dist = 0; - ib_portid_t *path; - - DEBUG("from %s", portid2str(from)); - - memset(&node_buf, 0, sizeof(node_buf)); - memset(&port_buf, 0, sizeof(port_buf)); - - if (get_node(&node_buf, &port_buf, from) < 0) { - IBWARN("can't reach node %s", portid2str(from)); - return -1; - } - - node = create_node(&node_buf, from, 0); - if (!node) - IBERROR("out of memory"); - - mynode = node; - - port = create_port(node, &port_buf); - if (!port) - IBERROR("out of memory"); - - if (node->type != SWITCH_NODE && - handle_port(node, port, from, node->localport, 0) < 0) - return 0; - - for (dist = 0; dist < MAXHOPS; dist++) { - - for (node = nodesdist[dist]; node; node = node->dnext) { - - path = &node->path; - - DEBUG("dist %d node %p", dist, node); - dump_endnode(path, "processing", node, port); - - for (i = 1; i <= node->numports; i++) { - if (i == node->localport) - continue; - - if (get_port(&port_buf, i, path) < 0) { - IBWARN("can't reach node %s port %d", portid2str(path), i); - continue; - } - - port = find_port(node, &port_buf); - if (port) - continue; - - port = create_port(node, &port_buf); - if (!port) - IBERROR("out of memory"); - - /* If switch, set port GUID to node GUID */ - if (node->type == SWITCH_NODE) - port->portguid = node->portguid; - - handle_port(node, port, path, i, dist); - } - } - } - - return 0; -} - char * -node_name(Node *node) +node_name(ibnd_node_t *node) { static char buf[256]; switch(node->type) { - case SWITCH_NODE: + case IB_NODE_SWITCH: sprintf(buf, "\"%s", "S"); break; - case CA_NODE: + case IB_NODE_CA: sprintf(buf, "\"%s", "H"); break; - case ROUTER_NODE: + case IB_NODE_ROUTER: sprintf(buf, "\"%s", "R"); break; default: sprintf(buf, "\"%s", "?"); break; } - sprintf(buf+2, "-%016" PRIx64 "\"", node->nodeguid); + sprintf(buf+2, "-%016" PRIx64 "\"", node->guid); return buf; } void -list_node(Node *node) 
+list_node(ibnd_node_t *node, void *user_data) { char *node_type; - char *nodename = remap_node_name(node_name_map, node->nodeguid, + char *nodename = remap_node_name(node_name_map, node->guid, node->nodedesc); switch(node->type) { - case SWITCH_NODE: + case IB_NODE_SWITCH: node_type = "Switch"; break; - case CA_NODE: + case IB_NODE_CA: node_type = "Ca"; break; - case ROUTER_NODE: + case IB_NODE_ROUTER: node_type = "Router"; break; default: @@ -495,36 +159,58 @@ list_node(Node *node) } fprintf(f, "%s\t : 0x%016" PRIx64 " ports %d devid 0x%x vendid 0x%x \"%s\"\n", node_type, - node->nodeguid, node->numports, node->devid, node->vendid, + node->guid, node->numports, + mad_get_field(node->info, 0, IB_NODE_DEVID_F), + mad_get_field(node->info, 0, IB_NODE_VENDORID_F), nodename); free(nodename); } void -out_ids(Node *node, int group, char *chname) +list_nodes(ibnd_fabric_t *fabric, int list) { - fprintf(f, "\nvendid=0x%x\ndevid=0x%x\n", node->vendid, node->devid); - if (node->sysimgguid) - fprintf(f, "sysimgguid=0x%" PRIx64, node->sysimgguid); + if (list & LIST_CA_NODE) { + ibnd_iter_nodes_type(fabric, list_node, IB_NODE_CA, NULL); + } + if (list & LIST_SWITCH_NODE) { + ibnd_iter_nodes_type(fabric, list_node, IB_NODE_SWITCH, NULL); + } + if (list & LIST_ROUTER_NODE) { + ibnd_iter_nodes_type(fabric, list_node, IB_NODE_ROUTER, NULL); + } +} + +void +out_ids(ibnd_node_t *node, int group, char *chname) +{ + uint64_t sysimgguid = mad_get_field64(node->info, 0, IB_NODE_SYSTEM_GUID_F); + + fprintf(f, "\nvendid=0x%x\ndevid=0x%x\n", + mad_get_field(node->info, 0, IB_NODE_VENDORID_F), + mad_get_field(node->info, 0, IB_NODE_DEVID_F)); + if (sysimgguid) + fprintf(f, "sysimgguid=0x%" PRIx64, sysimgguid); if (group - && node->chrecord && node->chrecord->chassisnum) { - fprintf(f, "\t\t# Chassis %d", node->chrecord->chassisnum); + && node->chassis && node->chassis->chassisnum) { + fprintf(f, "\t\t# Chassis %d", node->chassis->chassisnum); if (chname) - fprintf(f, " (%s)", chname); - if 
(is_xsigo_tca(node->nodeguid) && node->ports->remoteport) - fprintf(f, " slot %d", node->ports->remoteport->portnum); + fprintf(f, " (%s)", clean_nodedesc(chname)); + if (ibnd_is_xsigo_tca(node->guid) + && node->ports[1] + && node->ports[1]->remoteport) + fprintf(f, " slot %d", node->ports[1]->remoteport->portnum); } fprintf(f, "\n"); } uint64_t -out_chassis(unsigned char chassisnum) +out_chassis(ibnd_fabric_t *fabric, int chassisnum) { uint64_t guid; fprintf(f, "\nChassis %d", chassisnum); - guid = get_chassis_guid(chassisnum); + guid = ibnd_get_chassis_guid(fabric, chassisnum); if (guid) fprintf(f, " (guid 0x%" PRIx64 ")", guid); fprintf(f, "\n"); @@ -532,29 +218,25 @@ out_chassis(unsigned char chassisnum) } void -out_switch(Node *node, int group, char *chname) +out_switch(ibnd_node_t *node, int group, char *chname) { char *str; + char str2[256]; char *nodename = NULL; out_ids(node, group, chname); - fprintf(f, "switchguid=0x%" PRIx64, node->nodeguid); - fprintf(f, "(%" PRIx64 ")", node->portguid); - /* Currently, only if Voltaire chassis */ - if (group - && node->chrecord && node->chrecord->chassisnum - && node->vendid == VTR_VENDOR_ID) { - str = get_chassis_type(node->chrecord->chassistype); + fprintf(f, "switchguid=0x%" PRIx64, node->guid); + fprintf(f, "(%" PRIx64 ")", mad_get_field64(node->info, 0, IB_NODE_PORT_GUID_F)); + if (group) { + str = ibnd_get_chassis_type(node); if (str) fprintf(f, "%s ", str); - str = get_chassis_slot(node->chrecord->chassisslot); + str = ibnd_get_chassis_slot_str(node, str2, 256); if (str) - fprintf(f, "%s ", str); - fprintf(f, "%d Chip %d", node->chrecord->slotnum, node->chrecord->anafanum); + fprintf(f, "%s", str); } - nodename = remap_node_name(node_name_map, node->nodeguid, - node->nodedesc); + nodename = remap_node_name(node_name_map, node->guid, node->nodedesc); fprintf(f, "\nSwitch\t%d %s\t\t# \"%s\" %s port 0 lid %d lmc %d\n", node->numports, node_name(node), @@ -566,20 +248,18 @@ out_switch(Node *node, int group, char 
*chname) } void -out_ca(Node *node, int group, char *chname) +out_ca(ibnd_node_t *node, int group, char *chname) { char *node_type; char *node_type2; - char *nodename = remap_node_name(node_name_map, node->nodeguid, - node->nodedesc); out_ids(node, group, chname); switch(node->type) { - case CA_NODE: + case IB_NODE_CA: node_type = "ca"; node_type2 = "Ca"; break; - case ROUTER_NODE: + case IB_NODE_ROUTER: node_type = "rt"; node_type2 = "Rt"; break; @@ -589,37 +269,41 @@ out_ca(Node *node, int group, char *chname) break; } - fprintf(f, "%sguid=0x%" PRIx64 "\n", node_type, node->nodeguid); + fprintf(f, "%sguid=0x%" PRIx64 "\n", node_type, node->guid); fprintf(f, "%s\t%d %s\t\t# \"%s\"", node_type2, node->numports, node_name(node), - nodename); - if (group && is_xsigo_hca(node->nodeguid)) + clean_nodedesc(node->nodedesc)); + if (group && ibnd_is_xsigo_hca(node->guid)) fprintf(f, " (scp)"); fprintf(f, "\n"); - - free(nodename); } +#define OUT_BUFFER_SIZE 16 static char * -out_ext_port(Port *port, int group) +out_ext_port(ibnd_port_t *port, int group) { - char *str = NULL; + static char mapping[OUT_BUFFER_SIZE]; - /* Currently, only if Voltaire chassis */ - if (group - && port->node->chrecord && port->node->vendid == VTR_VENDOR_ID) - str = portmapstring(port); + if (group && port->ext_portnum != 0) { + snprintf(mapping, OUT_BUFFER_SIZE, + "[ext %d]", port->ext_portnum); + return (mapping); + } - return (str); + return (NULL); } void -out_switch_port(Port *port, int group) +out_switch_port(ibnd_port_t *port, int group) { char *ext_port_str = NULL; char *rem_nodename = NULL; + uint32_t iwidth = mad_get_field(port->info, 0, + IB_PORT_LINK_WIDTH_ACTIVE_F); + uint32_t ispeed = mad_get_field(port->info, 0, + IB_PORT_LINK_SPEED_ACTIVE_F); - DEBUG("port %p:%d remoteport %p", port, port->portnum, port->remoteport); + DEBUG("port %p:%d remoteport %p\n", port, port->portnum, port->remoteport); fprintf(f, "[%d]", port->portnum); ext_port_str = out_ext_port(port, group); @@ -627,7 
+311,7 @@ out_switch_port(Port *port, int group) fprintf(f, "%s", ext_port_str); rem_nodename = remap_node_name(node_name_map, - port->remoteport->node->nodeguid, + port->remoteport->node->guid, port->remoteport->node->nodedesc); ext_port_str = out_ext_port(port->remoteport, group); @@ -635,17 +319,19 @@ out_switch_port(Port *port, int group) node_name(port->remoteport->node), port->remoteport->portnum, ext_port_str ? ext_port_str : ""); - if (port->remoteport->node->type != SWITCH_NODE) - fprintf(f, "(%" PRIx64 ") ", port->remoteport->portguid); + if (port->remoteport->node->type != IB_NODE_SWITCH) + fprintf(f, "(%" PRIx64 ") ", port->remoteport->guid); fprintf(f, "\t\t# \"%s\" lid %d %s%s", rem_nodename, - port->remoteport->node->type == SWITCH_NODE ? port->remoteport->node->smalid : port->remoteport->lid, - get_linkwidth_str(port->linkwidth), - get_linkspeed_str(port->linkspeed)); + port->remoteport->node->type == IB_NODE_SWITCH ? + port->remoteport->node->smalid : + port->remoteport->base_lid, + dump_linkwidth_compat(iwidth), + dump_linkspeed_compat(ispeed)); - if (is_xsigo_tca(port->remoteport->portguid)) + if (ibnd_is_xsigo_tca(port->remoteport->guid)) fprintf(f, " slot %d", port->portnum); - else if (is_xsigo_hca(port->remoteport->portguid)) + else if (ibnd_is_xsigo_hca(port->remoteport->guid)) fprintf(f, " (scp)"); fprintf(f, "\n"); @@ -653,281 +339,275 @@ out_switch_port(Port *port, int group) } void -out_ca_port(Port *port, int group) +out_ca_port(ibnd_port_t *port, int group) { char *str = NULL; char *rem_nodename = NULL; + uint32_t iwidth = mad_get_field(port->info, 0, + IB_PORT_LINK_WIDTH_ACTIVE_F); + uint32_t ispeed = mad_get_field(port->info, 0, + IB_PORT_LINK_SPEED_ACTIVE_F); fprintf(f, "[%d]", port->portnum); - if (port->node->type != SWITCH_NODE) - fprintf(f, "(%" PRIx64 ") ", port->portguid); + if (port->node->type != IB_NODE_SWITCH) + fprintf(f, "(%" PRIx64 ") ", port->guid); fprintf(f, "\t%s[%d]", node_name(port->remoteport->node), 
port->remoteport->portnum); str = out_ext_port(port->remoteport, group); if (str) fprintf(f, "%s", str); - if (port->remoteport->node->type != SWITCH_NODE) - fprintf(f, " (%" PRIx64 ") ", port->remoteport->portguid); + if (port->remoteport->node->type != IB_NODE_SWITCH) + fprintf(f, " (%" PRIx64 ") ", port->remoteport->guid); rem_nodename = remap_node_name(node_name_map, - port->remoteport->node->nodeguid, + port->remoteport->node->guid, port->remoteport->node->nodedesc); fprintf(f, "\t\t# lid %d lmc %d \"%s\" lid %d %s%s\n", - port->lid, port->lmc, rem_nodename, - port->remoteport->node->type == SWITCH_NODE ? port->remoteport->node->smalid : port->remoteport->lid, - get_linkwidth_str(port->linkwidth), - get_linkspeed_str(port->linkspeed)); + port->base_lid, port->lmc, rem_nodename, + port->remoteport->node->type == IB_NODE_SWITCH ? + port->remoteport->node->smalid : + port->remoteport->base_lid, + dump_linkwidth_compat(iwidth), + dump_linkspeed_compat(ispeed)); free(rem_nodename); } + +struct iter_user_data { + int group; + int skip_chassis_nodes; +}; + +static void +switch_iter_func(ibnd_node_t *node, void *iter_user_data) +{ + ibnd_port_t *port; + int p = 0; + struct iter_user_data *data = (struct iter_user_data *)iter_user_data; + + DEBUG("SWITCH: node %p\n", node); + + /* skip chassis based switches if flagged */ + if (data->skip_chassis_nodes && node->chassis && node->chassis->chassisnum) + return; + + out_switch(node, data->group, NULL); + for (p = 1; p <= node->numports; p++) { + port = node->ports[p]; + if (port && port->remoteport) + out_switch_port(port, data->group); + } +} + +static void +ca_iter_func(ibnd_node_t *node, void *iter_user_data) +{ + ibnd_port_t *port; + int p = 0; + struct iter_user_data *data = (struct iter_user_data *)iter_user_data; + + DEBUG("CA: node %p\n", node); + /* Now, skip chassis based CAs */ + if (data->group && node->chassis && node->chassis->chassisnum) + return; + out_ca(node, data->group, NULL); + + for (p = 1; p <= 
node->numports; p++) { + port = node->ports[p]; + if (port && port->remoteport) + out_ca_port(port, data->group); + } +} + +static void +router_iter_func(ibnd_node_t *node, void *iter_user_data) +{ + ibnd_port_t *port; + int p = 0; + struct iter_user_data *data = (struct iter_user_data *)iter_user_data; + + DEBUG("RT: node %p\n", node); + /* Now, skip chassis based RTs */ + if (data->group && node->chassis && + node->chassis->chassisnum) + return; + out_ca(node, data->group, NULL); + for (p = 1; p <= node->numports; p++) { + port = node->ports[p]; + if (port && port->remoteport) + out_ca_port(port, data->group); + } +} + int -dump_topology(int listtype, int group) +dump_topology(int group, ibnd_fabric_t *fabric) { - Node *node; - Port *port; - int i = 0, dist = 0; + ibnd_node_t *node; + ibnd_port_t *port; + int i = 0, p = 0; time_t t = time(0); uint64_t chguid; char *chname = NULL; + struct iter_user_data iter_user_data; - if (!listtype) { - fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t)); - fprintf(f, "# Max of %d hops discovered\n", maxhops_discovered); - fprintf(f, "# Initiated from node %016" PRIx64 " port %016" PRIx64 "\n", mynode->nodeguid, mynode->portguid); - } + fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t)); + fprintf(f, "# Max of %d hops discovered\n", fabric->maxhops_discovered); + fprintf(f, "# Initiated from node %016" PRIx64 " port %016" PRIx64 "\n", + fabric->from_node->guid, + mad_get_field64(fabric->from_node->info, 0, IB_NODE_PORT_GUID_F)); /* Make pass on switches */ - if (group && !listtype) { - ChassisList *ch = NULL; + if (group) { + ibnd_chassis_t *ch = NULL; /* Chassis based switches first */ - for (ch = chassis; ch; ch = ch->next) { + for (ch = fabric->chassis; ch; ch = ch->next) { int n = 0; if (!ch->chassisnum) continue; - chguid = out_chassis(ch->chassisnum); - if (chname) - free(chname); + chguid = out_chassis(fabric, ch->chassisnum); + chname = NULL; - if (is_xsigo_guid(chguid)) { - for (node = 
nodesdist[MAXHOPS]; node; node = node->dnext) { - if (!node->chrecord || - !node->chrecord->chassisnum) - continue; - - if (node->chrecord->chassisnum != ch->chassisnum) - continue; - - if (is_xsigo_hca(node->nodeguid)) { - chname = remap_node_name(node_name_map, - node->nodeguid, - node->nodedesc); - fprintf(f, "Hostname: %s\n", chname); +/** + * Will this work for Xsigo? + */ + if (ibnd_is_xsigo_guid(chguid)) { + for (node = ch->nodes; node; + node = node->next_chassis_node) { + if (ibnd_is_xsigo_hca(node->guid)) { + chname = node->nodedesc; + fprintf(f, "Hostname: %s\n", clean_nodedesc(node->nodedesc)); } } } fprintf(f, "\n# Spine Nodes"); - for (n = 1; n <= (SPINES_MAX_NUM+1); n++) { + for (n = 1; n <= SPINES_MAX_NUM; n++) { if (ch->spinenode[n]) { out_switch(ch->spinenode[n], group, chname); - for (port = ch->spinenode[n]->ports; port; port = port->next, i++) - if (port->remoteport) + for (p = 1; p <= ch->spinenode[n]->numports; p++) { + port = ch->spinenode[n]->ports[p]; + if (port && port->remoteport) out_switch_port(port, group); + } } } fprintf(f, "\n# Line Nodes"); - for (n = 1; n <= (LINES_MAX_NUM+1); n++) { + for (n = 1; n <= LINES_MAX_NUM; n++) { if (ch->linenode[n]) { out_switch(ch->linenode[n], group, chname); - for (port = ch->linenode[n]->ports; port; port = port->next, i++) - if (port->remoteport) + for (p = 1; p <= ch->linenode[n]->numports; p++) { + port = ch->linenode[n]->ports[p]; + if (port && port->remoteport) out_switch_port(port, group); + } } } fprintf(f, "\n# Chassis Switches"); - for (dist = 0; dist <= maxhops_discovered; dist++) { - - for (node = nodesdist[dist]; node; node = node->dnext) { - - /* Non Voltaire chassis */ - if (node->vendid == VTR_VENDOR_ID) - continue; - if (!node->chrecord || - !node->chrecord->chassisnum) - continue; - - if (node->chrecord->chassisnum != ch->chassisnum) - continue; - + for (node = ch->nodes; node; + node = node->next_chassis_node) { + if (node->type == IB_NODE_SWITCH) { out_switch(node, group, 
chname); - for (port = node->ports; port; port = port->next, i++) - if (port->remoteport) + for (p = 1; p <= node->numports; p++) { + port = node->ports[p]; + if (port && port->remoteport) out_switch_port(port, group); - + } } } fprintf(f, "\n# Chassis CAs"); - for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { - if (!node->chrecord || - !node->chrecord->chassisnum) - continue; - - if (node->chrecord->chassisnum != ch->chassisnum) - continue; - - out_ca(node, group, chname); - for (port = node->ports; port; port = port->next, i++) - if (port->remoteport) - out_ca_port(port, group); - + for (node = ch->nodes; node; + node = node->next_chassis_node) { + if (node->type == IB_NODE_CA) { + out_ca(node, group, chname); + for (p = 1; p <= node->numports; p++) { + port = node->ports[p]; + if (port && port->remoteport) + out_ca_port(port, group); + } + } } } - } else { - for (dist = 0; dist <= maxhops_discovered; dist++) { - - for (node = nodesdist[dist]; node; node = node->dnext) { - DEBUG("SWITCH: dist %d node %p", dist, node); - if (!listtype) - out_switch(node, group, chname); - else { - if (listtype & LIST_SWITCH_NODE) - list_node(node); - continue; - } - - for (port = node->ports; port; port = port->next, i++) - if (port->remoteport) - out_switch_port(port, group); - } - } + } else { /* !group */ + iter_user_data.group = group; + iter_user_data.skip_chassis_nodes = 0; + ibnd_iter_nodes_type(fabric, switch_iter_func, + IB_NODE_SWITCH, &iter_user_data); } - if (chname) - free(chname); chname = NULL; - if (group && !listtype) { + if (group) { + iter_user_data.group = group; + iter_user_data.skip_chassis_nodes = 1; fprintf(f, "\nNon-Chassis Nodes\n"); - for (dist = 0; dist <= maxhops_discovered; dist++) { - - for (node = nodesdist[dist]; node; node = node->dnext) { - - DEBUG("SWITCH: dist %d node %p", dist, node); - /* Now, skip chassis based switches */ - if (node->chrecord && - node->chrecord->chassisnum) - continue; - out_switch(node, group, chname); - - for 
(port = node->ports; port; port = port->next, i++) - if (port->remoteport) - out_switch_port(port, group); - } - - } - + ibnd_iter_nodes_type(fabric, switch_iter_func, + IB_NODE_SWITCH, &iter_user_data); } + iter_user_data.group = group; + iter_user_data.skip_chassis_nodes = 0; /* Make pass on CAs */ - for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { + ibnd_iter_nodes_type(fabric, ca_iter_func, IB_NODE_CA, + &iter_user_data); - DEBUG("CA: dist %d node %p", dist, node); - if (!listtype) { - /* Now, skip chassis based CAs */ - if (group && node->chrecord && - node->chrecord->chassisnum) - continue; - out_ca(node, group, chname); - } else { - if (((listtype & LIST_CA_NODE) && (node->type == CA_NODE)) || - ((listtype & LIST_ROUTER_NODE) && (node->type == ROUTER_NODE))) - list_node(node); - continue; - } - - for (port = node->ports; port; port = port->next, i++) - if (port->remoteport) - out_ca_port(port, group); - } - - if (chname) - free(chname); + /* make pass on routers */ + ibnd_iter_nodes_type(fabric, router_iter_func, IB_NODE_ROUTER, + &iter_user_data); return i; } -void dump_ports_report () +void dump_ports_report (ibnd_node_t *node, void *user_data) { - int b, n = 0, p; - Node *node; - Port *port; - - /* - * If switch and LID == 0, search of other switch ports with - * valid LID and assign it to all ports of that switch - */ - for (b = 0; b <= MAXHOPS; b++) - for (node = nodesdist[b]; node; node = node->dnext) - if (node->type == SWITCH_NODE) { - int swlid = 0; - for (p = 0, port = node->ports; - p < node->numports && port && !swlid; - port = port->next) - if (port->lid != 0) - swlid = port->lid; - for (p = 0, port = node->ports; - p < node->numports && port; - port = port->next) - port->lid = swlid; - } - - for (b = 0; b <= MAXHOPS; b++) - for (node = nodesdist[b]; node; node = node->dnext) { - for (p = 0, port = node->ports; - p < node->numports && port; - p++, port = port->next) { - fprintf(stdout, - "%2s %5d %2d 0x%016" PRIx64 " %s %s", - 
node_type_str2(port->node), port->lid, - port->portnum, - port->portguid, - get_linkwidth_str(port->linkwidth), - get_linkspeed_str(port->linkspeed)); - if (port->remoteport) - fprintf(stdout, - " - %2s %5d %2d 0x%016" PRIx64 - " ( '%s' - '%s' )\n", - node_type_str2(port->remoteport->node), - port->remoteport->lid, - port->remoteport->portnum, - port->remoteport->portguid, - remap_node_name(node_name_map, - port->node->nodeguid, - port->node->nodedesc), - remap_node_name(node_name_map, - port->remoteport->node->nodeguid, - port->remoteport->node->nodedesc)); - else - fprintf(stdout, "%36s'%s'\n", "", - remap_node_name(node_name_map, - port->node->nodeguid, - port->node->nodedesc)); - - } - n++; - } + int p = 0; + ibnd_port_t *port = NULL; + + /* for each port */ + for (p = node->numports, port = node->ports[p]; + p > 0; + port = node->ports[--p]) { + uint32_t iwidth, ispeed; + if (port == NULL) + continue; + iwidth = mad_get_field(port->info, 0, IB_PORT_LINK_WIDTH_ACTIVE_F); + ispeed = mad_get_field(port->info, 0, IB_PORT_LINK_SPEED_ACTIVE_F); + fprintf(stdout, + "%2s %5d %2d 0x%016" PRIx64 " %s %s", + ports_nt_str_compat(node), + node->type == IB_NODE_SWITCH ? + node->smalid : port->base_lid, + port->portnum, + port->guid, + dump_linkwidth_compat(iwidth), + dump_linkspeed_compat(ispeed)); + if (port->remoteport) + fprintf(stdout, + " - %2s %5d %2d 0x%016" PRIx64 + " ( '%s' - '%s' )\n", + ports_nt_str_compat(port->remoteport->node), + port->remoteport->node->type == IB_NODE_SWITCH ? 
+ port->remoteport->node->smalid : + port->remoteport->base_lid, + port->remoteport->portnum, + port->remoteport->guid, + port->node->nodedesc, + port->remoteport->node->nodedesc); + else + fprintf(stdout, "%36s'%s'\n", "", + port->node->nodedesc); + } } static int list, group, ports_report; @@ -939,7 +619,7 @@ static int process_opt(void *context, int ch, char *optarg) node_name_map_file = strdup(optarg); break; case 's': - dumplevel = 1; + ibnd_show_progress(1); break; case 'l': list = LIST_CA_NODE | LIST_SWITCH_NODE | LIST_ROUTER_NODE; @@ -968,8 +648,7 @@ static int process_opt(void *context, int ch, char *optarg) int main(int argc, char **argv) { - int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; - ib_portid_t my_portid = {0}; + ibnd_fabric_t *fabric = NULL; const struct ibdiag_opt opts[] = { { "show", 's', 0, NULL, "show more information" }, @@ -996,29 +675,28 @@ int main(int argc, char **argv) timeout = ibd_timeout; if (ibverbose) - dumplevel = 1; + ibnd_debug(1); if (argc && !(f = fopen(argv[0], "w"))) IBERROR("can't open file %s for writing", argv[0]); - srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2); - if (!srcport) - IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); - node_name_map = open_node_name_map(node_name_map_file); - if (discover(&my_portid) < 0) - IBERROR("discover"); - - if (group) - chassis = group_nodes(); + if ((fabric = ibnd_discover_fabric(ibd_ca, ibd_ca_port, ibd_timeout, NULL, -1)) == NULL) { + fprintf(stderr, "discover failed\n"); + exit(1); + } if (ports_report) - dump_ports_report(); + ibnd_iter_nodes(fabric, + dump_ports_report, + NULL); + else if (list) + list_nodes(fabric, list); else - dump_topology(list, group); + dump_topology(group, fabric); + ibnd_destroy_fabric(fabric); close_node_name_map(node_name_map); - mad_rpc_close_port(srcport); exit(0); } -- 1.5.4.5 From sean.hefty at intel.com Fri Apr 3 16:02:16 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 3 Apr 2009 16:02:16 
-0700 Subject: [ofa-general] [PATCH v3 0/3] Create a new library libibnetdisc and convert iblinkinfo and ibnetdiscover to that library. In-Reply-To: <20090403154244.a65227b5.weiny2@llnl.gov> References: <20090403154244.a65227b5.weiny2@llnl.gov> Message-ID: <74883F6A2E3C44958EDA28C22519016E@amr.corp.intel.com> >This new series uses the current master version ibmad to decode the data. If >you accept the mad_*printf functions then I can convert later. For now I want >to get this library in! :-D It would be helpful to check libibnetdisc into a branch in the management.git tree. I need some time to add libibnetdisc to windows. (Where exactly is this library?) - Sean From weiny2 at llnl.gov Fri Apr 3 16:02:49 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 3 Apr 2009 16:02:49 -0700 Subject: [ofa-general] [PATCH] Fix ibidsverify.pl to use the correct cache file Message-ID: <20090403160249.f67e29dd.weiny2@llnl.gov> Sasha, I found this bug when I was testing the libibnetdisc stuff. This applies to the master. 
Ira >From 656ad88a1f3ca6bcd7601b03da1b3822e4091156 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Fri, 3 Apr 2009 16:00:46 -0700 Subject: [PATCH] Fix ibidsverify.pl to use the correct cache file In addition add the -C and -P options for specifying a different HCA and port Signed-off-by: Ira Weiny --- infiniband-diags/scripts/ibidsverify.pl | 19 ++++++++++++------- 1 files changed, 12 insertions(+), 7 deletions(-) diff --git a/infiniband-diags/scripts/ibidsverify.pl b/infiniband-diags/scripts/ibidsverify.pl index 0d017ba..06a8903 100755 --- a/infiniband-diags/scripts/ibidsverify.pl +++ b/infiniband-diags/scripts/ibidsverify.pl @@ -46,16 +46,22 @@ sub usage_and_exit print " -h This help message\n"; print " -R Recalculate ibnetdiscover information (Default is to reuse ibnetdiscover output)\n"; + print " -C use selected Channel Adaptor name for queries\n"; + print " -P use selected channel adaptor port for queries\n"; exit 2; } my $argv0 = `basename $0`; my $regenerate_map = undef; +my $ca_name = ""; +my $ca_port = ""; chomp $argv0; -if (!getopts("hR")) { usage_and_exit $argv0; } +if (!getopts("hRC:P:")) { usage_and_exit $argv0; } if (defined $Getopt::Std::opt_h) { usage_and_exit $argv0; } if (defined $Getopt::Std::opt_R) { $regenerate_map = $Getopt::Std::opt_R; } +if (defined $Getopt::Std::opt_C) { $ca_name = $Getopt::Std::opt_C; } +if (defined $Getopt::Std::opt_P) { $ca_port = $Getopt::Std::opt_P; } sub validate_non_zero_lid { @@ -163,13 +169,12 @@ sub insert_portguid sub main { - if ($regenerate_map - || !(-f "$IBswcountlimits::cache_dir/ibnetdiscover.topology")) - { - generate_ibnetdiscover_topology; - } + my $cache_file = get_cache_file($ca_name, $ca_port); - open IBNET_TOPO, "<$IBswcountlimits::cache_dir/ibnetdiscover.topology" + if ($regenerate_map || !(-f "$cache_file")) { + generate_ibnetdiscover_topology($ca_name, $ca_port); + } + open IBNET_TOPO, "<$cache_file" or die "Failed to open ibnet topology: $!\n"; my $nodetype = ""; -- 1.5.4.5 From weiny2 at 
llnl.gov Fri Apr 3 16:08:07 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Fri, 3 Apr 2009 16:08:07 -0700
Subject: [ofa-general] [PATCH v3 0/3] Create a new library libibnetdisc and convert iblinkinfo and ibnetdiscover to that library.
In-Reply-To: <74883F6A2E3C44958EDA28C22519016E@amr.corp.intel.com>
References: <20090403154244.a65227b5.weiny2@llnl.gov>
 <74883F6A2E3C44958EDA28C22519016E@amr.corp.intel.com>
Message-ID: <20090403160807.d185979e.weiny2@llnl.gov>

On Fri, 3 Apr 2009 16:02:16 -0700
"Sean Hefty" wrote:

> >This new series uses the current master version ibmad to decode the data. If
> >you accept the mad_*printf functions then I can convert later. For now I want
> >to get this library in! :-D
>
> It would be helpful to check libibnetdisc into a branch in the management.git
> tree. I need some time to add libibnetdisc to windows. (Where exactly is this
> library?)

The patch creates a subdirectory in infiniband-diags called libibnetdisc. Is
that what you mean? Unfortunately I don't have a public git tree I can point
you to here at the lab. :-(

As an aside, I was careful to add your ibnetdiscover changes which have
recently been posted to the mailing list.

Ira

--
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
weiny2 at llnl.gov

From sean.hefty at intel.com Fri Apr 3 16:18:27 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 3 Apr 2009 16:18:27 -0700
Subject: [ofa-general] [PATCH v3 0/3] Create a new library libibnetdisc and convert iblinkinfo and ibnetdiscover to that library.
In-Reply-To: <20090403160807.d185979e.weiny2@llnl.gov>
References: <20090403154244.a65227b5.weiny2@llnl.gov>
 <74883F6A2E3C44958EDA28C22519016E@amr.corp.intel.com>
 <20090403160807.d185979e.weiny2@llnl.gov>
Message-ID:

>The patch creates a subdirectory in infiniband-diags called libibnetdisc. Is
>that what you mean? Unfortunately I don't have a public git tree I can point
>you to here at the lab. :-(

My mailer tossed patch 1/3 into my junk mail folder, so I missed the patch for
the actual library itself...

If it's possible, I'd like for Sasha to add these to a branch in his
management.git tree until I can set up the windows build and verify that
everything compiles. I should only need a few days to do this.

- Sean

From weiny2 at llnl.gov Fri Apr 3 17:01:14 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Fri, 3 Apr 2009 17:01:14 -0700
Subject: [ofa-general] [PATCH v3 0/3] Create a new library libibnetdisc and convert iblinkinfo and ibnetdiscover to that library.
In-Reply-To:
References: <20090403154244.a65227b5.weiny2@llnl.gov>
 <74883F6A2E3C44958EDA28C22519016E@amr.corp.intel.com>
 <20090403160807.d185979e.weiny2@llnl.gov>
Message-ID: <20090403170114.dfebcb38.weiny2@llnl.gov>

On Fri, 3 Apr 2009 16:18:27 -0700
"Sean Hefty" wrote:

> >The patch creates a subdirectory in infiniband-diags called libibnetdisc. Is
> >that what you mean? Unfortunately I don't have a public git tree I can point
> >you to here at the lab. :-(
>
> My mailer tossed patch 1/3 into my junk mail folder, so I missed the patch for
> the actual library itself...

Ok, sorry about that.

>
> If it's possible, I'd like for Sasha to add these to a branch in his
> management.git tree until I can set up the windows build and verify that
> everything compiles. I should only need a few days to do this.

Sounds fine to me,
Ira

>
> - Sean
>

--
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
weiny2 at llnl.gov

From vlad at lists.openfabrics.org Sat Apr 4 03:23:16 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sat, 4 Apr 2009 03:23:16 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090404-0200 daily build status
Message-ID: <20090404102316.789E0E60E35@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:

From eli at dev.mellanox.co.il Sat Apr 4 23:50:47 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Sun, 5 Apr 2009 09:50:47 +0300
Subject: [ofa-general] Re: [PATCH v3] mlx4_ib: Optimize hugetlab pages support
In-Reply-To:
References: <20090129152725.GA26284@mtls03>
 <49D0EA11.2040409@Voltaire.COM>
 <20090402084406.GB21370@mtls03>
Message-ID: <20090405065047.GA567@mtls03>

On Thu, Apr 02, 2009 at 11:31:42PM +0300, Yossi Etigin wrote:
> Hi Eli,
>
> I've placed a printk in the new hugetlb function to print n and j.
> While running the mckey test (attached in bugzilla), I got j=0, n=1.
>
> Why do you say that the number of pages must cover HUGE_PAGES?
> In ib_umem_get, hugetlb is set to 0 only if any of the pages is
> not-hugetlb - otherwise it's 1. Am I missing something?
>

umem->hugetlb remains 1 only if ALL of the registered area is backed by huge
pages; otherwise it is cleared and we don't execute handle_hugetlb_user_mr().
In case we do execute handle_hugetlb_user_mr(), I expect the number of
PAGE_SIZE-sized pages to fill whole huge pages.
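
The two conditions Eli describes can be sketched in isolation. This is a minimal illustration under stated assumptions, not the mlx4_ib or ib_umem code: `struct page_info`, `region_is_all_hugetlb()` and `fills_whole_huge_pages()` are hypothetical names, and the 4 KiB / 2 MiB sizes used when calling them are only typical x86_64 values.

```c
#include <stddef.h>

/* Hypothetical per-page record: non-zero if the page is huge-page backed. */
struct page_info {
    int is_huge;
};

/*
 * Mirrors the described behaviour of the hugetlb flag: it starts at 1 and
 * is cleared as soon as any page in the region is not huge-page backed.
 */
static int region_is_all_hugetlb(const struct page_info *pages, size_t npages)
{
    int hugetlb = 1;

    for (size_t i = 0; i < npages; i++)
        if (!pages[i].is_huge)
            hugetlb = 0;
    return hugetlb;
}

/*
 * The expectation stated for the huge-page path: the count of
 * page_size-sized pages is a whole multiple of the pages per huge page.
 */
static int fills_whole_huge_pages(size_t npages, size_t page_size,
                                  size_t huge_page_size)
{
    size_t per_huge = huge_page_size / page_size;

    return per_huge != 0 && npages % per_huge == 0;
}
```

With those assumed sizes, a count of 512 pages fills exactly one huge page, while the n=1 case Yossi reports does not, which is the mismatch under discussion.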

From vlad at lists.openfabrics.org Sun Apr 5 03:22:11 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Sun, 5 Apr 2009 03:22:11 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090405-0200 daily build status
Message-ID: <20090405102212.45A1BE60E08@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:

From acceptany at gmail.com Sun Apr 5 05:57:43 2009
From: acceptany at gmail.com (Jordan)
Date: Sun, 5 Apr 2009 20:57:43 +0800
Subject: [ofa-general] ***SPAM*** problem about the assignment of LID
Message-ID: <91fe68d50904050557u2b57ce6dl9de1804d188555e4@mail.gmail.com>

Hi, I am a newcomer and I have a question about LID assignment. Is a
switch assigned only one LID, or is every port assigned one?
It seems unnecessary to assign a LID to every port.

From dsamson2002 at gmail.com Sun Apr 5 21:48:52 2009
From: dsamson2002 at gmail.com (Drake Samson)
Date: Mon, 6 Apr 2009 00:48:52 -0400
Subject: [ofa-general] ***SPAM*** UD latency higher than RC on AMD quad core blades with ConnectX?
Message-ID: <55084d280904052148l562c1e5egcc73ae04be6ea826@mail.gmail.com>

Hello IB people,

I set up an AMD dual quad-core system recently and ran some IB level
tests. The "ibv_ud_pingpong" and "ibv_rc_pingpong" tests show pretty
different results for UD vs RC (latency is more than double!). I'm
wondering if someone could shed light on the issue? Is there something
that needs to be updated or changed? Has someone else noticed this
phenomenon?

$ numactl --physcpubind=0 --membind=0 ibv_ud_pingpong -s 1024 -d mlx4_0
  local address:  LID 0x003e, QPN 0x2c004a, PSN 0x7426cf
  remote address: LID 0x0045, QPN 0x2e004a, PSN 0x352c7e
2048000 bytes in 0.03 seconds = 609.23 Mbit/sec
1000 iters in 0.03 seconds = 26.89 usec/iter

$ numactl --physcpubind=0 --membind=0 ibv_rc_pingpong -s 1024 -d mlx4_0
  local address:  LID 0x003e, QPN 0x2e004a, PSN 0xf8fcb5
  remote address: LID 0x0045, QPN 0x30004a, PSN 0x221e94
2048000 bytes in 0.01 seconds = 1413.39 Mbit/sec
1000 iters in 0.01 seconds = 11.59 usec/iter

[there is no difference with/without numactl]

Here is the system description:

OS: Red Hat Enterprise Linux Server release 5.2 (Tikanga); kernel 2.6.18-92.el5
Processor: Quad-Core AMD Opteron(tm) Processor 2356
IB software: OFED-1.4
Firmware version: 2.5
Hardware version: 0xA0
Vendor part id: 25418

-D

From ogerlitz at voltaire.com Sun Apr 5 23:49:49 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 06 Apr 2009 09:49:49 +0300
Subject: [ofa-general] [PATCH] IB/mlx4: Use pgprot_writecombine() for BlueFlame pages
In-Reply-To:
References: <15ddcffd0903291006g4b7549cfj1879dd67518f8bff@mail.gmail.com>
 <200903301117.32355.jackm@dev.mellanox.co.il>
 <49D1FD74.9040205@Voltaire.com>
 <49D31884.5020609@Voltaire.com>
 <49D31B4F.3000700@Voltaire.com>
 <49D48811.2080204@voltaire.com>
Message-ID: <49D9A60D.8090909@voltaire.com>

Roland Dreier wrote:
> Those results make sense. The only thing you said that I don't understand:
> > I noted that I have CONFIG_MTRR=y so maybe this can explain the nice latency
> > even without setting X86_PAT?
>
> But you are getting the worse latency without X86_PAT -- you need X86_PAT and
> a patch to use PAT in mlx4_ib to get better latency, which is as I expect. So
> I'm not sure why you talk about "nice" latency without X86_PAT.
>

When I said "nice latency" I referred to the 1.4us number, and the "best
latency" is the 1.1us number - I'm not sure what you referred to in "the
worse latency without X86_PAT" - unless you wanted to say that 1.4us is the
worse... Anyway, as I wrote to you, in one of the benchmarking sessions I did
last week I was getting about 3.5us with the mainline kernel - but I can no
longer reproduce this.

Or.

From olga.shern at gmail.com Mon Apr 6 00:18:03 2009
From: olga.shern at gmail.com (Olga Shern (Voltaire))
Date: Mon, 6 Apr 2009 10:18:03 +0300
Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** problem about the assignment of LID
In-Reply-To: <91fe68d50904050557u2b57ce6dl9de1804d188555e4@mail.gmail.com>
References: <91fe68d50904050557u2b57ce6dl9de1804d188555e4@mail.gmail.com>
Message-ID:

The SM assigns one LID per switch chip, no matter the number of ports.
But for an HCA, the SM assigns a LID per port.

On Sun, Apr 5, 2009 at 3:57 PM, Jordan wrote:
> Hi, I am a newcomer and I have a question about LID assignment. Is a
> switch assigned only one LID, or is every port assigned one?
> It seems unnecessary to assign a LID to every port.
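
The rule in the reply above (one LID per switch, one LID per port on an HCA) can be sketched as a small lookup. The types and names below (`enum node_type`, `struct node`, `struct port`, `effective_lid()`) are hypothetical and chosen only for illustration; the ibnetdiscover patch earlier in this archive makes the same choice with `node->type == IB_NODE_SWITCH ? node->smalid : port->base_lid`.

```c
#include <stdint.h>

/* Hypothetical node kinds, for illustration only. */
enum node_type { NODE_CA, NODE_SWITCH, NODE_ROUTER };

struct node {
    enum node_type type;
    uint16_t switch_lid;    /* single LID shared by the whole switch */
};

struct port {
    const struct node *node;
    uint16_t base_lid;      /* per-port LID, meaningful on non-switch nodes */
};

/*
 * Returns the LID that addresses this port: the node-wide LID for a
 * switch, the port's own LID otherwise.
 */
static uint16_t effective_lid(const struct port *p)
{
    return p->node->type == NODE_SWITCH ? p->node->switch_lid
                                        : p->base_lid;
}
```

Under this sketch every port of a switch reports the same LID, while two ports of the same HCA can report different ones.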
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
>

From vlad at lists.openfabrics.org Mon Apr 6 03:20:01 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Mon, 6 Apr 2009 03:20:01 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090406-0200 daily build status
Message-ID: <20090406102001.C47EBE609E2@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:

From dzieko at wcss.pl Mon Apr 6 03:54:24 2009
From: dzieko at wcss.pl (Pawel Dziekonski)
Date: Mon, 6 Apr 2009 12:54:24 +0200
Subject: [ofa-general] ib_mthca 0000:0d:00.0: Async event 16 for bogus QP 00da0407
In-Reply-To: <200904022007.20630.bs_lists@aakef.fastmail.fm>
References: <200904022007.20630.bs_lists@aakef.fastmail.fm>
Message-ID: <20090406105424.GC6165@cefeid.wcss.wroc.pl>

On Thu, 02 Apr 2009 at 08:07:20PM +0200, Bernd Schubert wrote:
> Hello,
>
> I'm fighting (as usual) with some Lustre problems and I think this time it is
> IB related. In the logs of some systems I see messages like these:
>
> ib_mthca 0000:0d:00.0: Async event 16 for bogus QP 00da0407
>
> Anyone knows what is the meaning of that? The kernel modules are from
> OFED-1.3.1.

Hi Bernd,

we are also using 1.3.1 and Lustre, as you have seen recently at our
site ;-)

I'm getting messages like these only when large computing jobs are
running using IPoIB. I believe that this is an issue with send/receive
buffers, because I see dropped packets on the IPoIB iface. Those jobs
usually work fine ("usually" because this app is buggy itself), so I find
those messages rather harmless.

regards, P
--
Pawel Dziekonski
Wroclaw Centre for Networking & Supercomputing, HPC Department
Politechnika Wr., pl. Grunwaldzki 9, bud. D2/101, 50-377 Wroclaw, POLAND
phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl

From tziporet at dev.mellanox.co.il Mon Apr 6 04:04:59 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 06 Apr 2009 14:04:59 +0300
Subject: [ofa-general] ib_mthca 0000:0d:00.0: Async event 16 for bogus QP 00da0407
In-Reply-To: <200904022007.20630.bs_lists@aakef.fastmail.fm>
References: <200904022007.20630.bs_lists@aakef.fastmail.fm>
Message-ID: <49D9E1DB.5050502@mellanox.co.il>

Bernd Schubert wrote:
> Hello,
>
> I'm fighting (as usual) with some Lustre problems and I think this time it is
> IB related. In the logs of some systems I see messages like these:
>
> ib_mthca 0000:0d:00.0: Async event 16 for bogus QP 00da0407
>
>

This message means the driver got an asynchronous event from the HW for a
QP that was already closed.

Tziporet

From tziporet at dev.mellanox.co.il Mon Apr 6 06:09:10 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 06 Apr 2009 16:09:10 +0300
Subject: [ofa-general] ***SPAM*** UD latency higher than RC on AMD quad core blades with ConnectX?
In-Reply-To: <55084d280904052148l562c1e5egcc73ae04be6ea826@mail.gmail.com>
References: <55084d280904052148l562c1e5egcc73ae04be6ea826@mail.gmail.com>
Message-ID: <49D9FEF6.80001@mellanox.co.il>

Drake Samson wrote:
> Hello IB people,
>
> I set up an AMD dual quad-core system recently and ran some IB level
> tests. The "ibv_ud_pingpong" and "ibv_rc_pingpong" tests show pretty
> different results for UD vs RC (latency is more than double!). I'm
> wondering if someone could shed light on the issue? Is there something
> that needs to be updated or changed? Has someone else noticed this
> phenomenon?
>

I suggest you use the performance utilities to measure performance:
ib_send_bw and ib_send_lat - for RC & UD send
ib_write_bw and ib_write_lat - for RDMA

Tziporet

From tziporet at mellanox.co.il Mon Apr 6 07:12:46 2009
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Mon, 6 Apr 2009 17:12:46 +0300
Subject: [ofa-general] EWG OFED meeting agenda for today - Apr 6, 09
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD024BE557@mtlexch01.mtl.com>

This is the OFED meeting agenda for today (April 6):

a. OFED 1.4.1 release status:

Reminder for OFED 1.4.1 schedule:
* RC1 & RC2 - done
* RC3 - Apr 6 - to be released today
* GA - Apr 23

Critical bugs

Note - we have 18 critical bugs :-(
We should decide whether to fix all of them, in which case the release will
be delayed, or to downgrade some of them to normal.

bug_id  bug_severity  op_sys   assigned_to                   bug_status  short_short_desc
1578    blocker       All      Jeffrey.C.Becker at nasa.gov  NEW         ofa_kernel/include/linux/autoconf.h undefines macros
1538    blocker       Other    vlad at mellanox.co.il        NEW         how to deal with /usr/src/ofa_kernel/include/linux/autoconf.h?
1540    critical      SLES 10  andy.grover at oracle.com     NEW         openib stop hangs on sles10sp2, ppc64 after kernel Oops in rds
1567    critical      All      andy.grover at oracle.com     ASSIGNED    kernel panic while unloading rds module
1571    critical      RHEL 5   jon at opengridcomputing.com  NEW         nfsrdma server crash @test5 connotation basic test,
1574    critical      RHEL 5   jon at opengridcomputing.com  NEW         nfsrdma server run out-of-memory and reboot
1589    critical      RHEL 5   jon at opengridcomputing.com  NEW         FRMR registration errors logged by cxgb3 during NFSRDMA iozone runs
1255    critical      RHEL 5   monis at voltaire.com         NEW         bringing down bonding device hangs
1549    critical      SLES 10  vlad at mellanox.co.il        NEW         kernel-ib / sles 10sp2 / zypper install issue
1568    major         RHEL 4   eli at mellanox.co.il         NEW         mthca driver hangs during unload
1587    major         Other    eli at mellanox.co.il         NEW         Adding and removing pkey from partition.conf cause to machine to hung
1287    major         RHEL 5   jackm at mellanox.co.il       NEW         IPoIB datagram mode initial packet loss
1506    major         RHEL 5   jackm at mellanox.co.il       NEW         driver occasionally fails to start on driver restart
1528    major         RHEL 5   jackm at mellanox.co.il       NEW         IPoIB get stack when running Hadoop application.
1529    major         RHEL 5   jackm at mellanox.co.il       NEW         Opensm cannot be stopped following openib failure.
1545    major         Other    jackm at mellanox.co.il       NEW         Performance degradation in ofed 1.4.1 in TCP BW for some packets size
1570    major         SLES 10  Jeffrey.C.Becker at nasa.gov  NEW         rnfs-utils fails to compile on sles10sp2
1579    major         RHEL 5   jsquyres at cisco.com         ASSIGNED    OpenMPI-1.3.1-1: segfault during close

b. OFED 1.5 release process:

Make sure all participants are in line with the decisions we made in Sonoma
in the OFED/distro discussion:

1) Identify all user components which are not packaged as tarballs and
   notify them that it will be a condition of inclusion in OFED 1.5 that
   they are packaged as tarballs.
2) Ensure that each of the tarballs for OFED 1.5 has a unique version number, and that version numbers are updated appropriately as the tarball contents change. 3) Ensure that each use level component that depends on specific kernel feature actually checks for the existence of that kernel feature, and politely declines to compile/install if that component is not available. 4) Investigate making the OFED install work as non-root - Doug has offered tutorials using BUILD_ROOT and chroot mechanisms to do non-root builds. Tziporet -------------- next part -------------- An HTML attachment was scrubbed... URL: From dledford at redhat.com Mon Apr 6 08:33:31 2009 From: dledford at redhat.com (Doug Ledford) Date: Mon, 6 Apr 2009 11:33:31 -0400 Subject: [ofa-general] MVAPICH_GCC seems incompatible with (RedHat)gcc-4.1.2-44 In-Reply-To: <49D4D835.2010908@clustervision.com> References: <49D46FC3.4050704@clustervision.com> <20090402150411.GI3078@cse.ohio-state.edu> <49D4D835.2010908@clustervision.com> Message-ID: On Apr 2, 2009, at 11:22 AM, Guido Passet wrote: > Hi Jonathan, many thanks for looking in to this. > > Both mvapich-1.1 and mvapich2-1.2p1 seem to build fine from source. > My guess is that the logic in the OFED wrapper build scripts are > somehow > confused. Most likely the problem has nothing to do with the gcc version. The Red Hat supplied mvapich and mvapich2 packages (new for RHEL5.3) are not named mvapich_gcc and mvapich2_gcc, but instead just plain old mvapich and mvapich2. Since we don't ship anything but the gcc version, the distinction is moot in our packages. > > Best regards, > Guido Passet. > > Jonathan Perkins wrote: >> Guido: >> Thanks for reporting this issue. I'll take a look into the install >> process and see if I can find any logic that may be leading to this >> behavior. >> >> While I'm looking into this, can you also try installing our packages >> directly to see if this works. 
You can find our tarballs from the >> following urls: >> http://mvapich.cse.ohio-state.edu/download/mvapich/ >> http://mvapich.cse.ohio-state.edu/download/mvapich2/ >> >> On Thu, Apr 02, 2009 at 09:56:51AM +0200, Guido Passet wrote: >>> Dear OpenFabrics list, >>> >>> >>> RedHat and spinoffs recently shipped v5.3 and many RedHat alike >>> distributions like Scientific Linux picked up the new flood of >>> packages. >>> >>> One curiosity I am trying to debug is the fact that MVAPICH/ >>> MVAPICH2 do >>> not seem to compile anymore after the update from: >>> >>> gcc-4.1.2-42.el5.x86_64 (RedHat 5.2) >>> >>> to: >>> >>> gcc-4.1.2-44.el5.x86_64 (RedHat 5.3) >>> >>> On running ./install.pl --build32 --all >>> >>> I get: >>> >>> mvapich_gcc is not available on this platform >>> mvapich2_gcc is not available on this platform >>> >>> and no rpms are being build.. >>> >>> While all other compilers (PGI/Pathscale and Intel) seem to work >>> fine. >>> >>> I tried OFED-1.4.1-20090401-0600, OFED-1.4.1-20090319-0600, >>> OFED-1.4-20090301-0600 and a couple more but I seem to be in >>> trouble on >>> all versions. >>> >>> Any pointers into a direction to get these stack parts compiled >>> against >>> GCC would be more than welcome. >>> >>> Cheers, >>> Guido. 
>>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> > > > -- > Guido Passet Email: guido.passet at clustervision.com > Engineering Manager ClusterVision BV > Nieuw-Zeelandweg 15B Web: http://www.clustervision.com > 1045 AL Amsterdam Tel: +31 20 407 7550 > The Netherlands Fax: +31 84 759 8389 > KvK Amsterdam 30184312 VAT/BTW NL8117.05.195.B01 > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford InfiniBand Specific RPMS http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 203 bytes Desc: This is a digitally signed message part URL: From guido.passet at clustervision.com Mon Apr 6 08:44:14 2009 From: guido.passet at clustervision.com (Guido Passet) Date: Mon, 06 Apr 2009 17:44:14 +0200 Subject: [ofa-general] MVAPICH_GCC seems incompatible with (RedHat)gcc-4.1.2-44 In-Reply-To: References: <49D46FC3.4050704@clustervision.com> <20090402150411.GI3078@cse.ohio-state.edu> <49D4D835.2010908@clustervision.com> Message-ID: <49DA234E.1050507@clustervision.com> Dear Doug, Doug Ledford wrote: > On Apr 2, 2009, at 11:22 AM, Guido Passet wrote: > >> Hi Jonathan, many thanks for looking in to this. >> >> Both mvapich-1.1 and mvapich2-1.2p1 seem to build fine from source. >> My guess is that the logic in the OFED wrapper build scripts are somehow >> confused. > > Most likely the problem has nothing to do with the gcc version. 
The Red > Hat supplied mvapich and mvapich2 packages (new for RHEL5.3) are not > named mvapich_gcc and mvapich2_gcc, but instead just plain old mvapich > and mvapich2. Since we don't ship anything but the gcc version, the > distinction is moot in our packages. You could be right about this not being an GCC version problem, however I am not using the RedHat supplied packages but I am trying to build all mvapich/mvapich2/openmpi against all available compilers using the OFED stack. Best regards, Guido. From dledford at redhat.com Mon Apr 6 08:52:36 2009 From: dledford at redhat.com (Doug Ledford) Date: Mon, 6 Apr 2009 11:52:36 -0400 Subject: [ofa-general] MVAPICH_GCC seems incompatible with (RedHat)gcc-4.1.2-44 In-Reply-To: <49DA234E.1050507@clustervision.com> References: <49D46FC3.4050704@clustervision.com> <20090402150411.GI3078@cse.ohio-state.edu> <49D4D835.2010908@clustervision.com> <49DA234E.1050507@clustervision.com> Message-ID: <5C672C59-C867-4F0E-B49E-191112EDFB52@redhat.com> On Apr 6, 2009, at 11:44 AM, Guido Passet wrote: > Dear Doug, > > Doug Ledford wrote: >> On Apr 2, 2009, at 11:22 AM, Guido Passet wrote: >> >>> Hi Jonathan, many thanks for looking in to this. >>> >>> Both mvapich-1.1 and mvapich2-1.2p1 seem to build fine from source. >>> My guess is that the logic in the OFED wrapper build scripts are >>> somehow >>> confused. >> >> Most likely the problem has nothing to do with the gcc version. >> The Red >> Hat supplied mvapich and mvapich2 packages (new for RHEL5.3) are not >> named mvapich_gcc and mvapich2_gcc, but instead just plain old >> mvapich >> and mvapich2. Since we don't ship anything but the gcc version, the >> distinction is moot in our packages. > > You could be right about this not being an GCC version problem, > however > I am not using the RedHat supplied packages but I am trying to build > all > mvapich/mvapich2/openmpi against all available compilers using the > OFED > stack. Right. 
My point was if the OFED install.pl script thinks that having mvapich installed means installing mvapich_gcc, as you probably already have Red Hat's plain mvapich installed, there is a good chance you might get file conflicts or other problems attempting to install mvapich_gcc, so it would never get installed, so install.pl would crash out. If you change install.pl to expect just plain mvapich/ mvapich2 for the gcc versions, it might just magically start working (although using our packages, not the OFED packages). -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford InfiniBand Specific RPMS http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 203 bytes Desc: This is a digitally signed message part URL: From guido.passet at clustervision.com Mon Apr 6 08:56:49 2009 From: guido.passet at clustervision.com (Guido Passet) Date: Mon, 06 Apr 2009 17:56:49 +0200 Subject: [ofa-general] MVAPICH_GCC seems incompatible with (RedHat)gcc-4.1.2-44 In-Reply-To: <5C672C59-C867-4F0E-B49E-191112EDFB52@redhat.com> References: <49D46FC3.4050704@clustervision.com> <20090402150411.GI3078@cse.ohio-state.edu> <49D4D835.2010908@clustervision.com> <49DA234E.1050507@clustervision.com> <5C672C59-C867-4F0E-B49E-191112EDFB52@redhat.com> Message-ID: <49DA2641.3000604@clustervision.com> Doug Ledford wrote: > On Apr 6, 2009, at 11:44 AM, Guido Passet wrote: >> Dear Doug, >> >> Doug Ledford wrote: >>> On Apr 2, 2009, at 11:22 AM, Guido Passet wrote: >>> >>>> Hi Jonathan, many thanks for looking in to this. >>>> >>>> Both mvapich-1.1 and mvapich2-1.2p1 seem to build fine from source. >>>> My guess is that the logic in the OFED wrapper build scripts are >>>> somehow >>>> confused. >>> >>> Most likely the problem has nothing to do with the gcc version. 
The Red >>> Hat supplied mvapich and mvapich2 packages (new for RHEL5.3) are not >>> named mvapich_gcc and mvapich2_gcc, but instead just plain old mvapich >>> and mvapich2. Since we don't ship anything but the gcc version, the >>> distinction is moot in our packages. >> >> You could be right about this not being an GCC version problem, however >> I am not using the RedHat supplied packages but I am trying to build all >> mvapich/mvapich2/openmpi against all available compilers using the OFED >> stack. > > Right. My point was if the OFED install.pl script thinks that having > mvapich installed means installing mvapich_gcc, as you probably already > have Red Hat's plain mvapich installed, there is a good chance you might > get file conflicts or other problems attempting to install mvapich_gcc, > so it would never get installed, so install.pl would crash out. If you > change install.pl to expect just plain mvapich/mvapich2 for the gcc > versions, it might just magically start working (although using our > packages, not the OFED packages). I am aware of the available RedHat packages which I have explicitly excluded from being installed. I am in need of a pure OFED based stack. At this moment the install.pl is not even trying to build it... Best regards, Guido. 
From dledford at redhat.com Mon Apr 6 08:58:19 2009 From: dledford at redhat.com (Doug Ledford) Date: Mon, 6 Apr 2009 11:58:19 -0400 Subject: [ofa-general] MVAPICH_GCC seems incompatible with (RedHat)gcc-4.1.2-44 In-Reply-To: <49DA2641.3000604@clustervision.com> References: <49D46FC3.4050704@clustervision.com> <20090402150411.GI3078@cse.ohio-state.edu> <49D4D835.2010908@clustervision.com> <49DA234E.1050507@clustervision.com> <5C672C59-C867-4F0E-B49E-191112EDFB52@redhat.com> <49DA2641.3000604@clustervision.com> Message-ID: <15287E79-559A-46CF-B9A1-A4E58EF2AF0F@redhat.com> On Apr 6, 2009, at 11:56 AM, Guido Passet wrote: > I am aware of the available RedHat packages which I have explicitly > excluded from being installed. I am in need of a pure OFED based > stack. > At this moment the install.pl is not even trying to build it... OK, then you can ignore my suggestions as they aren't applicable. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford InfiniBand Specific RPMS http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 203 bytes Desc: This is a digitally signed message part URL: From yosefe at voltaire.com Mon Apr 6 10:49:43 2009 From: yosefe at voltaire.com (Yossi Etigin) Date: Mon, 06 Apr 2009 20:49:43 +0300 Subject: [ofa-general] Re: [PATCH v3] mlx4_ib: Optimize hugetlab pages support In-Reply-To: <20090405065047.GA567@mtls03> References: <20090129152725.GA26284@mtls03> <49D0EA11.2040409@Voltaire.COM> <20090402084406.GB21370@mtls03> <20090405065047.GA567@mtls03> Message-ID: <49DA40B7.3040004@voltaire.com> Eli Cohen wrote: > On Thu, Apr 02, 2009 at 11:31:42PM +0300, Yossi Etigin wrote: >> Hi Eli, >> >> I've placed a printk in the new hugetlb function to print n and j. >> While running the mckey test (attached in bugzilla), I got j=0, n=1. >> >> Why do you say that the number of pages must cover HUGE_PAGES? 
>> In ib_umem_get, hugetlb is set to 0 only if any of the pages is
>> not-hugetlb - otherwise it's 1. Am I missing something?
>>
>
> Only if ALL of the registered area is huge pages, then umem->hugetlb
> will remain 1; otherwise it is cleared and we don't execute
> handle_hugetlb_user_mr(). In case we do execute
> handle_hugetlb_user_mr(), I expect the number of PAGE_SIZEed pages to
> fill full huge pages.

I don't understand - if all area is huge pages, it does not mean that it fills full huge pages - I can have just 4096 bytes in huge page memory and umem->hugetlb will remain 1, right?

From faisal.latif at intel.com Mon Apr 6 12:16:20 2009
From: faisal.latif at intel.com (Faisal Latif)
Date: Mon, 6 Apr 2009 14:16:20 -0500
Subject: [ofa-general] [PATCH 1/2] RDMA/nes: fix error handling issues
Message-ID: <20090406191620.GA9048@flatif-MOBL>

Fix following 3 issues.
(1) Check if cm_node was successfully created for loopback connection.
(2) in schedule_nes_timer(), it does not free up the allocated memory
    after encountering an error. There is WARN_ON() for this condition.
(3) there is a cm_node->freed flag which is set but not used.
Reported-by: Dan Carpenter
Signed-off-by: Faisal Latif
---
 drivers/infiniband/hw/nes/nes_cm.c |    8 ++++++--
 drivers/infiniband/hw/nes/nes_cm.h |    1 -
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c
index 5242515..572231c 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -426,6 +426,7 @@ int schedule_nes_timer(struct nes_cm_node *cm_node, struct sk_buff *skb,
 	if (type == NES_TIMER_TYPE_CLOSE) {
 		new_send->timetosend += (HZ/10);
 		if (cm_node->recv_entry) {
+			kfree(new_send);
 			WARN_ON(1);
 			return -EINVAL;
 		}
@@ -1262,7 +1263,6 @@ static int rem_ref_cm_node(struct nes_cm_core *cm_core,
 		cm_node->nesqp = NULL;
 	}
 
-	cm_node->freed = 1;
 	kfree(cm_node);
 	return 0;
 }
@@ -1999,13 +1999,17 @@ static struct nes_cm_node *mini_cm_connect(struct nes_cm_core *cm_core,
 		if (loopbackremotelistener == NULL) {
 			create_event(cm_node, NES_CM_EVENT_ABORTED);
 		} else {
-			atomic_inc(&cm_loopbacks);
 			loopback_cm_info = *cm_info;
 			loopback_cm_info.loc_port = cm_info->rem_port;
 			loopback_cm_info.rem_port = cm_info->loc_port;
 			loopback_cm_info.cm_id = loopbackremotelistener->cm_id;
 			loopbackremotenode = make_cm_node(cm_core, nesvnic,
 				&loopback_cm_info, loopbackremotelistener);
+			if (!loopbackremotenode) {
+				rem_ref_cm_node(cm_node->cm_core, cm_node);
+				return NULL;
+			}
+			atomic_inc(&cm_loopbacks);
 			loopbackremotenode->loopbackpartner = cm_node;
 			loopbackremotenode->tcp_cntxt.rcv_wscale =
 				NES_CM_DEFAULT_RCV_WND_SCALE;
diff --git a/drivers/infiniband/hw/nes/nes_cm.h b/drivers/infiniband/hw/nes/nes_cm.h
index d5f7782..80bba18 100644
--- a/drivers/infiniband/hw/nes/nes_cm.h
+++ b/drivers/infiniband/hw/nes/nes_cm.h
@@ -298,7 +298,6 @@ struct nes_cm_node {
 	struct nes_vnic *nesvnic;
 	int apbvt_set;
 	int accept_pend;
-	int freed;
 	struct list_head timer_entry;
 	struct list_head reset_entry;
 	struct nes_qp *nesqp;
-- 
1.5.3.3

From d.katz at ieee.org Mon Apr 6 11:22:32 2009
From: d.katz at ieee.org
(Daniel S. Katz)
Date: Mon, 6 Apr 2009 14:22:32 -0400
Subject: [ofa-general] ***SPAM*** [hpc-announce] CFP: Cluster 2009 papers, posters, and workshop papers for: Workshop on Interfaces and Architectures for Scientific Data Storage (IASDS09); The impact and influence of Web 2.0 on e-Research Infrastructure, Services and Applications; Workshop on High Performance Interconnects for Distributed Computing (HPI-DC’09); Parallel Programming on Accelerator Clusters
References:
Message-ID: <5A0CB93D-39DC-41F4-86ED-BFA0112AC587@ieee.org>

Hi,

Below this are a number of CFPs associated with the IEEE International Conference on Cluster Computing (Cluster 2009) conference, to be held in New Orleans, Louisiana, 31 August - 4 September 2009.

Specifically, included below are the CFPs for:

Papers for the conference itself.
Posters for the conference
Papers for Workshop on Interfaces and Architectures for Scientific Data Storage (IASDS09)
Papers for Workshop: The impact and influence of Web 2.0 on e-Research Infrastructure, Services and Applications
Papers for Workshop on High Performance Interconnects for Distributed Computing (HPI-DC’09)
Papers for Workshop: Parallel Programming on Accelerator Clusters

I apologize in advance for any duplicate mailings.

*********************************************************************
Call for Papers
2009 IEEE International Conference on Cluster Computing (Cluster 2009)
http://www.cluster2009.org/
29 August - 4 September 2009
New Orleans, Louisiana, USA
**********************************************************************

Cluster 2009 welcomes paper and poster submissions on innovative work from researchers in academia, industry, and government, describing original research in the field of cluster computing.
Topics of interest include, but are not limited to: - Cluster Architecture and Hardware Systems - Node architectures - Packaging, Power, and Cooling - Cluster Software and Middleware - Software Environments and Tools - Single-System Image Services - Parallel File Systems and I/O Libraries - Standard Software for Clusters - Virtualization - Cluster Networking - High-Speed Interconnects - High Performance Message Passing Libraries - Lightweight Communication Protocols - Implications of Multicore and Clouds on Clusters - Hardware Architecture - Software and Tools - Networking - Management - Applications - Applications - Application Methods and Algorithms - Adaptation to Multicore - Data Distribution, Load Balancing & Scaling - MPI/OpenMP Hybrid Computing - Visualization - Performance Analysis and Evaluation - Benchmarking & Profiling Tools - Performance Prediction & Modeling - Cluster Management - Security and Reliability - High Availability Solutions - Resource and Job Management For submitting and formatting instructions, see the conference web site: http://www.cluster2009.org/ Important Dates: Tutorial proposal deadline: 31 March 2009 Technical paper submissions: 14 April 2009 Tutorial notification: 31 May 2009 Technical paper notification: 5 June 2009 Poster submissions: 12 June 2009 Poster notification: 17 July 2009 Poster camera ready deadline: 31 July 2009 Paper camera ready deadline: 31 July 2009 Conference Organizing Chairs and Committees: General Chair Daniel S. Katz, University of Chicago, USA General Vice Chair Mark Baker, University of Reading, UK Program Chair Thomas Sterling, Louisiana State University, USA Program Vice Chairs Pete Beckman, Argonne National Lab, USA William Camp, Intel, USA Jack Dongarra, University of Tennessee at Knoxville, USA William Gropp, University of Illinois, USA Satoshi Matsuoka, Tokyo Institute of Technology, Japan Bart Miller, Univ. 
of Wisconsin, USA Poster Co-Chairs Sushil Prasad, Georgia State University, USA Eric Aubanel, University of New Brunswick, Canada Workshops Chair Wu Feng, Virginia Tech, USA Tutorials Co-Chairs Robert Ferraro, Jet Propulsion Laboratory, USA Bryan Biegel, NASA Ames, USA Proceedings Chair Ron Brightwell, Sandia National Laboratories, USA Publicity Co-Chairs Omer Rana, Cardiff University, UK Feilong Tang, Shanghai Jiao Tong University, China Tevfik Kosar, Louisiana State University, USA Finance Chair Box Leangsuksun, Louisiana Tech University, USA Sponsor and Exhibitors Co-Chairs George Jones, Data Direct Networks, USA Charlie McMahon, Tulane University, USA Ali Butt, Virginia Tech, USA Local Arrangements Chair Karen Jones, Louisiana State University, USA PR/Graphics Chair Kristen Sunde, Louisiana State University, USA IEEE Cluster 2009 Call for Posters http://www.cluster2009.org/cfposters.php Scope: The 2009 IEEE International Conference on Cluster Computing (Cluster 2009) will be held in New Orleans, Louisiana, USA, from August 29 to September 4, 2009. Authors are invited to present their on-going or recently finished work related to cluster computing in the poster session of Cluster 2009. All accepted poster papers will be included in the Cluster 2009 proceedings. 
Topics include, but are not limited to the following; - Novel Architectures - multicore CPUs - GPUs - FPGA accelerators - Cluster Software and Middleware - System Tools for "Green" Computing - Software Environments and Tools - Single-System Image Services - Parallel File Systems and I/O Libraries - Standard Software for Clusters - Cluster Networking - High-Speed Interconnects - High Performance Message Passing Libraries - Lightweight Communication Protocols - Applications - Application Methods and Algorithms - Adaptation to Multi-Core - Data Distribution, Load Balancing & Scaling - MPI/OpenMP Hybrid Computing - Visualization - Performance Analysis and Evaluation - Benchmarking & Profiling Tools - Performance Prediction & Modeling - Cluster Management - Virtualization Technology - Security and Reliability - High Availability Solutions - Resource and Job Management - Administration and Maintenance Tools Submission, Format, Publication: Consideration for a Cluster 2009 Poster will be based on the submission of a paper (maximum 4 pages). It must follow the same format as the technical papers (see the Call for Papers announcement). Submissions must be made via the Poster Submission link at http://www.easychair.org/conferences/?conf=cluster09posters. All accepted poster papers will be included in the Cluster 2009 proceedings. Important dates: Submission: June 12 2009 Acceptance notification: July 17 2009 Contact address: Please contact the Poster co-chairs for further information: Eric Aubanel, University of New Brunswick, Canada: aubanel at unb.ca Sushil K. 
Prasad, Georgia State University, USA: sprasad at gsu.edu Workshop on Interfaces and Architectures for Scientific Data Storage (IASDS09) http://www.mcs.anl.gov/events/workshops/iasds09/ Held in conjunction with IEEE Cluster 2009 http://www.cluster2009.org Friday September 4, 2009 New Orleans, LA, USA Call for Papers High-performance computing simulations and large scientific experiments such as those in high energy physics generate tens of terabytes of data, and these data sizes grow each year. Existing systems for storing, managing, and analyzing data are being pushed to their limits by these applications, and new techniques are necessary to enable efficient data processing for future simulations and experiments. The purpose of this workshop is to provide a forum for engineers and scientists to present and discuss their most recent work related to the storage, management, and analysis of data for scientific workloads. Emphasis is placed on forward-looking approaches to tackle the challenges of storage at extreme scale or to provide better abstractions for use in scientific workloads. Topics of interest include, but are not limited to: - parallel file systems - scientific databases - active storage - scientific I/O middleware - extreme scale storage Registration fee for the workshop will be included as part of the Cluster 2009 conference registration fee. Paper Submission: Papers must be submitted via our EasyChair web site in PDF format: http://www.easychair.org/conferences/?conf=iasds09 Workshop papers will be peer-reviewed and will appear as part of the IEEE Cluster 2009 proceedings. Submissions must follow the Cluster 2009 format: Since the camera-ready version of accepted papers must be compliant with the IEEE Xplore format for publication, submitted papers must conform to the following Xplore layout, page limit, and font size. This will insure a size consistency and a uniform layout for the reviewers. 
(With minimal changes, accepted document can be styled for publication according to Xplore requirements explained in the Xplore formatting guide, which is also in Xplore format). - Maximum 10 pages - Single-spaced - 8.5x11-inch, Two-column numbered pages in IEEE Xplore format - Format instructions are available at: - IEEE Paper LaTeX Template (ZIP file) http://www.cluster2009.org/IEEE_Paper_LaTeX_Template_LETTER_V3.zip - IEEE Paper Word Template (ZIP file) http://www.cluster2009.org/IEEE_Paper_Word_Template_LETTER_V3.zip In the event of problems with paper submission, please contact Robert Ross (rross at mcs.anl.gov). Important Dates Paper Submission Deadline June 12, 2009 (firm deadline) Author Notification July 10, 2009 Final Manuscript July 31, 2009 Workshop September 4, 2009 Program Committee Robert Ross, Argonne National Laboratory Jacek Becla, SLAC National Accelerator Laboratory Evan Felix, Pacific Northwest National Laboratory Gary Grider, Los Alamos National Laboratory Quincey Koziol, The HDF Group Wei-Keng Liao, Northwestern University Carlos Maltzahn, University of California Santa Cruz Doron Rotem, Lawrence Berkeley National Laboratory Lee Ward, Sandia National Laboratories Kesheng Wu, Lawrence Berkeley National Laboratory Workshop: The impact and influence of Web 2.0 on e-Research Infrastructure, Services and Applications http://research3.org/events/workshop_one.php IEEE International Conference on Cluster Computing (Cluster 2009) Friday 4th September 2009, New Orleans, Louisiana, USA Call for papers The number of Web 2.0 services and applications, widely used by Internet users, academics, industry and enterprise, are growing rapidly, which demonstrates its solid foundations. These technologies and services are based on the open standards that underpin the Internet and Web, and are used in many forms, e.g. blogs, wikis, mashups, social websites, podcasting and content tagging. 
This field is having a significant impact on distributed infrastructure and applications, and on the way users and developers interact. It is important to understand the influence of this theme because Web 2.0 is providing endless opportunities in academia; the general public, which in turn is driving the business agenda in enterprises and industry, is increasingly using it. This workshop aims to deliver a greater understanding of the influence and changes to be expected regarding e-Research infrastructure, applications and the way users and developers interact. Topics of Interest: - Infrastructure and Services - The use of Cloud-based services, - Using RESTful services, - The applications of mash-ups and using other Web 2.0 technologies, - Web 2.0 security versus existing security, - Using virtualisation technologies, - Compare and contrast existing services with emerging Web 2.0 technologies, - Combining the Semantic Web with Web 2.0. - Applications - Using data and services mash-ups, - The development and use of gadgets with applications, - Adapting applications to more usable and user friendly, - Collaboration, joint development and the integration of social web sites, - Using blogs, wikis and other Web 2.0 technologies with applications, - Using Web 2.0 technologies with applications. Paper submission: Paper Format: Since the camera-ready version of accepted papers must be compliant with the IEEE Xplore format for publication, submitted papers must conform to the following Xplore layout, page limit, and font size. This will insure a size consistency and a uniform layout for the reviewers. (With minimal changes, accepted document can be styled for publication according to Xplore requirements explained in the Xplore formatting guide, which is also in Xplore format). - PDF files only. 
- Maximum 10 pages - Single-spaced - 8.5x11-inch, Two-column numbered pages in IEEE Xplore format - Format instructions are available at: - IEEE Paper LaTeX Template (ZIP file) http://www.cluster2009.org/IEEE_Paper_LaTeX_Template_LETTER_V3.zip - IEEE Paper Word Template (ZIP file) http://www.cluster2009.org/IEEE_Paper_Word_Template_LETTER_V3.zip Important Dates: - Workshop paper submissions: 12 June 2009 - Workshop paper notification: 15 July 2009 - Workshop paper camera ready deadline: 31 July 2009 Workshop papers should be submitted via EasyChair (see: http://research3.org/events/workshop_one.php for a link to the submission page.). If you have any problems with paper submission, please contact Mark DOT Baker AT Computer DOT Org. Programme Committee: - Mark Baker, University of Reading, UK - Dave de Roure, University of Southampton, UK - Carole Goble, University of Manchester, UK - Daniel S. Katz, University of Chicago, USA - Paul Watson, Newcastle University, UK - Geoffrey Fox, Indiana University, USA - Rob Allan, Daresbury Laboratory, UK - Dirk Neumann, Albert-Ludwigs-Universitat Freiburg, Germany - Richard Sinnott, University of Glasgow, UK - Andy Turner, University of Leeds, UK - Thomas Fahringer, University of Innsbruck, Austria - Jon Blower, University of Reading, UK - Jeremy Frey, University of Southampton, UK - Stuart Dunn, King College London, UK - Claire Warwick, UCL Department of Information Studies, UK Workshop on High Performance Interconnects for Distributed Computing (HPI-DC’09) http://www.cercs.gatech.edu/hpidc2009 in conjunction with Cluster 2009 August 31, 2009 New Orleans, Louisiana Call for Papers The emergence of 10.0 GigE and above, InfiniBand and other high- performance interconnection technologies, programmable NICs and networking platforms, and protocols like DDP and RDMA over IP, make it possible to create tightly linked systems across physical distances that exceed those of traditional single cluster or server systems. 
These technologies can deliver communication capabilities that achieve
the performance levels needed by high end applications in enterprise
systems and like those produced by the high performance computing
community. Furthermore, the manycore nature of next generation
platforms and the creation of distributed cloud computing
infrastructure will greatly increase the demand for high performance
communication capabilities over wide area distances.

The purpose of this workshop is to explore the confluence of
distributed computing and communications technologies with high
performance interconnects, as applicable or applied to realistic high
end applications. The intent is to create a venue that will act as a
bridge between researchers developing tools and platforms for
high-performance distributed computing, end user applications seeking
high performance solutions, and technology providers aiming to improve
interconnect and networking technologies for future systems. The hope
is to foster knowledge creation and intellectual interchanges between
HPC and Cloud computing end users and technology developers, in the
specific domain of high performance distributed interconnects.

Topics of interest include but are not limited to:
- Hardware/software architectures for communication infrastructures
  for HPC and Cloud Computing
- Data and control protocols for interactive and large data volume
  applications
- Novel devices and technologies to enhance interconnect properties
- Interconnect-level issues when extending high performance beyond
  single machines, including architecture, protocols, services, QoS,
  and security
- Remote storage (like iSCSI), remote databases, and datacenters, etc.
- Development tools, programming environments and models (like PGAS,
  OpenShmem, Hadoop, etc.), ranging from programming language support
  to simulation environments.

IMPORTANT DATES
Paper submission: June 5th, 2009
Notification: July 10th, 2009
Final manuscript: July 29th, 2009
Workshop date: Aug. 31st, 2009

ORGANIZATION
General Chair:
  Steve Poole, Oak Ridge National Lab
Program Co-Chairs:
  Pavan Balaji, Argonne National Lab
  Ada Gavrilovska, Georgia Institute of Technology


Workshop: Parallel Programming on Accelerator Clusters
http://www.checs.eng.vt.edu/ppac09
Workshop held in conjunction with IEEE Cluster 2009
August 31, 2009, New Orleans

CALL FOR PAPERS

Accelerator-based parallel architectures, heterogeneous multi-core
processors, FPGAs and GPUs, are rapidly becoming mainstream components
for cluster computing. While the potential of accelerators for
addressing problems such as memory latency and bandwidth has long been
recognized, accelerators present several challenges to programmers and
cluster systems designers. The PPAC workshop will bring together
researchers from academia and industry to discuss the latest
developments in parallel programming models, languages, tools, system
software and application development for clusters based on
accelerators.

Topics of interest include, but are not limited to:
- Programming models for accelerator-based clusters, including systems
  using GPUs, heterogeneous multi-core processors and FPGAs.
- Compiler and runtime support for accelerator-based clusters,
  including support for scheduling, communication and memory
  management.
- Operating systems and virtualization support for accelerator-based
  clusters.
- Performance evaluation of accelerator-based clusters, in particular
  with relation to conventional homogeneous clusters.
- Performance analysis, profiling and debugging tools.
- Application studies on accelerator-based clusters.
- Power and energy-efficiency of accelerator-based clusters.

Submissions
Papers due: June 15, 2009
Author notification: July 15, 2009
Final papers due: July 31, 2009

Submitted manuscripts (8 pages max) should be formatted according to
IEEE Cluster proceedings guidelines. See workshop homepage for further
details.

Organizers
Dimitrios S. Nikolopoulos, Virginia Tech
Calvin Ribbens, Virginia Tech

Program Committee
Kevin Barker, Los Alamos National Lab
Filip Blagojevic, Lawrence Berkeley National Lab
Mike Heroux, Sandia National Lab
Sven Karlsson, Technical Univ. Denmark
Jakub Kurzak, University of Tennessee
Jeffrey Vetter, Oak Ridge National Lab

From faisal.latif at intel.com Mon Apr 6 12:28:52 2009
From: faisal.latif at intel.com (Faisal Latif)
Date: Mon, 6 Apr 2009 14:28:52 -0500
Subject: [ofa-general] [PATCH 2/2] RDMA/nes: fix nes_nic_cm_xmit() error handling
Message-ID: <20090406192852.GA9316@flatif-MOBL>

We are seeing crashes or hangs when running network cable pull tests
during RDMA traffic. In schedule_nes_timer(), we return an error if
nes_nic_cm_xmit() fails. This is changed to return success, since the
skb is put on the timer routines to be processed later. In the
send_syn() case, we were indicating connect failure twice: once from
nes_connect() and again when the rexmit retries expire.

The other issue is skb->users, which we increment before calling
nes_nic_cm_xmit(), which in turn calls dev_queue_xmit(). On failure we
were decrementing skb->users while at the same time putting the skb on
the rexmit path. Even if dev_queue_xmit() fails, skb->users has already
been decremented. We are removing the decrement of skb->users on
failure from both schedule_nes_timer() and nes_cm_timer_tick().

The extra check in nes_cm_timer_tick() that breaks out of the loop on
rexmit failure is also removed. It caused problems because the
remaining nodes had their cm_node->ref_count incremented and were never
processed.
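
The reference-count point in the last paragraph can be modelled in a few
lines. The following is only a sketch, written in Python rather than the
driver's C, and it assumes a timer loop that first takes a reference on
every node and then processes the nodes one by one. Node, timer_tick and
leaked are hypothetical names chosen for illustration, not code from
nes_cm.c.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a connection node with a reference count."""
    name: str
    xmit_ok: bool = True   # whether the simulated retransmit succeeds
    ref_count: int = 1     # the reference held by the list itself


def timer_tick(nodes, break_on_failure):
    """Model of a two-pass timer loop; returns the names that were processed."""
    # Pass 1: pin every node so it cannot go away while it is being handled.
    for node in nodes:
        node.ref_count += 1

    processed = []
    # Pass 2: handle each node, then drop the reference taken in pass 1.
    for node in nodes:
        ok = node.xmit_ok              # stand-in for the retransmit attempt
        processed.append(node.name)
        node.ref_count -= 1
        if not ok and break_on_failure:
            break                      # later nodes keep their extra reference
    return processed


def leaked(nodes):
    """Names of nodes still holding the reference taken in pass 1."""
    return [n.name for n in nodes if n.ref_count != 1]
```

With break_on_failure=True, every node after the first failed retransmit
keeps its extra reference, which is the situation the description says the
removed check could cause.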

Signed-off-by: Faisal Latif
---
 drivers/infiniband/hw/nes/nes_cm.c |    8 +-------
 1 files changed, 1 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c
index 572231c..ba07852 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -446,8 +446,8 @@ int schedule_nes_timer(struct nes_cm_node *cm_node, struct sk_buff *skb,
 		if (ret != NETDEV_TX_OK) {
 			nes_debug(NES_DBG_CM, "Error sending packet %p "
 				"(jiffies = %lu)\n", new_send, jiffies);
-			atomic_dec(&new_send->skb->users);
 			new_send->timetosend = jiffies;
+			ret = NETDEV_TX_OK;
 		} else {
 			cm_packets_sent++;
 			if (!send_retrans) {
@@ -631,7 +631,6 @@ static void nes_cm_timer_tick(unsigned long pass)
 					nes_debug(NES_DBG_CM, "rexmit failed for "
 						"node=%p\n", cm_node);
 					cm_packets_bounced++;
-					atomic_dec(&send_entry->skb->users);
 					send_entry->retrycount--;
 					nexttimeout = jiffies + NES_SHORT_TIME;
 					settimer = 1;
@@ -667,11 +666,6 @@ static void nes_cm_timer_tick(unsigned long pass)
 
 		spin_unlock_irqrestore(&cm_node->retrans_list_lock, flags);
 		rem_ref_cm_node(cm_node->cm_core, cm_node);
-		if (ret != NETDEV_TX_OK) {
-			nes_debug(NES_DBG_CM, "rexmit failed for cm_node=%p\n",
-				cm_node);
-			break;
-		}
 	}
 
 	if (settimer) {
-- 
1.5.3.3

From eli at dev.mellanox.co.il Mon Apr 6 23:59:55 2009
From: eli at dev.mellanox.co.il (Eli Cohen)
Date: Tue, 7 Apr 2009 09:59:55 +0300
Subject: [ofa-general] Re: [PATCH v3] mlx4_ib: Optimize hugetlab pages support
In-Reply-To: <49DA40B7.3040004@voltaire.com>
References: <20090129152725.GA26284@mtls03> <49D0EA11.2040409@Voltaire.COM> <20090402084406.GB21370@mtls03> <20090405065047.GA567@mtls03> <49DA40B7.3040004@voltaire.com>
Message-ID: <20090407065955.GA2308@mtls03>

On Mon, Apr 06, 2009 at 08:49:43PM +0300, Yossi Etigin wrote:
>
> I don't understand - if all area is huge pages, it does not mean that
> it fills full huge pages - I can have just 4096 bytes in huge page memory
> and umem->hugetlb will remain 1, right?

You may call ib_umem_get() with a fraction of a huge page but I expect
the number of pages returned from get_user_pages() will fill up a huge
page. Can you check that with the mckey test you were using?

From vlad at dev.mellanox.co.il Tue Apr 7 01:14:58 2009
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 07 Apr 2009 11:14:58 +0300
Subject: [ofa-general] MVAPICH_GCC seems incompatible with (RedHat)gcc-4.1.2-44
In-Reply-To: <49D46FC3.4050704@clustervision.com>
References: <49D46FC3.4050704@clustervision.com>
Message-ID: <49DB0B82.7000501@dev.mellanox.co.il>

Guido Passet wrote:
> Dear OpenFabrics list,
>
>
> RedHat and spinoffs recently shipped v5.3 and many RedHat alike
> distributions like Scientific Linux picked up the new flood of packages.
>
> One curiosity I am trying to debug is the fact that MVAPICH/MVAPICH2 do
> not seem to compile anymore after the update from:
>
> gcc-4.1.2-42.el5.x86_64 (RedHat 5.2)
>
> to:
>
> gcc-4.1.2-44.el5.x86_64 (RedHat 5.3)
>
> On running ./install.pl --build32 --all
>
> I get:
>
> mvapich_gcc is not available on this platform
> mvapich2_gcc is not available on this platform
>
> and no rpms are being build..
>
> While all other compilers (PGI/Pathscale and Intel) seem to work fine.
>
> I tried OFED-1.4.1-20090401-0600, OFED-1.4.1-20090319-0600,
> OFED-1.4-20090301-0600 and a couple more but I seem to be in trouble on
> all versions.
>
> Any pointers into a direction to get these stack parts compiled against
> GCC would be more than welcome.
>
> Cheers,
> Guido.

Hello Guido,

In order to start mvapich_gcc and mvapich2_gcc compilation, install.pl
expects to find gcc, g77 and gfortran installed (in the path) on the
server.
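
A check along these lines can be sketched as follows. This is a Python
illustration, not install.pl's actual Perl logic; missing_tools and its
which parameter are hypothetical names, and the tool list simply mirrors
the three compilers named above.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH.

    `which` is injectable so the lookup can be replaced in tests; by
    default it is shutil.which, which returns None for a missing program.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    absent = missing_tools(["gcc", "g77", "gfortran"])
    if absent:
        print("not found on PATH:", ", ".join(absent))
    else:
        print("all compilers found")
```

Run on the build host, this lists whichever of the three compilers is
absent from PATH.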

Please make sure that the following commands return the correct paths to
these binaries:

which gcc
which g77
which gfortran

Regards,
Vladimir

From tziporet at mellanox.co.il Tue Apr 7 02:34:53 2009
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 7 Apr 2009 12:34:53 +0300
Subject: [ofa-general] OFED 1.4.1-rc3 is available
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD024BEB7D@mtlexch01.mtl.com>

Hi,

OFED-1.4.1-rc3 release is available on
http://www.openfabrics.org/downloads/OFED/ofed-1.4.1/OFED-1.4.1-rc3.tgz

To get BUILD_ID run ofed_info

Please report any issues in bugzilla https://bugs.openfabrics.org/ for OFED 1.4.1

Note: RC4 is planned for April 20

Vladimir & Tziporet

========================================================================

Release information:
------------------------------

Linux Operating Systems:
- RedHat EL4 up4: 2.6.9-42.ELsmp *
- RedHat EL4 up5: 2.6.9-55.ELsmp
- RedHat EL4 up6: 2.6.9-67.ELsmp
- RedHat EL4 up7: 2.6.9-78.ELsmp
- RedHat EL5: 2.6.18-8.el5
- RedHat EL5 up1: 2.6.18-53.el5
- RedHat EL5 up2: 2.6.18-92.el5
- RedHat EL5 up3: 2.6.18-128.el5
- OEL 4.5: 2.6.9-55.ELsmp
- OEL 5.2: 2.6.18-92.el5
- CentOS 5.2: 2.6.18-92.el5
- Fedora C9: 2.6.25-14.fc9 *
- SLES10: 2.6.16.21-0.8-smp
- SLES10 SP1: 2.6.16.46-0.12-smp
- SLES10 SP1 up1: 2.6.16.53-0.16-smp
- SLES10 SP2: 2.6.16.60-0.21-smp
- SLES11 GA: 2.6.27.13-1-default
- OpenSuSE 10.3: 2.6.22.5-31 *
- kernel.org: 2.6.26 and 2.6.27
* Minimal QA for these versions

Systems:
* x86_64
* x86
* ia64
* ppc64

Main Changes from OFED-1.4.1-rc2
==========================
- Fix compilation issues with SLES 11 and IA64
- More cleanup in NFSRDMA backports
- Low level drivers updated: ipath, nes, mlx4, cxgb3
- 17 bugs fixed (see attachment)
- Attached kernel git tree changes for details

Tasks that should be completed for RC4 (Apr 20):
====================================
1. High priority bug fixes - see list below
2. Open MPI 1.3.2 release
3. Documentation update

Open bugs:
========
bug_id  bug_severity  op_sys   assigned_to                     short_short_desc
1538    blocker       Other    swise at opengridcomputing.com  how to deal with /usr/src/ofa_kernel/include/linux/autoconf.h?
1578    blocker       All      swise at opengridcomputing.com  ofa_kernel/include/linux/autoconf.h undefines macros
1589    critical      RHEL 5   jon at opengridcomputing.com    FRMR registration errors logged by cxgb3 during NFSRDMA iozone runs
1571    critical      RHEL 5   vu at mellanox.com              nfsrdma server crash @test5 connectathon basic test,
1528    major         RHEL 5   jackm at mellanox.co.il         IPoIB get stack when running Hadoop application.
1529    major         RHEL 5   jackm at mellanox.co.il         Opensm cannot be stopped following openib failure.
1545    major         Other    jackm at mellanox.co.il         Performance degradation in ofed 1.4.1 in TCP BW for some packets size
1570    major         SLES 10  Jeffrey.C.Becker at nasa.gov    rnfs-utils fails to compile on sles10sp2
1579    major         RHEL 5   jsquyres at cisco.com           OpenMPI-1.3.1-1: segfault during close
1581    major         Other    vlad at mellanox.co.il          Unable to uninstall OFED1.4 due to dependencies on tgt and scsi-target-utils
1587    major         Other    yosefe at voltaire.com          Adding and removing pkey from partition.conf cause to machine to hung

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ofed-1.4-rc3_rc2.log
Type: application/octet-stream
Size: 18102 bytes
Desc: ofed-1.4-rc3_rc2.log
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ofed-1.4.1-rc3-fixed-bugs.csv
Type: application/octet-stream
Size: 1907 bytes
Desc: ofed-1.4.1-rc3-fixed-bugs.csv
URL:

From guido.passet at clustervision.com Tue Apr 7 03:02:01 2009
From: guido.passet at clustervision.com (Guido Passet)
Date: Tue, 07 Apr 2009 12:02:01 +0200
Subject: [ofa-general] MVAPICH_GCC seems incompatible with (RedHat)gcc-4.1.2-44
In-Reply-To: <49DB0B82.7000501@dev.mellanox.co.il>
References: <49D46FC3.4050704@clustervision.com> <49DB0B82.7000501@dev.mellanox.co.il>
Message-ID: <49DB2499.3030907@clustervision.com>

Vladimir Sokolovsky wrote:
>
> In order to start mvapich_gcc and mvapich2_gcc compilation, install.pl
> expects to find gcc, g77 and gfortran installed (in the path) on the
> server.
>
> Please make sure that the following command return correct path to these
> binaries:
>
> which gcc
> which g77
> which gfortran

Thanks for suggesting this. I upgraded 5.2->5.3 and seem to have ignored
a warning about this. g77 was not available :/

Pebkac fixed ;)

Guido

From vlad at lists.openfabrics.org Tue Apr 7 03:27:04 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Tue, 7 Apr 2009 03:27:04 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090407-0200 daily build status
Message-ID: <20090407102704.C25D5E611B3@openfabrics.org>

This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:

From dledford at redhat.com Tue Apr 7 05:44:19 2009
From: dledford at redhat.com (Doug Ledford)
Date: Tue, 7 Apr 2009 08:44:19 -0400
Subject: [ofa-general] MVAPICH_GCC seems incompatible with (RedHat)gcc-4.1.2-44
In-Reply-To: <49DB2499.3030907@clustervision.com>
References: <49D46FC3.4050704@clustervision.com> <49DB0B82.7000501@dev.mellanox.co.il> <49DB2499.3030907@clustervision.com>
Message-ID:

On Apr 7, 2009, at 6:02 AM, Guido Passet wrote:

> Vladimir Sokolovsky wrote:
>>
>> In order to start mvapich_gcc and mvapich2_gcc compilation, install.pl
>> expects to find gcc, g77 and gfortran installed (in the path) on the
>> server.
>>
>> Please make sure that the following command return correct path to these
>> binaries:
>>
>> which gcc
>> which g77
>> which gfortran
>
> Thanks for suggesting this. I upgraded 5.2->5.3 and seem to have ignored
> a warning about this. g77 was not available :/
>
> Pebkac fixed ;)

Sounds like an install.pl problem to me. Both mvapich and mvapich2 will
not compile compatible f77/f90 code if you use g77 for one and gfortran
for the other. It should require *either* g77 or gfortran, not both.

--

Doug Ledford
GPG KeyID: CFBFF194
http://people.redhat.com/dledford

InfiniBand Specific RPMS
http://people.redhat.com/dledford/Infiniband

From knikanth at suse.de Mon Apr 6 23:33:23 2009
From: knikanth at suse.de (Nikanth Karthikesan)
Date: Tue, 7 Apr 2009 12:03:23 +0530
Subject: [ofa-general] [PATCH] Fix wrong dbg output and gcc warning when INFINIBAND_NES_DEBUG is not set
Message-ID: <200904071203.24032.knikanth@suse.de>

The debug message wrongly prints the address of a local variable. Also
when INFINIBAND_NES_DEBUG is not set, gcc emits an unused variable
warning. Fix it.

Signed-off-by: Nikanth Karthikesan

---

diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c
index 5242515..a34f0a5 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -859,7 +859,6 @@ static struct nes_cm_listener *find_listener(struct nes_cm_core *cm_core,
 {
 	unsigned long flags;
 	struct nes_cm_listener *listen_node;
-	__be32 tmp_addr = cpu_to_be32(dst_addr);
 
 	/* walk list and find cm_node associated with this session ID */
 	spin_lock_irqsave(&cm_core->listen_list_lock, flags);
@@ -877,7 +876,7 @@ static struct nes_cm_listener *find_listener(struct nes_cm_core *cm_core,
 	spin_unlock_irqrestore(&cm_core->listen_list_lock, flags);
 
 	nes_debug(NES_DBG_CM, "Unable to find listener for %pI4:%x\n",
-		  &tmp_addr, dst_port);
+		  cpu_to_be32(dst_addr), dst_port);
 
 	/* no listener */
 	return NULL;

From guido.passet at clustervision.com Tue Apr 7 06:43:26 2009
From: guido.passet at clustervision.com (Guido Passet)
Date: Tue, 07 Apr 2009 15:43:26 +0200
Subject: [ofa-general] non-gcc mpitests seem to compile with gcc compileflags
Message-ID: <49DB587E.2060508@clustervision.com>

Dear list,

I could be wrong but it looks like the non-gcc mpitests programs are
using incorrect compile flags.

Running rpm -iv /RPMS/sl-release-5.3-1.x86_64/x86_64/mpitests_mvapich_gcc-3.1-891.x86_64.rpm
Build mpitests_mvapich_pgi RPM
Running LDFLAGS='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -L/ofed/1.4.1-rc3/lib64 -L/ofed/1.4.1-rc3/lib' CFLAGS='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -I/ofed/1.4.1-rc3/include' CPPFLAGS='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -I/ofed/1.4.1-rc3/include' rpmbuild --rebuild --define '_topdir /var/tmp/OFED_topdir' --define 'dist %{nil}' --target x86_64 --define '_name mpitests_mvapich_pgi' --define 'root_path /' --define '_usr /ofed/1.4.1-rc3' --define 'path_to_mpihome /ofed/1.4.1-rc3/mpi/pgi/mvapich-1.1.0' /SRPMS/mpitests-3.1-891.src.rpm

/ofed/1.4.1-rc3/mpi/pgi/mvapich-1.1.0/bin/mpicc -I/ofed/1.4.1-rc3/mpi/pgi/mvapich-1.1.0/include -DMPI1 -O3 -c IMB_cpu_exploit.c
/ofed/1.4.1-rc3/mpi/pgi/mvapich-1.1.0/bin/mpicc -o IMB-MPI1 IMB.o IMB_declare.o IMB_init.o IMB_mem_manager.o IMB_parse_name_mpi1.o IMB_benchlist.o IMB_strgs.o IMB_err_handler.o IMB_g_info.o IMB_warm_up.o IMB_output.o IMB_pingpong.o IMB_pingping.o IMB_allreduce.o IMB_reduce_scatter.o IMB_reduce.o IMB_exchange.o IMB_bcast.o IMB_barrier.o IMB_allgather.o IMB_allgatherv.o IMB_gather.o IMB_gatherv.o IMB_scatter.o IMB_scatterv.o IMB_alltoall.o IMB_alltoallv.o IMB_sendrecv.o IMB_init_transfer.o IMB_chk_diff.o IMB_cpu_exploit.o
make[2]: Leaving directory `/var/tmp/OFED_topdir/BUILD/mpitests-3.1/IMB-3.1/src'
make[1]: Leaving directory `/var/tmp/OFED_topdir/BUILD/mpitests-3.1/IMB-3.1/src'
cd /var/tmp/OFED_topdir/BUILD/mpitests-3.1/osu_benchmarks-3.0 && make MPIHOME=/ofed/1.4.1-rc3/mpi/pgi/mvapich-1.1.0
make[1]: Entering directory `/var/tmp/OFED_topdir/BUILD/mpitests-3.1/osu_benchmarks-3.0'
/ofed/1.4.1-rc3/mpi/pgi/mvapich-1.1.0/bin/mpicc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -I/ofed/1.4.1-rc3/include -c -o osu_bcast.o osu_bcast.c
pgcc-Error-Unknown switch: -pipe
pgcc-Error-Unknown switch: -Wall
pgcc-Error-Unknown switch: -Wp,-D_FORTIFY_SOURCE=2
pgcc-Error-Unknown switch: -fexceptions
pgcc-Error-Unknown switch: -fstack-protector
pgcc-Error-Unknown switch: --param=ssp-buffer-size=4
pgcc-Error-Unknown switch: -m64
pgcc-Error-Unknown switch: -mtune=generic
make[1]: *** [osu_bcast.o] Error 1
make[1]: Leaving directory `/var/tmp/OFED_topdir/BUILD/mpitests-3.1/osu_benchmarks-3.0'
make: *** [osu] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.25349 (%install)

Cheers,
Guido.

From tziporet at mellanox.co.il Tue Apr 7 08:37:43 2009
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 7 Apr 2009 18:37:43 +0300
Subject: [ofa-general] EWG/OFED meeting minutes for Apr 6, 09
Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD024BF0F1@mtlexch01.mtl.com>

> This is OFED meeting agenda for today (April 6):

Meeting Summary:
==============
a. OFED 1.4.1 release status: RC3 done, decided on RC4 date and bugs
   that should be fixed for the release.
b. Reviewed the changes in OFED build process that will take place in
   OFED 1.5, following the BOF we had in Sonoma
c. Reminder: OFED roadmap is available on the web:
   http://www.openfabrics.org/txt/woody/roadmap.txt
d. All participants are requested to review bugzilla bugs and update
   status (if they are still reproduced)

Details:
======
> a. OFED 1.4.1 release status:
> Reminder for OFED 1.4.1 schedule:
* RC1 & RC2 - done
* RC3 - Apr 6
* RC4 - Apr 20
* GA - Apr 23

> Critical bugs :
We reviewed the high priority bugs and decreased the list to 11 bugs:

bug_id  bug_severity  op_sys   assigned_to                     short_short_desc
1538    blocker       Other    swise at opengridcomputing.com  how to deal with /usr/src/ofa_kernel/include/linux/autoconf.h?
1578    blocker       All      swise at opengridcomputing.com  ofa_kernel/include/linux/autoconf.h undefines macros
1589    critical      RHEL 5   jon at opengridcomputing.com    FRMR registration errors logged by cxgb3 during NFSRDMA iozone runs
1571    critical      RHEL 5   vu at mellanox.com              nfsrdma server crash @test5 connectathon basic test,
1528    major         RHEL 5   jackm at mellanox.co.il         IPoIB get stack when running Hadoop application.
1529    major         RHEL 5   jackm at mellanox.co.il         Opensm cannot be stopped following openib failure.
1545    major         Other    jackm at mellanox.co.il         Performance degradation in ofed 1.4.1 in TCP BW for some packets size
1570    major         SLES 10  Jeffrey.C.Becker at nasa.gov    rnfs-utils fails to compile on sles10sp2
1579    major         RHEL 5   jsquyres at cisco.com           OpenMPI-1.3.1-1: segfault during close
1581    major         Other    vlad at mellanox.co.il          Unable to uninstall OFED1.4 due to dependencies on tgt and scsi-target-utils
1587    major         Other    yosefe at voltaire.com          Adding and removing pkey from partition.conf cause to machine to hung

> b. OFED 1.5 release process:
> Make sure all participants are inline with decisions we took in Sonoma
> in the OFED/distro discussion:
> 1) Identify all user components which are not packaged as tarballs and
> notify them that it will be a condition of inclusion in OFED 1.5 that
> they are packaged as tarballs.
> 2) Ensure that each of the tarballs for OFED 1.5 has a unique version
> number, and that version numbers are updated appropriately as the
> tarball contents change.
> 3) Ensure that each use level component that depends on specific
> kernel feature actually checks for the existence of that kernel
> feature, and politely declines to compile/install if that component is
> not available.
> 4) Investigate making the OFED install work as non-root - Doug has
> offered tutorials using BUILD_ROOT and chroot mechanisms to do
> non-root builds.

5) bugzilla status: we have too many old bugs in bugzilla whose status
no one knows.
We therefore ask anyone who opened a bug to check whether it still
reproduces. If not, mark it as fixed; if it does, change the version
number to 1.4.1.

c. Bill raised the question of how we are going to address the MPI
requests that were raised in Sonoma. Tziporet reported that the API for
Memory registration notification is being driven by Jeff S. in the MPI
forum. Maybe Jeff will be on the call at the next meeting and will be
able to update us.

From Charles.Baker at Sun.COM Tue Apr 7 13:01:18 2009
From: Charles.Baker at Sun.COM (Chuck Baker)
Date: Tue, 07 Apr 2009 14:01:18 -0600
Subject: [ofa-general] How to recover from a bad MAD status (110) from lid 6
Message-ID: <49DBB10E.5030506@Sun.COM>

Hi,

I encountered an error while running load tests on RHEL5.2 OFED 1.4
connected to SRP targets, and am wondering how to recover.

The error I'm seeing is an I/O failure with EIO (I/O error) messages,
and my load generator failed.

Since the failure, the srp_daemon reports

srp_daemon -a -o -c -n -i mthca0
07/03/09 10:54:13 : bad MAD status (110) from lid 6
id_ext=0003ba0001005504,ioc_guid=0003ba0001005504,dgid=fe800000000000000003ba0001005506,pkey=ffff,service_id=0003ba0001005504,initiator_ext=0655000100ba0300
id_ext=0003ba000100575c,ioc_guid=0003ba000100575c,dgid=fe800000000000000003ba000100575e,pkey=ffff,service_id=0003ba000100575c,initiator_ext=5e57000100ba0300

Rebooting the target and then rebooting the initiator has made no
difference.

Any ideas on how to resolve this would be appreciated.

thanks
chuck

From or.gerlitz at gmail.com Tue Apr 7 13:19:21 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Tue, 7 Apr 2009 23:19:21 +0300
Subject: [ofa-general] ***SPAM*** Re: API for Memory registration notification
Message-ID: <15ddcffd0904071319l3ae098b2mc68bd89da819ce38@mail.gmail.com>

On Tue, Apr 7, 2009 at 6:37 PM, Tziporet Koren wrote:

> c. Bill raised the question how are we going to address MPI requests that
> were raised in Sonoma. Tziporet updated that API for Memory registration
> notification is being derived by Jeff S. in MPI forum. Maybe in next meeting
> Jeff will be on the call and he will be able to update us
>

Hi Jeff,

Maybe it's clear to everyone except me... but is the problem description
/ current proposed solutions available through some publicly available
email thread / presentation?

Or.

From swise at opengridcomputing.com Tue Apr 7 14:24:56 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 07 Apr 2009 16:24:56 -0500
Subject: [ofa-general] ofed autoconf.h
Message-ID: <49DBC4A8.8020708@opengridcomputing.com>

I've been charged to resolve OFED bugs 1538 and 1578 dealing with
autoconf.h. I seek input from the OFA lists.

There are two issues:

Bug 1538) the ofed kernel configure script creates an autoconf.h file
that is intended to be included _after_ the backing kernel autoconf.h.
The ofed autoconf.h undefines the CONFIG defines for all the ofed
supplied modules, and then redefines the ones that are "turned on" by
the config options (setup via install.pl). This forces
non-ofed/non-kernel.org kernel modules that are trying to build their
rdma modules on top of ofed to include the backing kernel autoconf.h and
then the ofed autoconf.h. Bug 1538 requests that this be resolved
somehow so all the kernel module has to do is simply include and the
ofed backport/includes will do the right thing. Currently, this results
in only the inclusion of the ofed autoconf.h. It was suggested that we
could add a "#include_next " to the ofed file and thus include both, but
that doesn't work because we need the backing kernel autoconf.h included
first, and then the ofed one.

Bug 1578) for ofed modules that are not configured in, the ofed
autoconf.h will #undef the CONFIG* defines for these modules.
So, for example, if you do not build in NFSRDMA with your ofed-1.4.1
installation, then all the CONFIG_ defines for NFS will be #undefed in
the ofed autoconf.h. If a kernel module then builds against this
autoconf.h, it will not have the NFS CONFIG defines set, even though the
backing kernel _does_ have these set. I think the idea originally was
that the backing kernel didn't have any of these ofed modules. But now
almost all of them are indeed in the upstream kernel. So should the ofed
install really be #undefining these?

A couple of questions:

1) do we think these should be resolved in ofed-1.4.1? Maybe we should
defer these to 1.5?

2) if we do want to fix them, any ideas on how best to handle this?

Here are my proposed solutions (dunno if they break anything)

1538: change the ofed configure script to create a fully-populated
autoconf.h that basically is a cat of the backing kernel tree autoconf.h
and the current ofed autoconf.h. That way, modules will get everything
when they include the ofed autoconf.h.

1578: I propose we don't #undef any modules that are not configured into
the ofed build. Thus if you don't build in NFSRDMA for ofed, the status
of the NFS CONFIG* defines will be based on the backing kernel tree
autoconf.h instead of always being turned off.

Comments?

Thanks,

Steve.

From ogerlitz at voltaire.com Tue Apr 7 14:34:04 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 8 Apr 2009 00:34:04 +0300 (IDT)
Subject: [ofa-general] Re: 1.5 roadmap
In-Reply-To:
References:
Message-ID:

from www.openfabrics.org/txt/woody/roadmap.txt [...]
> OFED 1.5
> Kernel base: 2.6.30
> Support for Mellanox vNIC (EoIB) and FCoIB with BridgeX device
> Support for Mellanox IB over Ethernet (TBD - under discussion)

None of these will be present in 2.6.30 - so as was agreed in all the
ofa f2f meetings over the last three years - I don't see why they should
land in 1.5.

Also, shouldn't the IBoE (RoEE) mode of the IB stack be designed /
implemented in a vendor-independent fashion? Putting a non-reviewed
driver such as a vNIC into a distro IB stack is one thing; putting a
non-reviewed RoEE patch set into the IB stack is a totally different
game, which I think should be rejected on the spot.

Or.

From or.gerlitz at gmail.com Tue Apr 7 15:23:51 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Wed, 8 Apr 2009 01:23:51 +0300
Subject: [ofa-general] ***SPAM*** Re: API for Memory registration notification
In-Reply-To: <15ddcffd0904071319l3ae098b2mc68bd89da819ce38@mail.gmail.com>
References: <15ddcffd0904071319l3ae098b2mc68bd89da819ce38@mail.gmail.com>
Message-ID: <15ddcffd0904071523m3c1020cdp2e453857a95c9134@mail.gmail.com>

On Tue, Apr 7, 2009 at 11:19 PM, Or Gerlitz wrote:
> is the problem description available through some publicly available email
> thread / presentation?

Okay Jeff, I found your Sonoma presentation, where things are listed
loud and clear.

Or.

From robert.j.woodruff at intel.com Tue Apr 7 15:42:43 2009
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Tue, 7 Apr 2009 15:42:43 -0700
Subject: [ofa-general] Re: 1.5 roadmap
In-Reply-To:
References:
Message-ID: <382A478CAD40FA4FB46605CF81FE39F4266DD407@orsmsx507.amr.corp.intel.com>

Or wrote,

>None of these will be present in 2.6.30 - so as was agreed in all
>the ofa f2f meetings over the last three years - I don't see how
>they why they should land in 1.5

I think that Mellanox has plans to submit their vNIC driver for review
prior to including it in OFED 1.5, as we agreed to in the ftf meetings.
Not sure on the RDMA over ethernet driver.

Also, I think that what we agreed to was that code should be submitted
and reviewed and queued for upstream inclusion prior to being accepted
into OFED. We did not say that it had to be in a released kernel, say
2.6.30, but if it were queued in Roland's tree for a future kernel,
e.g., 2.6.31, then OFED would allow it. At least that is the discussion
I recall.

woody

From or.gerlitz at gmail.com Tue Apr 7 16:01:18 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Wed, 8 Apr 2009 02:01:18 +0300
Subject: [ofa-general] Re: 1.5 roadmap
In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F4266DD407@orsmsx507.amr.corp.intel.com>
References: <382A478CAD40FA4FB46605CF81FE39F4266DD407@orsmsx507.amr.corp.intel.com>
Message-ID: <15ddcffd0904071601q30431502t41618f1438bf2a2a@mail.gmail.com>

On Wed, Apr 8, 2009 at 1:42 AM, Woodruff, Robert J
> Also, I think that what we agreed to was that code should be submitted and reviewed
> and queued for upstream inclusion prior to being accepted[...] We did not say that it
> had to be in a released kernel [...] but if it were queued in Roland's tree for a future
> kernel, then allow it.

Woody,

Except for one or two exceptions which I'm not sure about, for the most
part we aren't talking about inclusion of code which is queued for merge
vs. already merged, but rather about massive inclusion of non-reviewed
code.

Or.

From robert.j.woodruff at intel.com Tue Apr 7 16:08:18 2009
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Tue, 7 Apr 2009 16:08:18 -0700
Subject: [ofa-general] Re: 1.5 roadmap
In-Reply-To: <15ddcffd0904071601q30431502t41618f1438bf2a2a@mail.gmail.com>
References: <382A478CAD40FA4FB46605CF81FE39F4266DD407@orsmsx507.amr.corp.intel.com> <15ddcffd0904071601q30431502t41618f1438bf2a2a@mail.gmail.com>
Message-ID: <382A478CAD40FA4FB46605CF81FE39F4266DD46F@orsmsx507.amr.corp.intel.com>

Or wrote,

>Woody,

>Except for one or two exceptions which I'm not sure about for the most
>case, we aren't talking on inclusion of code which is queued for merge
>vs already merged but rather on massive inclusion of non reviewed
>code.

>Or.

Yes, I think we all agree on this, that we do not want to put large
blocks of non-reviewed code into OFED. My point was that I think that
Mellanox has the goal to get the reviews done before it is accepted into
OFED 1.5, but Tziporet can confirm this when she returns from vacation.

woody

From or.gerlitz at gmail.com Tue Apr 7 16:37:29 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Wed, 8 Apr 2009 02:37:29 +0300
Subject: ***SPAM*** Re: [ofa-general] Re: 1.5 roadmap
In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F4266DD46F@orsmsx507.amr.corp.intel.com>
References: <382A478CAD40FA4FB46605CF81FE39F4266DD407@orsmsx507.amr.corp.intel.com> <15ddcffd0904071601q30431502t41618f1438bf2a2a@mail.gmail.com> <382A478CAD40FA4FB46605CF81FE39F4266DD46F@orsmsx507.amr.corp.intel.com>
Message-ID: <15ddcffd0904071637q23406905x8e9636b51e02b2f0@mail.gmail.com>

> I think we all agree on this, that we do not want to put large blocks of non-reviewed code
> into OFED. My point was that I think that Mellanox has the goal to get the reviews done
> before it is accepted into OFED 1.5

Woody, per the roadmap link, the feature freeze is on May 7th - which is
30 days from now...
as things are going I don't see more than ONE of EoIB, IBoE, FCoIB being reviewed and merged into Roland's for-next branch on that date (actually I see well how none of them are accepted on that date). So we are talking again theory - anyway, this way or another, let's make sure the gate is closed for non reviewed/accepted code by May 7th (or whenever the group decides).

Saying all that, I would be more than happy to review and assist in acceptance of the EoIB driver and IBoE patch set - just show us the code...

Or.

From robert.j.woodruff at intel.com Tue Apr 7 16:55:43 2009
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Tue, 7 Apr 2009 16:55:43 -0700
Subject: [ofa-general] Re: 1.5 roadmap
In-Reply-To: <15ddcffd0904071637q23406905x8e9636b51e02b2f0@mail.gmail.com>
References: <382A478CAD40FA4FB46605CF81FE39F4266DD407@orsmsx507.amr.corp.intel.com> <15ddcffd0904071601q30431502t41618f1438bf2a2a@mail.gmail.com> <382A478CAD40FA4FB46605CF81FE39F4266DD46F@orsmsx507.amr.corp.intel.com> <15ddcffd0904071637q23406905x8e9636b51e02b2f0@mail.gmail.com>
Message-ID: <382A478CAD40FA4FB46605CF81FE39F4266DD4E8@orsmsx507.amr.corp.intel.com>

Or wrote,
> I think we all agree on this, that we do not want to put large blocks of non-reviewed code
> into OFED. My point was that I think that Mellanox has the goal to get the reviews done
> before it is accepted into OFED 1.5
>Woody, per the roadmap link, the feature freeze is in May 7th - which
>is 30 days from now... as things are going I don't see more than ONE
>of EoIB, IBoE, FCoIB being reviewed and merged into Roland's for-next
>branch on that date (actually I see well how none of them are accepted
>on that date). So we are talking again theory - anyway, this way or
>another, let's make sure the gate is closed for non reviewed/accepted
>code by May 7th (or whenever the group decides).
Saying all that, I
>would be more than happy to review and assist in acceptance of the
>EoIB driver and IBoE patch set - just show us the code...
>Or.

Good point. If we want to have a feature freeze date of May 7, it might be a good idea to get these code reviews started ASAP, and even then, it might be hard to get them all done before May 7.

woody

From weiny2 at llnl.gov Tue Apr 7 17:13:50 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Tue, 7 Apr 2009 17:13:50 -0700
Subject: [ofa-general] [PATCH] Fix further bugs around console closure and clean up code.
Message-ID: <20090407171350.7582ce84.weiny2@llnl.gov>

From: Ira Weiny
Date: Tue, 7 Apr 2009 16:46:18 -0700
Subject: [PATCH] Fix further bugs around console closure and clean up code.

It was found that closing the socket connection forcibly from a new connection would close the socket and leave the last connection unresponsive to "quit".

The reproducer was:

> ./opensm -B -console socket

(Open connection #1)
> telnet localhost 10000
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
OpenSM $

(From another terminal open connection #2)
> telnet localhost 10000
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
OpenSM Console connection already in use
kill other session (y/n)? y
OpenSM $

Connection #1 is forced to quit as it was supposed to; however, the "quit" command now does not work in connection #2 and no other users can connect!

It was found that the osm_console_exit and console_close functions had been intertwined in an unintended fashion. osm_console_exit was intended to be the opposite of osm_console_init (which closes not only any open connections but also the socket). console_close was intended to be used to close a single connection but not the socket.

This patch fixes all this by removing console_close, making cio_close close only the connection, and fixing osm_console_exit to properly clean up from osm_console_init.
Signed-off-by: Ira Weiny --- opensm/include/opensm/osm_console_io.h | 2 +- opensm/opensm/osm_console.c | 6 +++--- opensm/opensm/osm_console_io.c | 32 ++++++++++---------------------- 3 files changed, 14 insertions(+), 26 deletions(-) diff --git a/opensm/include/opensm/osm_console_io.h b/opensm/include/opensm/osm_console_io.h index f31d811..d1dbbdd 100644 --- a/opensm/include/opensm/osm_console_io.h +++ b/opensm/include/opensm/osm_console_io.h @@ -83,7 +83,7 @@ int is_console_enabled(osm_subn_opt_t *p_opt); #ifdef ENABLE_OSM_CONSOLE_SOCKET int cio_open(osm_console_t * p_oct, int new_fd, osm_log_t * p_log); -int cio_close(osm_console_t * p_oct); +int cio_close(osm_console_t * p_oct, osm_log_t * p_log); int is_authorized(osm_console_t * p_oct); #endif diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c index 97be931..182c64e 100644 --- a/opensm/opensm/osm_console.c +++ b/opensm/opensm/osm_console.c @@ -1197,7 +1197,7 @@ static void perfmgr_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) static void quit_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) { - osm_console_exit(&p_osm->console, &p_osm->log); + cio_close(&p_osm->console, &p_osm->log); } static void help_version(FILE * out, int detail) @@ -1459,7 +1459,7 @@ int osm_console(osm_opensm_t * p_osm) osm_console_prompt(p_oct->out); } } else - osm_console_exit(p_oct, p_log); + cio_close(p_oct, p_log); if (p_line) free(p_line); return 0; @@ -1469,7 +1469,7 @@ int osm_console(osm_opensm_t * p_osm) #ifdef ENABLE_OSM_CONSOLE_SOCKET /* If we are using a socket, we close the current connection */ if (p_oct->socket >= 0) { - cio_close(p_oct); + cio_close(p_oct, p_log); return 0; } #endif diff --git a/opensm/opensm/osm_console_io.c b/opensm/opensm/osm_console_io.c index 56a2c98..1aa4648 100644 --- a/opensm/opensm/osm_console_io.c +++ b/opensm/opensm/osm_console_io.c @@ -92,10 +92,13 @@ int is_console_enabled(osm_subn_opt_t * p_opt) #ifdef ENABLE_OSM_CONSOLE_SOCKET -int cio_close(osm_console_t 
* p_oct) +int cio_close(osm_console_t * p_oct, osm_log_t * p_log) { int rtnval = -1; if (p_oct && p_oct->in_fd > 0) { + OSM_LOG(p_log, OSM_LOG_INFO, + "Console connection closed: %s (%s)\n", + p_oct->client_hn, p_oct->client_ip); rtnval = close(p_oct->in_fd); p_oct->in_fd = -1; p_oct->out_fd = -1; @@ -105,20 +108,6 @@ int cio_close(osm_console_t * p_oct) return rtnval; } -static void console_close(osm_console_t * p_oct, osm_log_t * p_log) -{ - if (p_oct->socket > 0 && p_oct->in_fd != -1) { - OSM_LOG(p_log, OSM_LOG_INFO, - "Console connection closed: %s (%s)\n", - p_oct->client_hn, p_oct->client_ip); - cio_close(p_oct); - } - if (p_oct->socket > 0) { - close(p_oct->socket); - p_oct->socket = -1; - } -} - int cio_open(osm_console_t * p_oct, int new_fd, osm_log_t * p_log) { /* returns zero if opened fine, -1 otherwise */ @@ -135,7 +124,7 @@ int cio_open(osm_console_t * p_oct, int new_fd, osm_log_t * p_log) p_line = NULL; n = getline(&p_line, &len, file); if (n > 0 && (p_line[0] == 'y' || p_line[0] == 'Y')) - console_close(p_oct, p_log); + cio_close(p_oct, p_log); else { OSM_LOG(p_log, OSM_LOG_INFO, "Console connection aborted: %s (%s)\n", @@ -238,13 +227,12 @@ int osm_console_init(osm_subn_opt_t * opt, osm_console_t * p_oct, osm_log_t * p_ /* clean up and release resources */ void osm_console_exit(osm_console_t * p_oct, osm_log_t * p_log) { - /* currently just close the current connection, not the socket */ #ifdef ENABLE_OSM_CONSOLE_SOCKET - if (p_oct->socket > 0 && p_oct->in_fd != -1) { - OSM_LOG(p_log, OSM_LOG_INFO, - "Console connection closed: %s (%s)\n", - p_oct->client_hn, p_oct->client_ip); - cio_close(p_oct); + cio_close(p_oct, p_log); + if (p_oct->socket > 0) { + OSM_LOG(p_log, OSM_LOG_INFO, "Closing console socket\n"); + close(p_oct->socket); + p_oct->socket = -1; } #endif } -- 1.5.4.5 From rdreier at cisco.com Tue Apr 7 22:40:10 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 07 Apr 2009 22:40:10 -0700 Subject: [ofa-general] Re: [PATCH] Fix wrong 
dbg output and gcc warning when INFINIBAND_NES_DEBUG is not set
In-Reply-To: <200904071203.24032.knikanth@suse.de> (Nikanth Karthikesan's message of "Tue, 7 Apr 2009 12:03:23 +0530")
References: <200904071203.24032.knikanth@suse.de>
Message-ID:

> The debug message wrongly prints the address of a local variable. Also when
> INFINIBAND_NES_DEBUG is not set, gcc emits an unused variable warning. Fix it.

> nes_debug(NES_DBG_CM, "Unable to find listener for %pI4:%x\n",
> - &tmp_addr, dst_port);
> + cpu_to_be32(dst_addr), dst_port);

My understanding is that %pI4 wants a pointer (as all %p formats do) -- and every other use of %pI4 in the kernel that I looked at is passing a pointer to printk. Have you tested this patch with NES debug on? I would expect gcc to warn about passing a non-pointer to a %p format.

 - R.

From knikanth at suse.de Tue Apr 7 23:06:07 2009
From: knikanth at suse.de (Nikanth Karthikesan)
Date: Wed, 8 Apr 2009 11:36:07 +0530
Subject: [ofa-general] Re: [PATCH] Fix wrong dbg output and gcc warning when INFINIBAND_NES_DEBUG is not set
In-Reply-To:
References: <200904071203.24032.knikanth@suse.de>
Message-ID: <200904081136.08393.knikanth@suse.de>

On Wednesday 08 April 2009 11:10:10 Roland Dreier wrote:
> > The debug message wrongly prints the address of a local variable. Also
> > when INFINIBAND_NES_DEBUG is not set, gcc emits an unused variable
> > warning. Fix it.
> >
> > nes_debug(NES_DBG_CM, "Unable to find listener for %pI4:%x\n",
> > - &tmp_addr, dst_port);
> > + cpu_to_be32(dst_addr), dst_port);
>
> My understanding is that %pI4 wants a pointer (as all %p formats do) --
> and every other use of %pI4 in the kernel that I looked at is passing a
> pointer to printk. Have you tested this patch with NES debug on? I
> would expect gcc to warn about passing a non-pointer to a %p format.
>

Yes, I was a little hasty. Here is the corrected patch. Replaced %p with %x.

Thanks
Nikanth

The debug message wrongly prints the address of a local variable.
Also when INFINIBAND_NES_DEBUG is not set, gcc emits an unused variable warning. Fix it. Signed-off-by: Nikanth Karthikesan --- diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c index 5242515..14ffada 100644 --- a/drivers/infiniband/hw/nes/nes_cm.c +++ b/drivers/infiniband/hw/nes/nes_cm.c @@ -859,7 +859,6 @@ static struct nes_cm_listener *find_listener(struct nes_cm_core *cm_core, { unsigned long flags; struct nes_cm_listener *listen_node; - __be32 tmp_addr = cpu_to_be32(dst_addr); /* walk list and find cm_node associated with this session ID */ spin_lock_irqsave(&cm_core->listen_list_lock, flags); @@ -876,8 +875,8 @@ static struct nes_cm_listener *find_listener(struct nes_cm_core *cm_core, } spin_unlock_irqrestore(&cm_core->listen_list_lock, flags); - nes_debug(NES_DBG_CM, "Unable to find listener for %pI4:%x\n", - &tmp_addr, dst_port); + nes_debug(NES_DBG_CM, "Unable to find listener for %xI4:%x\n", + cpu_to_be32(dst_addr), dst_port); /* no listener */ return NULL; From vlad at lists.openfabrics.org Wed Apr 8 03:27:01 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 8 Apr 2009 03:27:01 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090408-0200 daily build status Message-ID: <20090408102701.AB0CEE28002@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on 
x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From hal.rosenstock at gmail.com Wed Apr 8 03:59:41 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 8 Apr 2009 06:59:41 -0400 Subject: [ofa-general] How to recover from a bad MAD status (110) from lid 6 In-Reply-To: <49DBB10E.5030506@Sun.COM> References: <49DBB10E.5030506@Sun.COM> Message-ID: On Tue, Apr 7, 2009 at 4:01 PM, Chuck Baker wrote: > Hi, > > I encountered an error while running load tests on RHEL5.2 OFED 1.4 > connected > to SRP targets, and am wondering how to recover. > > The error's I'm seeing is an I/O failed with EIO I/O error messages, and my > load > generator failed. 
>
> Since the failure, the srp_daemon reports
>
> srp_daemon -a -o -c -n -i mthca0
> 07/03/09 10:54:13 : bad MAD status (110) from lid 6

110 is ETIMEDOUT

id_ext=0003ba0001005504,ioc_guid=0003ba0001005504,dgid=fe800000000000000003ba0001005506,pkey=ffff,service_id=0003ba0001005504,initiator_ext=0655000100ba0300
> id_ext=0003ba000100575c,ioc_guid=0003ba000100575c,dgid=fe800000000000000003ba000100575e,pkey=ffff,service_id=0003ba000100575c,initiator_ext=5e57000100ba0300
>
> rebooting the target and then rebooting the initiator has made no
> difference.

Sounds like some sort of network issue.

> Any ideas on how to resolve this would be appreciated.

Can you smpquery between initiator and target ? If so, what about perfquery ?

Have you tried rebooting your SM ?

-- Hal

> thanks
> chuck

From hal.rosenstock at gmail.com Wed Apr 8 04:04:25 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Wed, 8 Apr 2009 07:04:25 -0400
Subject: [ofa-general] ibutils building errors
Message-ID:

Hi Yevgeny,

With the latest ibutils, I get the following errors when building:

ibdm_wrap.cpp: In function `int _wrap_IBPort_width_set(void*, Tcl_Interp*, int, Tcl_Obj* const*)':
ibdm_wrap.cpp:5410: invalid conversion from `const char*' to `char*'
ibdm_wrap.cpp: In function `int _wrap_IBPort_width_get(void*, Tcl_Interp*, int, Tcl_Obj* const*)':
ibdm_wrap.cpp:5506: invalid conversion from `const char*' to `char*'
ibdm_wrap.cpp: In function `int _wrap_IBPort_speed_set(void*, Tcl_Interp*, int, Tcl_Obj* const*)':
ibdm_wrap.cpp:5608: invalid conversion from `const char*' to `char*'
ibdm_wrap.cpp: In function `int _wrap_IBPort_speed_get(void*, Tcl_Interp*, int, Tcl_Obj* const*)':
ibdm_wrap.cpp:5704: invalid conversion from `const char*' to `char*'

Thanks for looking into this.
-- Hal From jsquyres at cisco.com Wed Apr 8 04:27:42 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 8 Apr 2009 06:27:42 -0500 Subject: [ofa-general] EWG/OFED meeting minutes for Apr 6, 09 In-Reply-To: <5D49E7A8952DC44FB38C38FA0D758EAD024BF0F1@mtlexch01.mtl.com> References: <5D49E7A8952DC44FB38C38FA0D758EAD024BF0F1@mtlexch01.mtl.com> Message-ID: On Apr 7, 2009, at 10:37 AM, Tziporet Koren wrote: > c. Bill raised the question how are we going to address MPI requests > that were raised in Sonoma. > > Tziporet updated that API for Memory registration notification is > being derived by Jeff S. in MPI forum. > Slight correction -- I'm polling the OpenFabrics-based MPI implementations for input on a specific proposal. Specifically: Open MPI, HP MPI, Intel MPI, Platform MPI (used to be Scali), MVAPICH. The MPI Forum is the standards body that governs MPI (analogous to how the IBTA governs IB). Once we decide *exactly* what we want (in terms of a concrete proposal, as opposed to a list of requirements that were presented at Sonoma), I'll send something here. -- Jeff Squyres Cisco Systems From brian at sun.com Wed Apr 8 07:13:43 2009 From: brian at sun.com (Brian J. Murrell) Date: Wed, 08 Apr 2009 10:13:43 -0400 Subject: [ofa-general] ofed autoconf.h In-Reply-To: <49DBC4A8.8020708@opengridcomputing.com> References: <49DBC4A8.8020708@opengridcomputing.com> Message-ID: <1239200023.1541.242.camel@pc.interlinx.bc.ca> Hi Steve, Thanx for looking into these two items for us. On Tue, 2009-04-07 at 16:24 -0500, Steve Wise wrote: > I've been charged to resolve OFED bugs 1538 and 1578 dealing with > autoconf.h. I seek input from the OFA lists. For the record of discussion, I opened these two bugs. > It was suggested that we could add a "#include_next > " to the ofed file and thus include both, but that > doesn't work because we need the backing kernel autoconf.h included > first, and then the ofed one. 
Isn't that just a function of where you put the #include_next in the ofed autoconf.h? If you put it at the top, the ofed autoconf.h will override anything in the backing kernel autoconf.h but if you put it at the bottom, the backing kernel's autoconf.h overrides values set in the ofed autoconf.h, no? > I think the idea originally was > that the backing kernel didn't have any of these ofed modules. That was my suspicion as well -- that this grew out of something that was acceptable to do before the OFED stack started providing more of what the backing kernel could be providing. > 1) do we think these should be resolved in ofed-1.4.1? I would really like to see them resolved in the 1.4.1 release as we have already missed the 1.4.0 release due to other external-module build problems. > Here are my proposed solutions (dunno if they break anything) > > 1538: change the ofed configure script to create a fully-populated > autoconf.h that basically is a cat of the back kernel tree autoconf.h > and the current ofed autoconf.h. That way, modules will get everything > when they include the ofed autoconf.h. What happens then if the user changes something in their kernel configuration (i.e. after having built kernel-ib{,-devel} and installing kernel-ib-devel) which is completely independent of OFED, like, say enabling the serial module? I'm definitely no module versioning expert, but I think such a change would be allowable and not invalidate the modules in kernel-ib, If anyone knows better than I, please do correct me here. But in such a case, "#include " will not reflect that kernel configuration change. This is not to say that such an operation will be common, but I'm just trying to find the solution that covers the most use-cases and still achieves our goal. Is this solution to merge the backing kernel's autoconf.h into the ofed one because of the perceived inability to use #include_next to effectively concat the files at (external module) compile time? 
Does my suggestion as to positioning of the #include_next in the ofed autoconf.h change your thoughts on the best solution to this problem?

> 1578: I propose we don't #undef any modules that are not configured into
> the ofed build.

Yes. This was my proposal as well.

> Thus if you don't build in NFSRDMA for ofed, the status
> of the NFS CONFIG* defines will be based on the backing kernel tree
> autoconf.h instead of always being turned off.

This is the solution I like. The use-case this covers is that I want to use infiniband in my network (for something other than NFS) and don't want to use the NFSRDMA supplied by the OFED stack and instead want to use the kernel's own provided NFS stack. Not replacing the vendor kernel supplied NFS stack reduces our risk and maintenance efforts.

b.

From swise at opengridcomputing.com Wed Apr 8 07:27:22 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 08 Apr 2009 09:27:22 -0500
Subject: [ofa-general] ofed autoconf.h
In-Reply-To: <1239200023.1541.242.camel@pc.interlinx.bc.ca>
References: <49DBC4A8.8020708@opengridcomputing.com> <1239200023.1541.242.camel@pc.interlinx.bc.ca>
Message-ID: <49DCB44A.7020200@opengridcomputing.com>

Brian J. Murrell wrote:
> Hi Steve,
>
> Thanx for looking into these two items for us.
>
> On Tue, 2009-04-07 at 16:24 -0500, Steve Wise wrote:
>
>> I've been charged to resolve OFED bugs 1538 and 1578 dealing with
>> autoconf.h. I seek input from the OFA lists.
>>
>
> For the record of discussion, I opened these two bugs.
>
>
>> It was suggested that we could add a "#include_next
>> " to the ofed file and thus include both, but that
>> doesn't work because we need the backing kernel autoconf.h included
>> first, and then the ofed one.
>> > > Isn't that just a function of where you put the #include_next in the > ofed autoconf.h? If you put it at the top, the ofed autoconf.h will > override anything in the backing kernel autoconf.h but if you put it at > the bottom, the backing kernel's autoconf.h overrides values set in the > ofed autoconf.h, no? > > Maybe. I thought include_next included it after the existing file was processed. Maybe I'm all wet though. I'll run experiments. If you're right, then my original patch to the configure script will handle 1538. >> I think the idea originally was >> that the backing kernel didn't have any of these ofed modules. >> > > That was my suspicion as well -- that this grew out of something that > was acceptable to do before the OFED stack started providing more of > what the backing kernel could be providing. > > >> 1) do we think these should be resolved in ofed-1.4.1? >> > > I would really like to see them resolved in the 1.4.1 release as we have > already missed the 1.4.0 release due to other external-module build > problems. > > You can always work around these issues, yes? >> Here are my proposed solutions (dunno if they break anything) >> >> 1538: change the ofed configure script to create a fully-populated >> autoconf.h that basically is a cat of the back kernel tree autoconf.h >> and the current ofed autoconf.h. That way, modules will get everything >> when they include the ofed autoconf.h. >> > > What happens then if the user changes something in their kernel > configuration (i.e. after having built kernel-ib{,-devel} and installing > kernel-ib-devel) which is completely independent of OFED, like, say > enabling the serial module? > You have to rebuild/reinstall ofed if you change the backing kernel. > I'm definitely no module versioning expert, but I think such a change > would be allowable and not invalidate the modules in kernel-ib, If > anyone knows better than I, please do correct me here. 
> > But in such a case, "#include " will not reflect that > kernel configuration change. This is not to say that such an operation > will be common, but I'm just trying to find the solution that covers the > most use-cases and still achieves our goal. > > If the include_next solution works, then we're all set... > Is this solution to merge the backing kernel's autoconf.h into the ofed > one because of the perceived inability to use #include_next to > effectively concat the files at (external module) compile time? Does my > suggestion as to positioning of the #include_next in > the ofed autoconf.h change your thoughts on the best solution to this > problem? > I'll let you know. > >> 1578: I propose we don't #undef any modules that are not configured into >> the ofed build. >> > > Yes. This was my proposal as well. > > >> Thus if you don't build in NFSRDMA for ofed, the status >> of the NFS CONFIG* defines will be based on the backing kernel tree >> autoconf.h instead of always being turned off. >> > > This is the solution I like. The use-case this covers is that I want to > use infiniband in my network (for something other than NFS) and don't > want to use the NFSRDMA supplied by the OFED stack and instead want to > use the kernel's own provided NFS stack. Not replacing the vendor > kernel supplied NFS stack reduces our risk and maintenance efforts. > > This does expose an issue, however. If an ofed release changes the kernel verbs or cm APIs, then it can break any rdma kernel modules that do not get rebuilt against the ofed headers. But this issue has always been there I guess. Steve. 
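A minimal sketch of the ordering question discussed in this thread, assuming a single C file can stand in for the two headers. The first group of defines plays the backing kernel's autoconf.h; the second group plays the body of the OFED autoconf.h when the #include_next sits at its top. All CONFIG_DEMO_* names and values are hypothetical and are not real kernel or OFED options. It only illustrates three things: the later definition wins, an #undef before each redefinition keeps the compiler from complaining about a redefined macro, and options the later section leaves alone keep the earlier values (the behaviour proposed for bug 1578).

```c
#include <assert.h>

/* ---- stand-in for the backing kernel's autoconf.h (hypothetical) ---- */
#define CONFIG_DEMO_SERIAL 1       /* unrelated to OFED */
#define CONFIG_DEMO_NFS 1          /* kernel's own NFS stack enabled */
#define CONFIG_DEMO_QUEUE_DEPTH 16 /* value OFED wants to override */

/* ---- stand-in for the OFED autoconf.h body, after the include ---- */
/* Only options that OFED actually sets are touched. Each one is
 * #undef'ed first so the redefinition is silent. */
#undef CONFIG_DEMO_QUEUE_DEPTH
#define CONFIG_DEMO_QUEUE_DEPTH 64

/* CONFIG_DEMO_NFS is deliberately not #undef'ed here, so the kernel's
 * setting stays visible to external modules. */

/* Accessors so the resulting configuration can be checked at run time. */
int demo_serial(void)      { return CONFIG_DEMO_SERIAL; }
int demo_nfs(void)         { return CONFIG_DEMO_NFS; }
int demo_queue_depth(void) { return CONFIG_DEMO_QUEUE_DEPTH; }

/* Sanity check of the merged view. */
void demo_check_config(void)
{
	assert(demo_serial() == 1);       /* untouched kernel option */
	assert(demo_nfs() == 1);          /* not wiped by the OFED section */
	assert(demo_queue_depth() == 64); /* OFED value overrides the kernel's */
}
```

If the two groups were swapped (the include placed at the bottom), the kernel values would be the ones that survive, which is the trade-off being debated above.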
From hnrose at comcast.net Wed Apr 8 08:04:45 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 8 Apr 2009 11:04:45 -0400 Subject: [ofa-general] ***SPAM*** [PATCH] infiniband-diags/vendstat: Update man page and examples for PortXmit/RcvDataSL counter support Message-ID: <20090408150444.GA24876@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/infiniband-diags/man/vendstat.8 b/infiniband-diags/man/vendstat.8 index e32650a..a73dcce 100644 --- a/infiniband-diags/man/vendstat.8 +++ b/infiniband-diags/man/vendstat.8 @@ -1,17 +1,17 @@ -.TH VENDSTAT 8 "February 15, 2007" "OpenIB" "OpenIB Diagnostics" +.TH VENDSTAT 8 "April 6, 2009" "OpenIB" "OpenIB Diagnostics" .SH NAME vendstat \- query InfiniBand vendor specific functions .SH SYNOPSIS .B vendstat -[\-d(ebug)] [\-G(uid)] [\-N] [\-w] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms] [\-V(ersion)] [\-h(elp)] +[\-d(ebug)] [\-G(uid)] [\-N] [\-w] [\-i] [\-c ] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms] [\-V(ersion)] [\-h(elp)] .SH DESCRIPTION .PP vendstat uses vendor specific MADs to access beyond the IB spec -vendor specific functionality. Currently, there is support only for -Mellanox InfiniSwitch-III (IS3). +vendor specific functionality. Currently, there is support for +Mellanox InfiniSwitch-III (IS3) and InfiniSwitch-IV (IS4). .SH OPTIONS @@ -22,6 +22,26 @@ show IS3 general information. .TP \fB\-w\fR show IS3 port xmit wait counters. +.TP +\fB\-i\fR +show IS4 counter group info. +.TP +\fB\-c\fR +configure IS4 counter groups. + +Configure IS4 counter groups 0 and 1. +First number is for counter group 0 and second is for +counter group 1. + +Group 0 counter config values: + 0 - PortXmitDataSL0-7 + 1 - PortXmitDataSL8-15 + 2 - PortRcvDataSL0-7 + +Group 1 counter config values: + 1 - PortXmitDataSL8-15 + 2 - PortRcvDataSL0-7 + 8 - PortRcvDataSL8-15 .SH COMMON OPTIONS @@ -77,6 +97,12 @@ attempted to be fulfilled, and will fail if it is not possible. 
vendstat -N 6 # read IS3 general information .PP vendstat -w 6 # read IS3 port xmit wait counters +.PP +vendstat -i 6 12 # read IS4 port 12 counter group info +.PP +vendstat -c 0,1 6,12 # configure IS4 port 12 counter groups for PortXmitDataSL +.PP +vendstat -c 2,8 6,12 # configure IS4 port 12 counter groups for PortRcvDataSL .SH AUTHOR .TP diff --git a/infiniband-diags/src/vendstat.c b/infiniband-diags/src/vendstat.c index 01ff0c3..240c4cb 100644 --- a/infiniband-diags/src/vendstat.c +++ b/infiniband-diags/src/vendstat.c @@ -177,7 +177,7 @@ void config_counter_groups(ib_portid_t *portid, int port) call.attrid = IB_MLX_IS4_CONFIG_COUNTER_GROUP; call.timeout = ibd_timeout; call.mod = port; - /* set config counter groups for groups 0 and 1 */ + /* configure counter groups for groups 0 and 1 */ call.method = IB_MAD_METHOD_SET; memset(&buf, 0, sizeof(buf)); @@ -239,8 +239,8 @@ int main(int argc, char **argv) const struct ibdiag_opt opts[] = { { "N", 'N', 0, NULL, "show IS3 general information"}, { "w", 'w', 0, NULL, "show IS3 port xmit wait counters"}, - { "i", 'i', 0, NULL, "show is4 counter group info"}, - { "c", 'c', 1, "", "set is4 config counter group"}, + { "i", 'i', 0, NULL, "show IS4 counter group info"}, + { "c", 'c', 1, "", "configure IS4 counter groups"}, { 0 } }; @@ -249,7 +249,8 @@ int main(int argc, char **argv) "-N 6\t\t# read IS3 general information", "-w 6\t\t# read IS3 port xmit wait counters", "-i 6 12\t# read IS4 port 12 counter group info", - "-c 0,1 6 12\t# set IS4 port 12 config counter group for XmitDataSL", + "-c 0,1 6 12\t# configure IS4 port 12 counter groups for PortXmitDataSL", + "-c 2,8 6 12\t# configure IS4 port 12 counter groups for PortRcvDataSL", NULL }; From brian at sun.com Wed Apr 8 08:52:02 2009 From: brian at sun.com (Brian J. 
Murrell) Date: Wed, 08 Apr 2009 11:52:02 -0400 Subject: [ofa-general] ofed autoconf.h In-Reply-To: <49DCB44A.7020200@opengridcomputing.com> References: <49DBC4A8.8020708@opengridcomputing.com> <1239200023.1541.242.camel@pc.interlinx.bc.ca> <49DCB44A.7020200@opengridcomputing.com> Message-ID: <1239205922.1541.296.camel@pc.interlinx.bc.ca> On Wed, 2009-04-08 at 09:27 -0500, Steve Wise wrote: > > Maybe. I thought include_next included it after the existing file was > processed. Hrm. You could be right about that. My interpretation was always that it included it immediately, as if the contents of the #include_next files were actually in the caller's file right where the #include_next is. > Maybe I'm all wet though. Or maybe it's me who is all wet. :-) > I'll run experiments. I just did. It seems to work as I thought. You do get macro redefinition warnings though for something defined in both files: bar/a.h:1:1: warning: "FOO" redefined which will be an error if you build with -Werror. :-( If including the kernel's autoconf.h *before* doing all of the OFED macros is the right solution (which I think it is) the warnings can be fixed by doing: #include_next #undef FOO #define FOO 1 But relative to bug 1578, I'd only want to see macros which are to be set to something "#undef"ed first and not have every macro "#undef"ed wholesale. > If you're > right, then my original patch to the configure script will handle 1538. Yes, indeed. > You can always work around these issues, yes? Yeah. It's ugly though. It essentially winds up being your "cat" solution (with some "#undef" removals to address bug 1578), to create a third autoconf.h in a temporary directory (which is basically a union of the two files) and put that temporary directory in the include path before the OFED include and backing kernel include paths. This is a hack that will need to be undone in a future Lustre release when this issue is resolved. > You have to rebuild/reinstall ofed if you change the backing kernel. 
Hrm. Even if I change something completely unrelated to OFED or networking at all, like say just changing CONFIG_SERIAL_8250 from m to y?

> If the include_next solution works, then we're all set...

Indeed. I think so too.

> This does expose an issue, however. If an ofed release changes the
> kernel verbs or cm APIs, then it can break any rdma kernel modules that
> do not get rebuilt against the ofed headers. But this issue has always
> been there I guess.

Yeah. I was considering that as well WRT bug 1578 and not wholesale "#undef"ing all macros leading to a mixture of kernel provided and OFED provided RDMA options. I wonder if this is something that is appropriate to do at (OFED) configure time, and simply bail if a mismatch is found with a "you can't do that. either change your ofed selections or disable FOO in your kernel configuration" type error message.

I don't think this particular problem is something we need to address for 1.4.1 though.

b.

From Charles.Baker at Sun.COM Wed Apr 8 09:35:24 2009
From: Charles.Baker at Sun.COM (Chuck Baker)
Date: Wed, 08 Apr 2009 10:35:24 -0600
Subject: [ofa-general] How to recover from a bad MAD status (110) from lid 6
In-Reply-To:
References: <49DBB10E.5030506@Sun.COM>
Message-ID: <49DCD24C.3080402@Sun.COM>

Thanks for the response. It turned out to be that the target hung, and a case of new user syndrome. All are back up and operational.

thanks
chuck

On 04/08/09 04:59, Hal Rosenstock wrote:
> On Tue, Apr 7, 2009 at 4:01 PM, Chuck Baker wrote:
>
>> Hi,
>>
>> I encountered an error while running load tests on RHEL5.2 OFED 1.4
>> connected
>> to SRP targets, and am wondering how to recover.
>>
>> The error's I'm seeing is an I/O failed with EIO I/O error messages, and my
>> load
>> generator failed.
>> >> Since the failure, the srp_daemon reports >> >> srp_daemon -a -o -c -n -i mthca0 >> 07/03/09 10:54:13 : bad MAD status (110) from lid 6 >> > > 110 is ETIMEDOUT > id_ext=0003ba0001005504,ioc_guid=0003ba0001005504,dgid=fe800000000000000003ba0001005506,pkey=ffff,service_id=0003ba0001005504,initiator_ext=0655000100ba0300 > >> id_ext=0003ba000100575c,ioc_guid=0003ba000100575c,dgid=fe800000000000000003ba000100575e,pkey=ffff,service_id=0003ba000100575c,initiator_ext=5e57000100ba0300 >> >> rebooting the target and then rebooting the initiator has made no >> difference. >> > > Sounds like some sort of network issue. > > >> Any ideas on how to resolve this would be appreciated. >> > > Can you smpquery between initiator and target ? If so, what about perfquery ? > > Have you tried rebooting your SM ? > > -- Hal > > >> thanks >> chuck >> From rdreier at cisco.com Wed Apr 8 10:21:51 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Apr 2009 10:21:51 -0700 Subject: [ofa-general] Re: [PATCH] Fix wrong dbg output and gcc warning when INFINIBAND_NES_DEBUG is not set In-Reply-To: <200904081136.08393.knikanth@suse.de> (Nikanth Karthikesan's message of "Wed, 8 Apr 2009 11:36:07 +0530") References: <200904071203.24032.knikanth@suse.de> <200904081136.08393.knikanth@suse.de> Message-ID: > + nes_debug(NES_DBG_CM, "Unable to find listener for %xI4:%x\n", > + cpu_to_be32(dst_addr), dst_port); Have you tested this? It seems like it will print the IP address as a (possibly byte-reversed) hex value followed by the literal string "I4" rather than printing it as a formatted IP address.
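To illustrate the concern, here is a minimal userspace sketch. It assumes the kernel's format handling treats this string the same way standard printf does, which is the point being made above: %x consumes the address argument and the following "I4" is ordinary text. The function name format_listener_msg and the sample values are made up for illustration and are not part of the nes driver.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper: formats the message with the same format string
 * as the proposed nes_debug() call, minus the trailing newline.
 * "%x" prints the address as hex; "I4" is then copied out literally.
 */
static int format_listener_msg(char *buf, size_t len,
                               unsigned int addr, unsigned int port)
{
    return snprintf(buf, len, "Unable to find listener for %xI4:%x",
                    addr, port);
}
```

Called with addr 0xc0a80001 and port 0x1234, this yields "Unable to find listener for c0a80001I4:1234". The kernel's IPv4 formatting extension hangs off %p (for example %pI4 with a pointer to the address), so %xI4 never reaches it.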
The problem you seem to be trying to solve is an unused variable warning when nes debugging is not enabled, but I don't think you can do it by removing the tmp_addr variable. The most robust solution would probably be to change the definition of nes_debug() so it appears to gcc to use all its parameters even when debugging is disabled. You could look at drivers/net/mlx4/mlx4.h for an example of one way to do that. - R. From swise at opengridcomputing.com Wed Apr 8 10:39:08 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 08 Apr 2009 12:39:08 -0500 Subject: [ofa-general] ofed autoconf.h In-Reply-To: <1239205922.1541.296.camel@pc.interlinx.bc.ca> References: <49DBC4A8.8020708@opengridcomputing.com> <1239200023.1541.242.camel@pc.interlinx.bc.ca> <49DCB44A.7020200@opengridcomputing.com> <1239205922.1541.296.camel@pc.interlinx.bc.ca> Message-ID: <49DCE13C.6040209@opengridcomputing.com> >> I'll run experiments. >> > > I just did. It seems to work as I thought. You do get macro > redefinition warnings though for something defined in both files: > > bar/a.h:1:1: warning: "FOO" redefined > > which will be an error if you build with -Werror. :-( > > If including the kernel's autoconf.h *before* doing all of the OFED > macros is the right solution (which I think it is) the warnings can be > fixed by doing: > > #include_next > > Ok we'll do this for 1538 and push it into ofed-1.4.1. > But relative to bug 1578, I'd only want to see macros which are to be > set to something "#undef"ed first and not have every macro "#undef"ed > wholesale. > > Yup. > >> You have to rebuild/reinstall ofed if you change the backing kernel. >> > > Hrm. Even if I change something completely unrelated to OFED or > networking at all, like say just changing CONFIG_SERIAL_8250 from m to > y? > > Probably not. But as a rule, the build of ofed is against a specific kernel and configuration. Changing the backing kernel config can cause problems unless you rebuild.
Especially if module versioning is on. > >> This does expose an issue, however. If an ofed release changes the >> kernel verbs or cm APIs, then it can break any rdma kernel modules that >> do not get rebuilt against the ofed headers. But this issue has always >> been there I guess. >> > > Yeah. I was considering that as well WRT bug 1578 and not wholesale > "#undef"ing all macros leading to a mixture of kernel provided and OFED > provided RDMA options. > > I wonder if this is something that is appropriate to do at (OFED) > configure time, and simply bail if a mismatch is found with a "you can't > do that. either change your ofed selections or disable FOO in your > kernel configuration" type error message. > Gimme an example of what you mean? > I don't think this particular problem is something we need to address > for 1.4.1 though. > > So 1578 can be deferred? Steve. From brian at sun.com Wed Apr 8 10:57:38 2009 From: brian at sun.com (Brian J. Murrell) Date: Wed, 08 Apr 2009 13:57:38 -0400 Subject: [ofa-general] ofed autoconf.h In-Reply-To: <49DCE13C.6040209@opengridcomputing.com> References: <49DBC4A8.8020708@opengridcomputing.com> <1239200023.1541.242.camel@pc.interlinx.bc.ca> <49DCB44A.7020200@opengridcomputing.com> <1239205922.1541.296.camel@pc.interlinx.bc.ca> <49DCE13C.6040209@opengridcomputing.com> Message-ID: <1239213458.1541.328.camel@pc.interlinx.bc.ca> On Wed, 2009-04-08 at 12:39 -0500, Steve Wise wrote: > > Ok we'll do this for 1538 and push it into ofed-1.4.1. Great. I will test as soon as it's pushed in to make sure it works. > > Yeah. I was considering that as well WRT bug 1578 and not wholesale > > "#undef"ing all macros leading to a mixture of kernel provided and OFED > > provided RDMA options. > > > > I wonder if this is something that is appropriate to do at (OFED) > > configure time, and simply bail if a mismatch is found with a "you can't > > do that.
either change your ofed selections or disable FOO in your > > kernel configuration" type error message. > > > > Gimme an example of what you mean? I don't know enough about the OFED stack to give you a specific example, but, if you know of an API change that is happening in a given release, you write an autoconf macro to test which API is available and if the wrong one is, bail out of configure with an (informative) error. > > I don't think this particular problem is something we need to address > > for 1.4.1 though. > > > > > > So 1578 can be deferred? No, I mean the problem of detecting API changes in configure. I'd still like to see 1578 addressed for 1.4.1 as the fix in 1538 is pretty useless without it as it will still mean needing to create a third, temporary version of the autoconf.h file. Cheers, b. From jgunthorpe at obsidianresearch.com Wed Apr 8 10:59:34 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Wed, 8 Apr 2009 11:59:34 -0600 Subject: [ofa-general] ofed autoconf.h In-Reply-To: <1239205922.1541.296.camel@pc.interlinx.bc.ca> References: <49DBC4A8.8020708@opengridcomputing.com> <1239200023.1541.242.camel@pc.interlinx.bc.ca> <49DCB44A.7020200@opengridcomputing.com> <1239205922.1541.296.camel@pc.interlinx.bc.ca> Message-ID: <20090408175934.GB9167@obsidianresearch.com> On Wed, Apr 08, 2009 at 11:52:02AM -0400, Brian J. Murrell wrote: > On Wed, 2009-04-08 at 09:27 -0500, Steve Wise wrote: > > > > Maybe. I thought include_next included it after the existing file was > > processed. > > Hrm. You could be right about that. My interpretation was always that > it included it immediately, as if the contents of the #include_next > files were actually in the caller's file right where the #include_next > is.
#include_next behaves identically to #include except that the include search path order is altered. Brian is right, the included file's contents are substituted immediately. Jason From rdreier at cisco.com Wed Apr 8 13:42:55 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Apr 2009 13:42:55 -0700 Subject: [ofa-general] Re: [PATCH v2] rdma_cm: create cm id even when port is down In-Reply-To: (Yossi Etigin's message of "Thu, 2 Apr 2009 23:37:14 +0300 (IDT)") References: <49D0E1A4.3070804@Voltaire.COM> <49D10673.3000006@Voltaire.COM> Message-ID: thanks, applied From chien.tin.tung at intel.com Wed Apr 8 13:47:55 2009 From: chien.tin.tung at intel.com (Chien Tung) Date: Wed, 8 Apr 2009 15:47:55 -0500 Subject: [ofa-general] [PATCH 1/3] RDMA/nes: Fix SFP+ PHY initialization Message-ID: <20090408204755.GA6800@ctung-MOBL> SFP+ PHY initialization has very long delays, incorrect settings for direct attach cable, and inconsistent link detection. Adjust delays to the minimum required by the PHY. Worst case is less than 4 seconds. Add new register settings for direct attach cable. Change link detection logic to use two new registers for more consistent link state detection. Reorganize code to shorten line length.
Signed-off-by: Chien Tung --- drivers/infiniband/hw/nes/nes_hw.c | 289 +++++++++++++++--------------------- 1 files changed, 122 insertions(+), 167 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c index 52e7340..c9af6a0 100644 --- a/drivers/infiniband/hw/nes/nes_hw.c +++ b/drivers/infiniband/hw/nes/nes_hw.c @@ -757,6 +757,10 @@ static int nes_init_serdes(struct nes_device *nesdev, u8 hw_rev, u8 port_count, ((port_count > 2) && (nesadapter->phy_type[0] == NES_PHY_TYPE_PUMA_1G))) { /* init serdes 1 */ + if (nesadapter->phy_type[0] == NES_PHY_TYPE_ARGUS) { + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_EMP0, 0x00000000); + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_EMP1, 0x00000000); + } nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_CDR_CONTROL1, 0x000000FF); if (nesadapter->phy_type[0] == NES_PHY_TYPE_PUMA_1G) { serdes_common_control = nes_read_indexed(nesdev, @@ -1259,203 +1263,155 @@ int nes_init_phy(struct nes_device *nesdev) { struct nes_adapter *nesadapter = nesdev->nesadapter; u32 counter = 0; - u32 sds_common_control0; + u32 sds; u32 mac_index = nesdev->mac_index; u32 tx_config = 0; u16 phy_data; u32 temp_phy_data = 0; u32 temp_phy_data2 = 0; - u32 i = 0; + u8 phy_type = nesadapter->phy_type[mac_index]; + u8 phy_index = nesadapter->phy_index[mac_index]; if ((nesadapter->OneG_Mode) && - (nesadapter->phy_type[mac_index] != NES_PHY_TYPE_PUMA_1G)) { + (phy_type != NES_PHY_TYPE_PUMA_1G)) { nes_debug(NES_DBG_PHY, "1G PHY, mac_index = %d.\n", mac_index); - if (nesadapter->phy_type[mac_index] == NES_PHY_TYPE_1G) { - printk(PFX "%s: Programming mdc config for 1G\n", __func__); + if (phy_type == NES_PHY_TYPE_1G) { tx_config = nes_read_indexed(nesdev, NES_IDX_MAC_TX_CONFIG); tx_config &= 0xFFFFFFE3; tx_config |= 0x04; nes_write_indexed(nesdev, NES_IDX_MAC_TX_CONFIG, tx_config); } - nes_read_1G_phy_reg(nesdev, 1, nesadapter->phy_index[mac_index], &phy_data); - nes_debug(NES_DBG_PHY, "Phy data from register 1 phy 
address %u = 0x%X.\n", - nesadapter->phy_index[mac_index], phy_data); - nes_write_1G_phy_reg(nesdev, 23, nesadapter->phy_index[mac_index], 0xb000); + nes_read_1G_phy_reg(nesdev, 1, phy_index, &phy_data); + nes_write_1G_phy_reg(nesdev, 23, phy_index, 0xb000); /* Reset the PHY */ - nes_write_1G_phy_reg(nesdev, 0, nesadapter->phy_index[mac_index], 0x8000); + nes_write_1G_phy_reg(nesdev, 0, phy_index, 0x8000); udelay(100); counter = 0; do { - nes_read_1G_phy_reg(nesdev, 0, nesadapter->phy_index[mac_index], &phy_data); - nes_debug(NES_DBG_PHY, "Phy data from register 0 = 0x%X.\n", phy_data); - if (counter++ > 100) break; + nes_read_1G_phy_reg(nesdev, 0, phy_index, &phy_data); + if (counter++ > 100) + break; } while (phy_data & 0x8000); /* Setting no phy loopback */ phy_data &= 0xbfff; phy_data |= 0x1140; - nes_write_1G_phy_reg(nesdev, 0, nesadapter->phy_index[mac_index], phy_data); - nes_read_1G_phy_reg(nesdev, 0, nesadapter->phy_index[mac_index], &phy_data); - nes_debug(NES_DBG_PHY, "Phy data from register 0 = 0x%X.\n", phy_data); - - nes_read_1G_phy_reg(nesdev, 0x17, nesadapter->phy_index[mac_index], &phy_data); - nes_debug(NES_DBG_PHY, "Phy data from register 0x17 = 0x%X.\n", phy_data); - - nes_read_1G_phy_reg(nesdev, 0x1e, nesadapter->phy_index[mac_index], &phy_data); - nes_debug(NES_DBG_PHY, "Phy data from register 0x1e = 0x%X.\n", phy_data); + nes_write_1G_phy_reg(nesdev, 0, phy_index, phy_data); + nes_read_1G_phy_reg(nesdev, 0, phy_index, &phy_data); + nes_read_1G_phy_reg(nesdev, 0x17, phy_index, &phy_data); + nes_read_1G_phy_reg(nesdev, 0x1e, phy_index, &phy_data); /* Setting the interrupt mask */ - nes_read_1G_phy_reg(nesdev, 0x19, nesadapter->phy_index[mac_index], &phy_data); - nes_debug(NES_DBG_PHY, "Phy data from register 0x19 = 0x%X.\n", phy_data); - nes_write_1G_phy_reg(nesdev, 0x19, nesadapter->phy_index[mac_index], 0xffee); - - nes_read_1G_phy_reg(nesdev, 0x19, nesadapter->phy_index[mac_index], &phy_data); - nes_debug(NES_DBG_PHY, "Phy data from register 
0x19 = 0x%X.\n", phy_data); + nes_read_1G_phy_reg(nesdev, 0x19, phy_index, &phy_data); + nes_write_1G_phy_reg(nesdev, 0x19, phy_index, 0xffee); + nes_read_1G_phy_reg(nesdev, 0x19, phy_index, &phy_data); /* turning on flow control */ - nes_read_1G_phy_reg(nesdev, 4, nesadapter->phy_index[mac_index], &phy_data); - nes_debug(NES_DBG_PHY, "Phy data from register 0x4 = 0x%X.\n", phy_data); - nes_write_1G_phy_reg(nesdev, 4, nesadapter->phy_index[mac_index], - (phy_data & ~(0x03E0)) | 0xc00); - /* nes_write_1G_phy_reg(nesdev, 4, nesadapter->phy_index[mac_index], - phy_data | 0xc00); */ - nes_read_1G_phy_reg(nesdev, 4, nesadapter->phy_index[mac_index], &phy_data); - nes_debug(NES_DBG_PHY, "Phy data from register 0x4 = 0x%X.\n", phy_data); - - nes_read_1G_phy_reg(nesdev, 9, nesadapter->phy_index[mac_index], &phy_data); - nes_debug(NES_DBG_PHY, "Phy data from register 0x9 = 0x%X.\n", phy_data); - /* Clear Half duplex */ - nes_write_1G_phy_reg(nesdev, 9, nesadapter->phy_index[mac_index], - phy_data & ~(0x0100)); - nes_read_1G_phy_reg(nesdev, 9, nesadapter->phy_index[mac_index], &phy_data); - nes_debug(NES_DBG_PHY, "Phy data from register 0x9 = 0x%X.\n", phy_data); + nes_read_1G_phy_reg(nesdev, 4, phy_index, &phy_data); + nes_write_1G_phy_reg(nesdev, 4, phy_index, (phy_data & ~(0x03E0)) | 0xc00); + nes_read_1G_phy_reg(nesdev, 4, phy_index, &phy_data); - nes_read_1G_phy_reg(nesdev, 0, nesadapter->phy_index[mac_index], &phy_data); - nes_write_1G_phy_reg(nesdev, 0, nesadapter->phy_index[mac_index], phy_data | 0x0300); - } else { - if ((nesadapter->phy_type[mac_index] == NES_PHY_TYPE_IRIS) || - (nesadapter->phy_type[mac_index] == NES_PHY_TYPE_ARGUS)) { - /* setup 10G MDIO operation */ - tx_config = nes_read_indexed(nesdev, NES_IDX_MAC_TX_CONFIG); - tx_config &= 0xFFFFFFE3; - tx_config |= 0x15; - nes_write_indexed(nesdev, NES_IDX_MAC_TX_CONFIG, tx_config); - } - if ((nesadapter->phy_type[mac_index] == NES_PHY_TYPE_ARGUS)) { - nes_read_10G_phy_reg(nesdev, 
nesadapter->phy_index[mac_index], 0x3, 0xd7ee); + /* Clear Half duplex */ + nes_read_1G_phy_reg(nesdev, 9, phy_index, &phy_data); + nes_write_1G_phy_reg(nesdev, 9, phy_index, phy_data & ~(0x0100)); + nes_read_1G_phy_reg(nesdev, 9, phy_index, &phy_data); - temp_phy_data = (u16)nes_read_indexed(nesdev, NES_IDX_MAC_MDIO_CONTROL); - mdelay(10); - nes_read_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 0x3, 0xd7ee); - temp_phy_data2 = (u16)nes_read_indexed(nesdev, NES_IDX_MAC_MDIO_CONTROL); + nes_read_1G_phy_reg(nesdev, 0, phy_index, &phy_data); + nes_write_1G_phy_reg(nesdev, 0, phy_index, phy_data | 0x0300); - /* - * if firmware is already running (like from a - * driver un-load/load, don't do anything. - */ - if (temp_phy_data == temp_phy_data2) { - /* configure QT2505 AMCC PHY */ - nes_write_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 0x1, 0x0000, 0x8000); - nes_write_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 0x1, 0xc300, 0x0000); - nes_write_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 0x1, 0xc302, 0x0044); - nes_write_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 0x1, 0xc318, 0x0052); - nes_write_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 0x1, 0xc319, 0x0008); - nes_write_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 0x1, 0xc31a, 0x0098); - nes_write_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 0x3, 0x0026, 0x0E00); - nes_write_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 0x3, 0x0027, 0x0001); - nes_write_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 0x3, 0x0028, 0xA528); + return 0; + } - /* - * remove micro from reset; chip boots from ROM, - * uploads EEPROM f/w image, uC executes f/w - */ - nes_write_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 0x1, 0xc300, 0x0002); + if ((phy_type == NES_PHY_TYPE_IRIS) || + (phy_type == NES_PHY_TYPE_ARGUS)) { + /* setup 10G MDIO operation */ + tx_config = nes_read_indexed(nesdev, NES_IDX_MAC_TX_CONFIG); + tx_config &= 0xFFFFFFE3; 
+ tx_config |= 0x15; + nes_write_indexed(nesdev, NES_IDX_MAC_TX_CONFIG, tx_config); + } + if ((phy_type == NES_PHY_TYPE_ARGUS)) { + /* Check firmware heartbeat */ + nes_read_10G_phy_reg(nesdev, phy_index, 0x3, 0xd7ee); + temp_phy_data = (u16)nes_read_indexed(nesdev, NES_IDX_MAC_MDIO_CONTROL); + udelay(1500); + nes_read_10G_phy_reg(nesdev, phy_index, 0x3, 0xd7ee); + temp_phy_data2 = (u16)nes_read_indexed(nesdev, NES_IDX_MAC_MDIO_CONTROL); - /* - * wait for heart beat to start to - * know loading is done - */ - counter = 0; - do { - nes_read_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 0x3, 0xd7ee); - temp_phy_data = (u16)nes_read_indexed(nesdev, NES_IDX_MAC_MDIO_CONTROL); - if (counter++ > 1000) { - nes_debug(NES_DBG_PHY, "AMCC PHY- breaking from heartbeat check \n"); - break; - } - mdelay(100); - nes_read_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 0x3, 0xd7ee); - temp_phy_data2 = (u16)nes_read_indexed(nesdev, NES_IDX_MAC_MDIO_CONTROL); - } while ((temp_phy_data2 == temp_phy_data)); + if (temp_phy_data != temp_phy_data2) + return 0; - /* - * wait for tracking to start to know - * f/w is good to go - */ - counter = 0; - do { - nes_read_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 0x3, 0xd7fd); - temp_phy_data = (u16)nes_read_indexed(nesdev, NES_IDX_MAC_MDIO_CONTROL); - if (counter++ > 1000) { - nes_debug(NES_DBG_PHY, "AMCC PHY- breaking from status check \n"); - break; - } - mdelay(1000); - /* - * nes_debug(NES_DBG_PHY, "AMCC PHY- phy_status not ready yet = 0x%02X\n", - * temp_phy_data); - */ - } while (((temp_phy_data & 0xff) != 0x50) && ((temp_phy_data & 0xff) != 0x70)); - - /* set LOS Control invert RXLOSB_I_PADINV */ - nes_write_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 0x1, 0xd003, 0x0000); - /* set LOS Control to mask of RXLOSB_I */ - nes_write_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 0x1, 0xc314, 0x0042); - /* set LED1 to input mode (LED1 and LED2 share same LED) */ - nes_write_10G_phy_reg(nesdev, 
nesadapter->phy_index[mac_index], 0x1, 0xd006, 0x0007); - /* set LED2 to RX link_status and activity */ - nes_write_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 0x1, 0xd007, 0x000A); - /* set LED3 to RX link_status */ - nes_write_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 0x1, 0xd008, 0x0009); + /* no heartbeat, configure the PHY */ + nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0x0000, 0x8000); + nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc300, 0x0000); + nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc302, 0x000C); + nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc316, 0x000A); + nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc318, 0x0052); + nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc319, 0x0008); + nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc31a, 0x0098); + nes_write_10G_phy_reg(nesdev, phy_index, 0x3, 0x0026, 0x0E00); + nes_write_10G_phy_reg(nesdev, phy_index, 0x3, 0x0027, 0x0001); - /* - * reset the res-calibration on t2 - * serdes; ensures it is stable after - * the amcc phy is stable - */ + /* setup LEDs */ + nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xd006, 0x0007); + nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xd007, 0x000A); + nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xd008, 0x0009); - sds_common_control0 = nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL0); - sds_common_control0 |= 0x1; - nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL0, sds_common_control0); + nes_write_10G_phy_reg(nesdev, phy_index, 0x3, 0x0028, 0xA528); - /* release the res-calibration reset */ - sds_common_control0 &= 0xfffffffe; - nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL0, sds_common_control0); + /* Bring PHY out of reset */ + nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc300, 0x0002); - i = 0; - while (((nes_read32(nesdev->regs + NES_SOFTWARE_RESET) & 0x00000040) != 0x00000040) - && (i++ < 5000)) { - /* mdelay(1); */ - } + /* Check for heartbeat */ + counter = 0; + mdelay(690); + 
nes_read_10G_phy_reg(nesdev, phy_index, 0x3, 0xd7ee); + temp_phy_data = (u16)nes_read_indexed(nesdev, NES_IDX_MAC_MDIO_CONTROL); + do { + if (counter++ > 150) { + nes_debug(NES_DBG_PHY, "No PHY heartbeat\n"); + break; + } + mdelay(1); + nes_read_10G_phy_reg(nesdev, phy_index, 0x3, 0xd7ee); + temp_phy_data2 = (u16)nes_read_indexed(nesdev, NES_IDX_MAC_MDIO_CONTROL); + } while ((temp_phy_data2 == temp_phy_data)); - /* - * wait for link train done before moving on, - * or will get an interupt storm - */ - counter = 0; - do { - temp_phy_data = nes_read_indexed(nesdev, NES_IDX_PHY_PCS_CONTROL_STATUS0 + - (0x200 * (nesdev->mac_index & 1))); - if (counter++ > 1000) { - nes_debug(NES_DBG_PHY, "AMCC PHY- breaking from link train wait \n"); - break; - } - mdelay(1); - } while (((temp_phy_data & 0x0f1f0000) != 0x0f0f0000)); + /* wait for tracking */ + counter = 0; + do { + nes_read_10G_phy_reg(nesdev, phy_index, 0x3, 0xd7fd); + temp_phy_data = (u16)nes_read_indexed(nesdev, NES_IDX_MAC_MDIO_CONTROL); + if (counter++ > 300) { + nes_debug(NES_DBG_PHY, "PHY did not track\n"); + break; } - } + mdelay(10); + } while (((temp_phy_data & 0xff) != 0x50) && ((temp_phy_data & 0xff) != 0x70)); + + /* setup signal integrity */ + nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xd003, 0x0000); + nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xF00D, 0x00FE); + nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xF00E, 0x0032); + nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xF00F, 0x0002); + nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc314, 0x0063); + + /* reset serdes */ + sds= nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL0 + + mac_index * 0x200); + sds|= 0x1; + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL0 + + mac_index * 0x200, sds); + sds&= 0xfffffffe; + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL0 + + mac_index * 0x200, sds); + + counter = 0; + while (((nes_read32(nesdev->regs + NES_SOFTWARE_RESET) & 0x00000040) != 0x00000040) + && (counter++ < 
5000)) + ; } return 0; } @@ -2483,19 +2439,18 @@ static void nes_process_mac_intr(struct nes_device *nesdev, u32 mac_number) nes_read_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 1, 0x9004); nes_read_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 1, 0x9005); /* check link status */ - nes_read_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 1, 1); + nes_read_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 1, 0x9003); temp_phy_data = (u16)nes_read_indexed(nesdev, NES_IDX_MAC_MDIO_CONTROL); - u32temp = 100; - do { - nes_read_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 1, 1); - phy_data = (u16)nes_read_indexed(nesdev, NES_IDX_MAC_MDIO_CONTROL); - if ((phy_data == temp_phy_data) || (!(--u32temp))) - break; - temp_phy_data = phy_data; - } while (1); + nes_read_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 3, 0x0021); + nes_read_indexed(nesdev, NES_IDX_MAC_MDIO_CONTROL); + nes_read_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 3, 0x0021); + phy_data = (u16)nes_read_indexed(nesdev, NES_IDX_MAC_MDIO_CONTROL); + + phy_data = (!temp_phy_data && (phy_data == 0x8000)) ? 0x4 : 0x0; + nes_debug(NES_DBG_PHY, "%s: Phy data = 0x%04X, link was %s.\n", - __func__, phy_data, nesadapter->mac_link_down ? "DOWN" : "UP"); + __func__, phy_data, nesadapter->mac_link_down[mac_index] ? "DOWN" : "UP"); break; case NES_PHY_TYPE_PUMA_1G: -- 1.5.3.3 From chien.tin.tung at intel.com Wed Apr 8 13:47:57 2009 From: chien.tin.tung at intel.com (Chien Tung) Date: Wed, 8 Apr 2009 15:47:57 -0500 Subject: [ofa-general] [PATCH 2/3] RDMA/nes: Add wide_ppm_offset parm for switch compatibility Message-ID: <20090408204757.GA1856@ctung-MOBL> We have observed unstable link with a new BNT switch. Add wide_ppm_offset parameter to allow the user to control the clock ppm offset on the CX4 interface for better compatibility. Default is 100ppm, setting it to 1 will increase it to 300ppm. Change default SerDes1 reference clock to external source. 
Signed-off-by: Chien Tung --- drivers/infiniband/hw/nes/nes_hw.c | 97 ++++++++++++++++++++++++----------- drivers/infiniband/hw/nes/nes_hw.h | 1 + 2 files changed, 67 insertions(+), 31 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c index c9af6a0..76cc1f1 100644 --- a/drivers/infiniband/hw/nes/nes_hw.c +++ b/drivers/infiniband/hw/nes/nes_hw.c @@ -46,6 +46,10 @@ static unsigned int nes_lro_max_aggr = NES_LRO_MAX_AGGR; module_param(nes_lro_max_aggr, uint, 0444); MODULE_PARM_DESC(nes_lro_max_aggr, "NIC LRO max packet aggregation"); +static int wide_ppm_offset; +module_param(wide_ppm_offset, int, 0644); +MODULE_PARM_DESC(wide_ppm_offset, "Increase CX4 interface clock ppm offset, 0=100ppm (default), 1=300ppm"); + static u32 crit_err_count; u32 int_mod_timer_init; u32 int_mod_cq_depth_256; @@ -546,8 +550,11 @@ struct nes_adapter *nes_init_adapter(struct nes_device *nesdev, u8 hw_rev) { msleep(1); } if (int_cnt > 1) { + u32 sds; spin_lock_irqsave(&nesadapter->phy_lock, flags); - nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL1, 0x0000F088); + sds = nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL1); + sds |= 0x00000040; + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL1, sds); mh_detected++; reset_value = nes_read32(nesdev->regs+NES_SOFTWARE_RESET); reset_value |= 0x0000003d; @@ -736,43 +743,48 @@ static int nes_init_serdes(struct nes_device *nesdev, u8 hw_rev, u8 port_count, { int i; u32 u32temp; - u32 serdes_common_control; + u32 sds; if (hw_rev != NE020_REV) { /* init serdes 0 */ + if (wide_ppm_offset && (nesadapter->phy_type[0] == NES_PHY_TYPE_CX4)) + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_CDR_CONTROL0, 0x000FFFAA); + else + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_CDR_CONTROL0, 0x000000FF); - nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_CDR_CONTROL0, 0x000000FF); if (nesadapter->phy_type[0] == NES_PHY_TYPE_PUMA_1G) { - serdes_common_control = nes_read_indexed(nesdev, - 
NES_IDX_ETH_SERDES_COMMON_CONTROL0); - serdes_common_control |= 0x000000100; - nes_write_indexed(nesdev, - NES_IDX_ETH_SERDES_COMMON_CONTROL0, - serdes_common_control); - } else if (!OneG_Mode) { - nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_HIGHZ_LANE_MODE0, 0x11110000); + sds = nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL0); + sds |= 0x00000100; + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL0, sds); } - if (((port_count > 1) && - (nesadapter->phy_type[0] != NES_PHY_TYPE_PUMA_1G)) || - ((port_count > 2) && - (nesadapter->phy_type[0] == NES_PHY_TYPE_PUMA_1G))) { - /* init serdes 1 */ - if (nesadapter->phy_type[0] == NES_PHY_TYPE_ARGUS) { - nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_EMP0, 0x00000000); - nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_EMP1, 0x00000000); - } - nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_CDR_CONTROL1, 0x000000FF); - if (nesadapter->phy_type[0] == NES_PHY_TYPE_PUMA_1G) { - serdes_common_control = nes_read_indexed(nesdev, - NES_IDX_ETH_SERDES_COMMON_CONTROL1); - serdes_common_control |= 0x000000100; - nes_write_indexed(nesdev, - NES_IDX_ETH_SERDES_COMMON_CONTROL1, - serdes_common_control); - } else if (!OneG_Mode) { - nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_HIGHZ_LANE_MODE1, 0x11110000); - } + if (!OneG_Mode) + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_HIGHZ_LANE_MODE0, 0x11110000); + + if (port_count < 2) + return 0; + + /* init serdes 1 */ + switch (nesadapter->phy_type[1]) { + case NES_PHY_TYPE_ARGUS: + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_EMP0, 0x00000000); + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_EMP1, 0x00000000); + break; + case NES_PHY_TYPE_CX4: + sds = nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL1); + sds &= 0xFFFFFFBF; + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL1, sds); + if (wide_ppm_offset) + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_CDR_CONTROL1, 0x000FFFAA); + else + nes_write_indexed(nesdev, 
NES_IDX_ETH_SERDES_CDR_CONTROL1, 0x000000FF); + break; + case NES_PHY_TYPE_PUMA_1G: + sds = nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL1); + sds |= 0x000000100; + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL1, sds); } + if (!OneG_Mode) + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_HIGHZ_LANE_MODE1, 0x11110000); } else { /* init serdes 0 */ nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL0, 0x00000008); @@ -2315,6 +2327,7 @@ static void nes_process_mac_intr(struct nes_device *nesdev, u32 mac_number) u16 temp_phy_data; u32 pcs_val = 0x0f0f0000; u32 pcs_mask = 0x0f1f0000; + u32 cdr_ctrl; spin_lock_irqsave(&nesadapter->phy_lock, flags); if (nesadapter->mac_sw_state[mac_number] != NES_MAC_SW_IDLE) { @@ -2466,6 +2479,17 @@ static void nes_process_mac_intr(struct nes_device *nesdev, u32 mac_number) } if (phy_data & 0x0004) { + if (wide_ppm_offset && + (nesadapter->phy_type[mac_index] == NES_PHY_TYPE_CX4) && + (nesadapter->hw_rev != NE020_REV)) { + cdr_ctrl = nes_read_indexed(nesdev, + NES_IDX_ETH_SERDES_CDR_CONTROL0 + + mac_index * 0x200); + nes_write_indexed(nesdev, + NES_IDX_ETH_SERDES_CDR_CONTROL0 + + mac_index * 0x200, + cdr_ctrl | 0x000F0000); + } nesadapter->mac_link_down[mac_index] = 0; list_for_each_entry(nesvnic, &nesadapter->nesvnic_list[mac_index], list) { nes_debug(NES_DBG_PHY, "The Link is UP!!. linkup was %d\n", @@ -2480,6 +2504,17 @@ static void nes_process_mac_intr(struct nes_device *nesdev, u32 mac_number) } } } else { + if (wide_ppm_offset && + (nesadapter->phy_type[mac_index] == NES_PHY_TYPE_CX4) && + (nesadapter->hw_rev != NE020_REV)) { + cdr_ctrl = nes_read_indexed(nesdev, + NES_IDX_ETH_SERDES_CDR_CONTROL0 + + mac_index * 0x200); + nes_write_indexed(nesdev, + NES_IDX_ETH_SERDES_CDR_CONTROL0 + + mac_index * 0x200, + cdr_ctrl & 0xFFF0FFFF); + } nesadapter->mac_link_down[mac_index] = 1; list_for_each_entry(nesvnic, &nesadapter->nesvnic_list[mac_index], list) { nes_debug(NES_DBG_PHY, "The Link is Down!!. 
linkup was %d\n", diff --git a/drivers/infiniband/hw/nes/nes_hw.h b/drivers/infiniband/hw/nes/nes_hw.h index f41a871..13bc26a 100644 --- a/drivers/infiniband/hw/nes/nes_hw.h +++ b/drivers/infiniband/hw/nes/nes_hw.h @@ -35,6 +35,7 @@ #include +#define NES_PHY_TYPE_CX4 1 #define NES_PHY_TYPE_1G 2 #define NES_PHY_TYPE_IRIS 3 #define NES_PHY_TYPE_ARGUS 4 -- 1.5.3.3 From chien.tin.tung at intel.com Wed Apr 8 13:47:59 2009 From: chien.tin.tung at intel.com (Chien Tung) Date: Wed, 8 Apr 2009 15:47:59 -0500 Subject: [ofa-general] [PATCH 3/3] RDMA/nes: Add support for new SFP+ PHY Message-ID: <20090408204759.GA4772@ctung-MOBL> Add new register settings for new SFP+ PHY/firmware. Add new phy to nes_netdev_get/set_settings. Signed-off-by: Chien Tung --- drivers/infiniband/hw/nes/nes_hw.c | 19 +++++++++--- drivers/infiniband/hw/nes/nes_hw.h | 1 + drivers/infiniband/hw/nes/nes_nic.c | 52 +++++++++++++++++++--------------- 3 files changed, 44 insertions(+), 28 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c index 76cc1f1..cbb64c4 100644 --- a/drivers/infiniband/hw/nes/nes_hw.c +++ b/drivers/infiniband/hw/nes/nes_hw.c @@ -765,7 +765,8 @@ static int nes_init_serdes(struct nes_device *nesdev, u8 hw_rev, u8 port_count, /* init serdes 1 */ switch (nesadapter->phy_type[1]) { - case NES_PHY_TYPE_ARGUS: + case NES_PHY_TYPE_ARGUS: + case NES_PHY_TYPE_SFP_D: nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_EMP0, 0x00000000); nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_EMP1, 0x00000000); break; @@ -1337,14 +1338,16 @@ int nes_init_phy(struct nes_device *nesdev) } if ((phy_type == NES_PHY_TYPE_IRIS) || - (phy_type == NES_PHY_TYPE_ARGUS)) { + (phy_type == NES_PHY_TYPE_ARGUS) || + (phy_type == NES_PHY_TYPE_SFP_D)) { /* setup 10G MDIO operation */ tx_config = nes_read_indexed(nesdev, NES_IDX_MAC_TX_CONFIG); tx_config &= 0xFFFFFFE3; tx_config |= 0x15; nes_write_indexed(nesdev, NES_IDX_MAC_TX_CONFIG, tx_config); } - if ((phy_type == 
NES_PHY_TYPE_ARGUS)) { + if ((phy_type == NES_PHY_TYPE_ARGUS) || + (phy_type == NES_PHY_TYPE_SFP_D)) { /* Check firmware heartbeat */ nes_read_10G_phy_reg(nesdev, phy_index, 0x3, 0xd7ee); temp_phy_data = (u16)nes_read_indexed(nesdev, NES_IDX_MAC_MDIO_CONTROL); @@ -1358,10 +1361,15 @@ int nes_init_phy(struct nes_device *nesdev) /* no heartbeat, configure the PHY */ nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0x0000, 0x8000); nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc300, 0x0000); - nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc302, 0x000C); nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc316, 0x000A); nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc318, 0x0052); - nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc319, 0x0008); + if (phy_type == NES_PHY_TYPE_ARGUS) { + nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc302, 0x000C); + nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc319, 0x0008); + } else { + nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc302, 0x0004); + nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc319, 0x0038); + } nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc31a, 0x0098); nes_write_10G_phy_reg(nesdev, phy_index, 0x3, 0x0026, 0x0E00); nes_write_10G_phy_reg(nesdev, phy_index, 0x3, 0x0027, 0x0001); @@ -2442,6 +2450,7 @@ static void nes_process_mac_intr(struct nes_device *nesdev, u32 mac_number) break; case NES_PHY_TYPE_ARGUS: + case NES_PHY_TYPE_SFP_D: /* clear the alarms */ nes_read_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 4, 0x0008); nes_read_10G_phy_reg(nesdev, nesadapter->phy_index[mac_index], 4, 0xc001); diff --git a/drivers/infiniband/hw/nes/nes_hw.h b/drivers/infiniband/hw/nes/nes_hw.h index 13bc26a..c3654c6 100644 --- a/drivers/infiniband/hw/nes/nes_hw.h +++ b/drivers/infiniband/hw/nes/nes_hw.h @@ -42,6 +42,7 @@ #define NES_PHY_TYPE_PUMA_1G 5 #define NES_PHY_TYPE_PUMA_10G 6 #define NES_PHY_TYPE_GLADIUS 7 +#define NES_PHY_TYPE_SFP_D 8 #define NES_MULTICAST_PF_MAX 8 diff --git 
a/drivers/infiniband/hw/nes/nes_nic.c b/drivers/infiniband/hw/nes/nes_nic.c index ecb1f6f..c6e6611 100644 --- a/drivers/infiniband/hw/nes/nes_nic.c +++ b/drivers/infiniband/hw/nes/nes_nic.c @@ -1426,49 +1426,55 @@ static int nes_netdev_get_settings(struct net_device *netdev, struct ethtool_cmd struct nes_vnic *nesvnic = netdev_priv(netdev); struct nes_device *nesdev = nesvnic->nesdev; struct nes_adapter *nesadapter = nesdev->nesadapter; + u32 mac_index = nesdev->mac_index; + u8 phy_type = nesadapter->phy_type[mac_index]; + u8 phy_index = nesadapter->phy_index[mac_index]; u16 phy_data; et_cmd->duplex = DUPLEX_FULL; et_cmd->port = PORT_MII; + et_cmd->maxtxpkt = 511; + et_cmd->maxrxpkt = 511; if (nesadapter->OneG_Mode) { et_cmd->speed = SPEED_1000; - if (nesadapter->phy_type[nesdev->mac_index] == NES_PHY_TYPE_PUMA_1G) { + if (phy_type == NES_PHY_TYPE_PUMA_1G) { et_cmd->supported = SUPPORTED_1000baseT_Full; et_cmd->advertising = ADVERTISED_1000baseT_Full; et_cmd->autoneg = AUTONEG_DISABLE; et_cmd->transceiver = XCVR_INTERNAL; - et_cmd->phy_address = nesdev->mac_index; + et_cmd->phy_address = mac_index; } else { - et_cmd->supported = SUPPORTED_1000baseT_Full | SUPPORTED_Autoneg; - et_cmd->advertising = ADVERTISED_1000baseT_Full | ADVERTISED_Autoneg; - nes_read_1G_phy_reg(nesdev, 0, nesadapter->phy_index[nesdev->mac_index], &phy_data); + et_cmd->supported = SUPPORTED_1000baseT_Full + | SUPPORTED_Autoneg; + et_cmd->advertising = ADVERTISED_1000baseT_Full + | ADVERTISED_Autoneg; + nes_read_1G_phy_reg(nesdev, 0, phy_index, &phy_data); if (phy_data & 0x1000) et_cmd->autoneg = AUTONEG_ENABLE; else et_cmd->autoneg = AUTONEG_DISABLE; et_cmd->transceiver = XCVR_EXTERNAL; - et_cmd->phy_address = nesadapter->phy_index[nesdev->mac_index]; + et_cmd->phy_address = phy_index; } + return 0; + } + if ((phy_type == NES_PHY_TYPE_IRIS) || + (phy_type == NES_PHY_TYPE_ARGUS) || + (phy_type == NES_PHY_TYPE_SFP_D)) { + et_cmd->transceiver = XCVR_EXTERNAL; + et_cmd->port = PORT_FIBRE; + 
et_cmd->supported = SUPPORTED_FIBRE; + et_cmd->advertising = ADVERTISED_FIBRE; + et_cmd->phy_address = phy_index; } else { - if ((nesadapter->phy_type[nesdev->mac_index] == NES_PHY_TYPE_IRIS) || - (nesadapter->phy_type[nesdev->mac_index] == NES_PHY_TYPE_ARGUS)) { - et_cmd->transceiver = XCVR_EXTERNAL; - et_cmd->port = PORT_FIBRE; - et_cmd->supported = SUPPORTED_FIBRE; - et_cmd->advertising = ADVERTISED_FIBRE; - et_cmd->phy_address = nesadapter->phy_index[nesdev->mac_index]; - } else { - et_cmd->transceiver = XCVR_INTERNAL; - et_cmd->supported = SUPPORTED_10000baseT_Full; - et_cmd->advertising = ADVERTISED_10000baseT_Full; - et_cmd->phy_address = nesdev->mac_index; - } - et_cmd->speed = SPEED_10000; - et_cmd->autoneg = AUTONEG_DISABLE; + et_cmd->transceiver = XCVR_INTERNAL; + et_cmd->supported = SUPPORTED_10000baseT_Full; + et_cmd->advertising = ADVERTISED_10000baseT_Full; + et_cmd->phy_address = mac_index; } - et_cmd->maxtxpkt = 511; - et_cmd->maxrxpkt = 511; + et_cmd->speed = SPEED_10000; + et_cmd->autoneg = AUTONEG_DISABLE; return 0; } -- 1.5.3.3 From rdreier at cisco.com Wed Apr 8 13:52:09 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Apr 2009 13:52:09 -0700 Subject: [ofa-general] Re: [PATCH] doc/ipoib: document CM, offloads, interrupt moderation In-Reply-To: (Or Gerlitz's message of "Thu, 26 Mar 2009 10:52:33 +0200 (IST)") References: Message-ID: thanks, applied From rdreier at cisco.com Wed Apr 8 14:06:56 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Apr 2009 14:06:56 -0700 Subject: [ofa-general] Re: [PATCH] RDMA/nes: Physical memory registration is incorrect In-Reply-To: <20090403203555.GA2608@dewood-MOBL> (Don Wood's message of "Fri, 3 Apr 2009 15:35:55 -0500") References: <20090403203555.GA2608@dewood-MOBL> Message-ID: > Code incorrectly failed memory registration if the buffer was > not page aligned. Also, the length field is mangled causing > the hardware to think the registration is much larger than it > really is. 
This doesn't look quite right: - if (buffer_list[i].addr & ~PAGE_MASK) { - /* TODO: Unwind allocated buffers */ - nes_free_resource(nesadapter, nesadapter->allocated_mrs, stag_index); - nes_debug(NES_DBG_MR, "Unaligned Memory Buffer: 0x%x\n", - (unsigned int) buffer_list[i].addr); - ibmr = ERR_PTR(-EINVAL); - kfree(nesmr); - goto reg_phys_err; - } if a buffer after the first one is not aligned, then I suspect you can't handle the registration -- ie you want to allow: addr len 0x1800 0x 800 0x3000 0x1000 0x5000 0x 800 but not cases like addr len 0x1800 0x 800 0x3800 0x 800 0x5000 0x 800 you could take a look at the code in mthca_reg_phys_mr() to see what I think is required (although I wouldn't be shocked if you spot a bug there too) - R. From rdreier at cisco.com Wed Apr 8 14:11:24 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Apr 2009 14:11:24 -0700 Subject: [ofa-general] Re: [PATCH] RDMA/nes: Incorrect casting for 32 bit driver In-Reply-To: <20090403204327.GA5556@dewood-MOBL> (Don Wood's message of "Fri, 3 Apr 2009 15:43:27 -0500") References: <20090403204327.GA5556@dewood-MOBL> Message-ID: thanks, applied From rdreier at cisco.com Wed Apr 8 14:24:00 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Apr 2009 14:24:00 -0700 Subject: [ofa-general] Re: [PATCH 2/2] RDMA/nes: fix nes_nic_cm_xmit() error handling In-Reply-To: <20090406192852.GA9316@flatif-MOBL> (Faisal Latif's message of "Mon, 6 Apr 2009 14:28:52 -0500") References: <20090406192852.GA9316@flatif-MOBL> Message-ID: thanks, applied 1 & 2 From rdreier at cisco.com Wed Apr 8 14:28:06 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Apr 2009 14:28:06 -0700 Subject: [ofa-general] Re: [PATCH 3/3] RDMA/nes: Add support for new SFP+ PHY In-Reply-To: <20090408204759.GA4772@ctung-MOBL> (Chien Tung's message of "Wed, 8 Apr 2009 15:47:59 -0500") References: <20090408204759.GA4772@ctung-MOBL> Message-ID: thanks, applied 1,2&3 From rdreier at cisco.com Wed Apr 8 14:36:48 2009 
From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Apr 2009 14:36:48 -0700 Subject: [ofa-general] Warning in nes_verbs.c Message-ID: By the way, with the latest kernel, I see the warning: drivers/infiniband/hw/nes/nes_verbs.c: In function 'nes_reg_mr': drivers/infiniband/hw/nes/nes_verbs.c:1955: warning: 'pbl_count_256' may be used uninitialized in this function I'm pretty sure this is a spurious warning but I'd appreciate it if you'd look the code over too, and see if you can come up with a clean way to avoid this. Thanks, Roland From swise at opengridcomputing.com Wed Apr 8 14:46:29 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 08 Apr 2009 16:46:29 -0500 Subject: [ofa-general] [PATCH v3] RDMA/cxgb3: Handle EEH events for active connections. In-Reply-To: References: <20090309102014.5738.98445.stgit@build.ogc.int> <49C1B7B1.6020108@opengridcomputing.com> <49C1B8EF.4010305@opengridcomputing.com> Message-ID: <49DD1B35.4000805@opengridcomputing.com> Roland Dreier wrote: > > these dependent patches are in net-next. > > OK I'll hang onto this and merge this into my second batch of things > after net-next goes into Linus's tree. > Hey Roland, so this will be in RC2 then? Thanks, Steve. From rdreier at cisco.com Wed Apr 8 15:03:25 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 08 Apr 2009 15:03:25 -0700 Subject: [ofa-general] [PATCH v3] RDMA/cxgb3: Handle EEH events for active connections. In-Reply-To: <49DD1B35.4000805@opengridcomputing.com> (Steve Wise's message of "Wed, 08 Apr 2009 16:46:29 -0500") References: <20090309102014.5738.98445.stgit@build.ogc.int> <49C1B7B1.6020108@opengridcomputing.com> <49C1B8EF.4010305@opengridcomputing.com> <49DD1B35.4000805@opengridcomputing.com> Message-ID: > Hey Roland, so this will be in RC2 then? Yes, I'll send it tomorrow I think. 
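A note on the "'pbl_count_256' may be used uninitialized" warning reported for nes_reg_mr() earlier in this digest: gcc usually emits it when a variable is assigned on one conditional path and read on another that the compiler cannot prove is guarded by the same condition. The sketch below is a minimal, hypothetical illustration of that pattern. The function and variable names and the 32-entries-per-block figure are assumptions made for the example and are not taken from nes_verbs.c. Giving the variable an explicit initializer is one common way to quiet the warning without changing behavior.

```c
/* Hypothetical helper, not from the nes driver: the result variable is
 * written and read under the same flag, yet older gcc versions may still
 * report "may be used uninitialized" when no initializer is present. */
static unsigned int count_entries(int use_small_blocks, unsigned int pages)
{
	unsigned int small_count = 0;	/* explicit initializer quiets the warning */

	if (use_small_blocks)
		small_count = (pages + 31) / 32;	/* assumed: 32 pages per small block */

	/* small_count is only read on the path that assigned it. */
	if (use_small_blocks)
		return small_count;

	return pages;
}
```

The initializer costs nothing at run time in a case like this, because the compiler can drop the dead store once it proves the value is overwritten before use.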
From knikanth at suse.de Thu Apr 9 00:14:06 2009 From: knikanth at suse.de (Nikanth Karthikesan) Date: Thu, 9 Apr 2009 12:44:06 +0530 Subject: [ofa-general] Re: [PATCH] Fix wrong dbg output and gcc warning when INFINIBAND_NES_DEBUG is not set In-Reply-To: References: <200904071203.24032.knikanth@suse.de> <200904081136.08393.knikanth@suse.de> Message-ID: <200904091244.07853.knikanth@suse.de> On Wednesday 08 April 2009 22:51:51 Roland Dreier wrote: > > + nes_debug(NES_DBG_CM, "Unable to find listener for %xI4:%x\n", > > + cpu_to_be32(dst_addr), dst_port); > > Have you tested this? It seems like it will print the IP address as a > (possibly byte-reversed) hex value followed by the literal string "I4" > rather than printing it as a formatted IP address. > No. Just Build tested. > The problem you seem to be trying to solve is an unused variable warning > when nes debugging is not enabled, but I don't think you can do it by > removing the tmp_addr variable. The most robust solution would probably > to change the definition of nes_debug() so it appears to gcc to use all > its parameters even when debugging is disabled. You could look at > drivers/net/mlx4/mlx4.h for an example of one way to do that. > Ah got it. Didnt know about the printk format %pI4. Sorry. Here are 3 different patches to silence gcc warning when nes debugging is not enabled. Please apply one of them. :) Thanks for your patience. Thanks Nikanth 1. Using __attribute__((unused)) Silence gcc warning for unused variable when INFINIBAND_NES_DEBUG is not set. 
Signed-off-by: Nikanth Karthikesan --- diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c index 5242515..a772bf0 100644 --- a/drivers/infiniband/hw/nes/nes_cm.c +++ b/drivers/infiniband/hw/nes/nes_cm.c @@ -859,7 +859,7 @@ static struct nes_cm_listener *find_listener(struct nes_cm_core *cm_core, { unsigned long flags; struct nes_cm_listener *listen_node; - __be32 tmp_addr = cpu_to_be32(dst_addr); + __be32 __attribute__((unused)) tmp_addr = cpu_to_be32(dst_addr); /* walk list and find cm_node associated with this session ID */ spin_lock_irqsave(&cm_core->listen_list_lock, flags); 2. Using your suggested approach. Silence gcc warning for unused variable when INFINIBAND_NES_DEBUG is not set. Signed-off-by: Nikanth Karthikesan --- diff --git a/drivers/infiniband/hw/nes/nes.h b/drivers/infiniband/hw/nes/nes.h index 04b12ad..4c4dcf8 100644 --- a/drivers/infiniband/hw/nes/nes.h +++ b/drivers/infiniband/hw/nes/nes.h @@ -136,27 +136,31 @@ #define NES_DBG_ALL 0xffffffff #ifdef CONFIG_INFINIBAND_NES_DEBUG + +#define INFINIBAND_NES_DEBUG (1) +#define NES_EVENT_TIMEOUT 1200000 + +#else + +#define INFINIBAND_NES_DEBUG (0) +#define NES_EVENT_TIMEOUT 100000 + +#endif + #define nes_debug(level, fmt, args...) \ do { \ - if (level & nes_debug_level) \ + if (INFINIBAND_NES_DEBUG && (level & nes_debug_level)) \ printk(KERN_ERR PFX "%s[%u]: " fmt, __func__, __LINE__, ##args); \ } while (0) #define assert(expr) \ do { \ - if (!(expr)) { \ + if (INFINIBAND_NES_DEBUG && !(expr)) { \ printk(KERN_ERR PFX "Assertion failed! %s, %s, %s, line %d\n", \ #expr, __FILE__, __func__, __LINE__); \ } \ } while (0) -#define NES_EVENT_TIMEOUT 1200000 -#else -#define nes_debug(level, fmt, args...) -#define assert(expr) do {} while (0) - -#define NES_EVENT_TIMEOUT 100000 -#endif #include "nes_hw.h" #include "nes_verbs.h" 3. Just dont define the unused variable. Silence gcc warning for unused variable when INFINIBAND_NES_DEBUG is not set. 
Signed-off-by: Nikanth Karthikesan --- diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c index 5242515..449d28d 100644 --- a/drivers/infiniband/hw/nes/nes_cm.c +++ b/drivers/infiniband/hw/nes/nes_cm.c @@ -859,7 +859,9 @@ static struct nes_cm_listener *find_listener(struct nes_cm_core *cm_core, { unsigned long flags; struct nes_cm_listener *listen_node; +#ifdef CONFIG_INFINIBAND_NES_DEBUG __be32 tmp_addr = cpu_to_be32(dst_addr); +#endif /* walk list and find cm_node associated with this session ID */ spin_lock_irqsave(&cm_core->listen_list_lock, flags); From ofedrnicuser at yahoo.com Thu Apr 9 01:35:33 2009 From: ofedrnicuser at yahoo.com (Bill N) Date: Thu, 9 Apr 2009 01:35:33 -0700 (PDT) Subject: [ofa-general] ***SPAM*** iwarp statistics numbers Message-ID: <579897.64639.qm@web111215.mail.gq1.yahoo.com> Hi, What does each of this statistics for iWarp indicate? 1. ipInTooBigErrors 2. ipInNoRoutes 3. ipInTruncatedPkts 4. ipOutForwDatagrams Regards, Bill From vlad at lists.openfabrics.org Thu Apr 9 03:22:04 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 9 Apr 2009 03:22:04 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090409-0200 daily build status Message-ID: <20090409102205.140C0E60FA0@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with 
linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From dsk at ci.uchicago.edu Wed Apr 8 13:21:27 2009 From: dsk at ci.uchicago.edu (Daniel S. 
Katz) Date: Wed, 8 Apr 2009 15:21:27 -0500 Subject: [ofa-general] [hpc-announce] Cluster 2009 extension References: <8EF545BB-7B8F-4D94-BF17-E044378DB686@ieee.org> Message-ID: <842A0761-1C6A-46F3-9BED-9890938AF79F@ci.uchicago.edu> ********************************************* * Cluster 2009 Technical Papers * ********************************************* Due to a number of requests for extension of submission of papers for Cluster 2009 (http://www.cluster2009.org/), the deadline schedule has been modified as follows: Technical Paper Abstracts due date: 14th April Full Technical Paper due date: 21st April Cluster 2009 welcomes paper and poster submissions on innovative work from researchers in academia, industry, and government, describing original research in the field of cluster computing. Topics of interest include, but are not limited to: Cluster Architecture and Hardware Systems Node architectures Packaging, Power, and Cooling Cluster Software and Middleware Software Environments and Tools Single -System Image Services Parallel File Systems and I/O Libraries Standard Software for Clusters Virtualization Cluster Networking High-Speed Interconnects High Performance Message Passing Libraries Lightweight Communication Protocols Implications of Multicore and Clouds on Clusters Hardware Architecture Software and Tools Networking Management Applications Applications Application Methods and Algorithms Adaptation to Multicore Data Distribution, Load Balancing & Scaling MPI/OpenMP Hybrid Computing Visualization Performance Analysis and Evaluation Benchmarking & Profiling Tools Performance Prediction & Modeling Cluster Management Security and Reliability High Availability Solutions Resource and Job Management Paper Submission: Paper Format: Since the camera-ready version of accepted papers must be compliant with the IEEE Xplore format for publication, submitted papers must conform to the following Xplore layout, page limit, and font size. 
This will ensure size consistency and a uniform layout for the reviewers. (With minimal changes, an accepted document can be styled for publication according to the Xplore requirements explained in the Xplore formatting guide, which is also in Xplore format). PDF files only. Maximum 10 pages. Single-spaced 8.5x11-inch, Two-column numbered pages in IEEE Xplore format Format instructions are available at: IEEE Paper LaTeX Template (ZIP file) IEEE Paper Word Template (ZIP file) Electronic Submission: Only web-based submission is accepted, at http://www.easychair.org/conferences/?conf=cluster09 From swise at opengridcomputing.com Thu Apr 9 09:52:01 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 09 Apr 2009 11:52:01 -0500 Subject: [ofa-general] ***SPAM*** iwarp statistics numbers In-Reply-To: <579897.64639.qm@web111215.mail.gq1.yahoo.com> References: <579897.64639.qm@web111215.mail.gq1.yahoo.com> Message-ID: <49DE27B1.7040801@opengridcomputing.com> Hey Bill, I thought these were all standard MIB variables for TCP and IP. But I don't see these in the MIB specs. I've CCed Chelsio since these are provided by the cxgb3 rnic. Steve. Bill N wrote: > Hi, > > What does each of this statistics for iWarp indicate? > > 1. ipInTooBigErrors > 2. ipInNoRoutes > 3. ipInTruncatedPkts > 4. 
ipOutForwDatagrams > > Regards, > Bill > > > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From swise at opengridcomputing.com Thu Apr 9 09:52:19 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 09 Apr 2009 11:52:19 -0500 Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Adjust ord/ird if needed for peer2peer connections Message-ID: <20090409165218.17033.63125.stgit@build.ogc.int> NFSRDMA currently fails to setup connections if peer2peer is on. This is due to the fact that the NFSRDMA client sets its ord to 0. If peer2peer is set, make sure the active side ord is >= 1 and the passive side ird is >=1. Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_cm.c | 8 ++++++++ 1 files changed, 8 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index 042cc4d..c1f121e 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -1830,6 +1830,10 @@ int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) ep->com.rpl_err = 0; ep->ird = conn_param->ird; ep->ord = conn_param->ord; + + if (peer2peer && ep->ird == 0) + ep->ird = 1; + PDBG("%s %d ird %d ord %d\n", __func__, __LINE__, ep->ird, ep->ord); get_ep(&ep->com); @@ -1915,6 +1919,10 @@ int iwch_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) conn_param->private_data, ep->plen); ep->ird = conn_param->ird; ep->ord = conn_param->ord; + + if (peer2peer && ep->ord == 0) + ep->ord = 1; + ep->com.tdev = h->rdev.t3cdev_p; cm_id->add_ref(cm_id); From ofedrnicuser at yahoo.com Thu Apr 9 11:17:00 2009 From: ofedrnicuser at yahoo.com (Bill N) Date: Thu, 9 Apr 2009 11:17:00 -0700 (PDT) Subject: [ofa-general] iwarp statistics numbers Message-ID: 
<342152.36601.qm@web111207.mail.gq1.yahoo.com> Hi Steve, Correct. I thought it was part of RFC 1156, but it's not. Even cxgb3 has them as reserved fields. Regards, Bill --- On Thu, 4/9/09, Steve Wise wrote: > From: Steve Wise > Subject: Re: [ofa-general] ***SPAM*** iwarp statistics numbers > To: "Bill N" > Cc: general at lists.openfabrics.org, "Chelsio Support" > Date: Thursday, April 9, 2009, 4:52 PM > Hey Bill, > > I thought these were all standard MIB variables for TCP and IP. But I don't see these in the MIB specs. I've CCed Chelsio since these are provided by the cxgb3 rnic. > > Steve. > > Bill N wrote: > > Hi, > > > > What does each of this statistics for iWarp indicate? > > > > 1. ipInTooBigErrors > > 2. ipInNoRoutes > > 3. ipInTruncatedPkts > > 4. ipOutForwDatagrams > > > > Regards, > > Bill From rdreier at cisco.com Thu Apr 9 14:58:59 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 09 Apr 2009 14:58:59 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get a batch of changes for -rc2, mostly low-level hardware driver fixes with a few other miscellaneous fixes and an IPoIB documentation update. 
Chien Tung (3): RDMA/nes: Fix SFP+ PHY initialization RDMA/nes: Add wide_ppm_offset parm for switch compatibility RDMA/nes: Add support for new SFP+ PHY Don Wood (1): RDMA/nes: Fix incorrect casts on 32-bit architectures Faisal Latif (2): RDMA/nes: Fix error handling issues RDMA/nes: Fix nes_nic_cm_xmit() error handling Or Gerlitz (1): IPoIB: Document newish features Roland Dreier (4): IB/mlx4: Use pgprot_writecombine() for BlueFlame pages mlx4_core: Don't leak mailbox for SET_PORT on Ethernet ports IPoIB: Avoid free_netdev() BUG when destroying a child interface Merge branches 'cma', 'cxgb3', 'ipoib', 'mlx4' and 'nes' into for-next Steve Wise (2): RDMA/cxgb3: Handle EEH events RDMA/cxgb3: Release dependent resources only when endpoint memory is freed. Yossi Etigin (2): RDMA/cma: Use rate from IPoIB broadcast when joining IPoIB multicast groups RDMA/cma: Create cm id even when IB port is down Documentation/infiniband/ipoib.txt | 45 ++++ drivers/infiniband/core/cma.c | 45 +++- drivers/infiniband/hw/cxgb3/cxio_hal.c | 10 +- drivers/infiniband/hw/cxgb3/cxio_hal.h | 6 + drivers/infiniband/hw/cxgb3/iwch.c | 11 +- drivers/infiniband/hw/cxgb3/iwch.h | 5 + drivers/infiniband/hw/cxgb3/iwch_cm.c | 116 ++++++--- drivers/infiniband/hw/cxgb3/iwch_cm.h | 3 +- drivers/infiniband/hw/cxgb3/iwch_qp.c | 4 +- drivers/infiniband/hw/mlx4/main.c | 3 +- drivers/infiniband/hw/nes/nes.h | 4 +- drivers/infiniband/hw/nes/nes_cm.c | 22 +- drivers/infiniband/hw/nes/nes_cm.h | 1 - drivers/infiniband/hw/nes/nes_hw.c | 389 ++++++++++++++--------------- drivers/infiniband/hw/nes/nes_hw.h | 2 + drivers/infiniband/hw/nes/nes_nic.c | 52 +++-- drivers/infiniband/ulp/ipoib/ipoib_vlan.c | 25 ++- drivers/net/mlx4/port.c | 5 +- 18 files changed, 442 insertions(+), 306 deletions(-) From weiny2 at llnl.gov Thu Apr 9 17:01:33 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 9 Apr 2009 17:01:33 -0700 Subject: [ofa-general] [PATCH] When no DM ports are found return an error Message-ID: 
<20090409170133.7c4a747c.weiny2@llnl.gov> I think I have found a bug in srp_daemon when running on a fabric with no SRP targets. The patch below fixes the issue. This is when running against OpenSM 3.2.5. Without this fix the daemon goes into an endless loop looking for the node with lid 0! Ira From: Ira Weiny Date: Thu, 9 Apr 2009 16:41:58 -0700 Subject: [PATCH] When no DM ports are found return an error. Signed-off-by: Ira Weiny --- srp_daemon/srp_daemon.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/srp_daemon/srp_daemon.c b/srp_daemon/srp_daemon.c index 5e1e198..621678b 100644 --- a/srp_daemon/srp_daemon.c +++ b/srp_daemon/srp_daemon.c @@ -943,6 +943,9 @@ static int do_dm_port_list(struct resources *res) size = ib_get_attr_size(in_sa_mad->attr_offset); + if (!size) + return -1; + for (i = 0; (i + 1) * size <= len - MAD_RMPP_HDR_SIZE; ++i) { port_info = (void *) in_sa_mad->data + i * size; -- 1.5.4.5 From acceptany at gmail.com Thu Apr 9 23:04:59 2009 From: acceptany at gmail.com (Jordan) Date: Fri, 10 Apr 2009 14:04:59 +0800 Subject: [ofa-general] ***SPAM*** some suggestion for the up*/down* routing algorithm Message-ID: <91fe68d50904092304x6b5ea6e8g5b73522494b1f979@mail.gmail.com> Recently, I read the source code of the up*/down* routing algorithm. It still uses BFS to generate the spanning tree, whose performance is inferior to the improved variant that uses DFS to generate the spanning tree. The DFS approach was proposed in the paper "New Methodology to Compute Deadlock-Free Routing Tables for Irregular Networks", and it can increase throughput and reduce latency. So I think it would be better to use DFS to generate the spanning tree in the up*/down* routing algorithm. Why not use DFS in the up*/down* routing algorithm? Is it hard to implement, or is there some other reason? 
From vlad at lists.openfabrics.org Fri Apr 10 03:20:48 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 10 Apr 2009 03:20:48 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090410-0200 daily build status Message-ID: <20090410102048.4B75AE61023@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 
with linux-2.6.21.1 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From worleys at gmail.com Fri Apr 10 07:53:09 2009 From: worleys at gmail.com (Chris Worley) Date: Fri, 10 Apr 2009 08:53:09 -0600 Subject: [ofa-general] iSer tuning guide? Message-ID: I'm running RHEL5.2 w/ the stock OFED RPMs. I have a target disk that runs at 800MB/s locally and QDR IB. I need to get some performance out of this, as I'm seeing <300MB/s in my benchmarks. On the initiator side, I run a test using 1MB block sizes. On the target side I'm seeing everything in 4K packets. Any idea who's setting this and why? Is there a tuning guide available? Thanks, Chris From yosefe at voltaire.com Fri Apr 10 08:45:08 2009 From: yosefe at voltaire.com (Yossi Etigin) Date: Fri, 10 Apr 2009 18:45:08 +0300 Subject: [ofa-general] [PATCH] ipoib: disable napi while cq is being drained Message-ID: <49DF6984.4090000@voltaire.com> If napi is enabled while cq is being drained, it creates a race on priv->ibwc between ipoib_poll and ipoib_drain_cq, leading to memory corruption. The solution is to enable/disable napi in ipoib_ib_dev_open/stop instead of in ipoib_open/stop, and sync napi on the INITIALIZED bit instead of the ADMIN_UP bit. This way napi will be disabled when ipoib_drain_cq is called. Signed-off-by: Yossi Etigin --- Fix bugzilla #1587. 
Index: b/drivers/infiniband/ulp/ipoib/ipoib_ib.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2009-04-09 13:48:27.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2009-04-10 17:58:10.000000000 +0300 @@ -685,7 +685,8 @@ int ipoib_ib_dev_open(struct net_device queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, round_jiffies_relative(HZ)); - set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags); + if (!test_and_set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) + napi_enable(&priv->napi); return 0; } @@ -804,7 +805,8 @@ int ipoib_ib_dev_stop(struct net_device struct ipoib_tx_buf *tx_req; int i; - clear_bit(IPOIB_FLAG_INITIALIZED, &priv->flags); + if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &priv->flags)) + napi_disable(&priv->napi); ipoib_cm_dev_stop(dev); Index: b/drivers/infiniband/ulp/ipoib/ipoib_main.c =================================================================== --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c 2009-04-09 15:16:24.000000000 +0300 +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c 2009-04-10 18:26:11.000000000 +0300 @@ -106,8 +106,7 @@ int ipoib_open(struct net_device *dev) ipoib_dbg(priv, "bringing up interface\n"); - if (!test_and_set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) - napi_enable(&priv->napi); + set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); if (ipoib_pkey_dev_delay_open(dev)) return 0; @@ -143,7 +142,6 @@ err_stop: ipoib_ib_dev_stop(dev, 1); err_disable: - napi_disable(&priv->napi); clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); return -EINVAL; @@ -156,7 +154,6 @@ static int ipoib_stop(struct net_device ipoib_dbg(priv, "stopping interface\n"); clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags); - napi_disable(&priv->napi); netif_stop_queue(dev); From chien.tin.tung at intel.com Fri Apr 10 10:09:40 2009 From: chien.tin.tung at intel.com (Chien Tung) Date: Fri, 10 Apr 2009 12:09:40 -0500 Subject: [ofa-general] [PATCH] RDMA/nes: remove compiler warning nes_verbs.c:1955 
Message-ID: <20090410170940.GA2896@ctung-MOBL> Initialize pbl_count_256 to 0 to get rid of this warning. drivers/infiniband/hw/nes/nes_verbs.c: In function 'nes_reg_mr': drivers/infiniband/hw/nes/nes_verbs.c:1955: warning: 'pbl_count_256' may be used uninitialized in this function Reported-by: Roland Dreier Signed-off-by: Chien Tung --- drivers/infiniband/hw/nes/nes_verbs.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c index 7e5b5ba..9279d05 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.c +++ b/drivers/infiniband/hw/nes/nes_verbs.c @@ -1952,7 +1952,7 @@ static int nes_reg_mr(struct nes_device *nesdev, struct nes_pd *nespd, int ret; struct nes_adapter *nesadapter = nesdev->nesadapter; uint pg_cnt = 0; - u16 pbl_count_256; + u16 pbl_count_256 = 0; u16 pbl_count = 0; u8 use_256_pbls = 0; u8 use_4k_pbls = 0; -- 1.5.3.3 From worleys at gmail.com Fri Apr 10 13:35:49 2009 From: worleys at gmail.com (Chris Worley) Date: Fri, 10 Apr 2009 14:35:49 -0600 Subject: [ofa-general] Re: iSer tuning guide? In-Reply-To: References: Message-ID: What version of OFED has both a stable and high performance iSer? I set up the latest OFED (1.4.1rc3), and the speed kicked in, but running fio on the initiator against two targets hung quickly on the initiator: connection3:0: ping timeout of 5 secs expired, last rx 4557795905, last ping 4557800905, now 4557805905 connection3:0: detected conn error (1011) iser: iscsi_iser_ep_disconnect:ib conn ffff8104dc482ec0 state 2 iser: iser_cq_tasklet_fn:comp w. error op 0 status 5 iser: iser_cq_tasklet_fn:comp w. error op 0 status 5 iser: iser_cq_tasklet_fn:comp w. error op 0 status 5 iser: iser_cq_tasklet_fn:comp w. error op 0 status 5 iser: iser_cq_tasklet_fn:comp w.
error op 0 status 5 connection4:0: ping timeout of 5 secs expired, last rx 4557796701, last ping 4557801701, now 4557806701 connection4:0: detected conn error (1011) connection1:0: ping timeout of 5 secs expired, last rx 4557799350, last ping 4557804350, now 4557809350 connection1:0: detected conn error (1011) connection2:0: ping timeout of 5 secs expired, last rx 4557800108, last ping 4557805108, now 4557810108 connection2:0: detected conn error (1011) connection6:0: ping timeout of 5 secs expired, last rx 4557800224, last ping 4557805224, now 4557810224 connection6:0: detected conn error (1011) connection5:0: ping timeout of 5 secs expired, last rx 4557800231, last ping 4557805231, now 4557810234 connection5:0: detected conn error (1011) iser: iser_cma_handler:event 10 conn ffff8104dc482ec0 id ffff8105b1502800 iser: iser_free_ib_conn_res:freeing conn ffff8104dc482ec0 cma_id ffff8105b1502800 fmr pool ffff8105f9b555c0 qp ffff81027b346000 iser: iser_device_try_release:device ffff81081bb81640 refcount 5 iser: iscsi_iser_ep_disconnect:ib conn ffff8104dc4824c0 state 2 iser: iser_cq_tasklet_fn:comp w. error op 0 status 5 iser: iser_cq_tasklet_fn:comp w. error op 0 status 5 iser: iser_cq_tasklet_fn:comp w. error op 0 status 5 iser: iser_cq_tasklet_fn:comp w. error op 0 status 5 iser: iser_cq_tasklet_fn:comp w. error op 0 status 5 session3: session recovery timed out after 120 secs iser: iser_cma_handler:event 10 conn ffff8104dc4824c0 id ffff8107df2ce400 iser: iser_free_ib_conn_res:freeing conn ffff8104dc4824c0 cma_id ffff8107df2ce400 fmr pool ffff8107f60c90c0 qp ffff8107720d2400 iser: iser_device_try_release:device ffff81081bb81640 refcount 4 iser: iscsi_iser_ep_disconnect:ib conn ffff81081fa521c0 state 2 iser: iser_cq_tasklet_fn:comp w. error op 0 status 5 iser: iser_cq_tasklet_fn:comp w. error op 0 status 5 iser: iser_cq_tasklet_fn:comp w. error op 0 status 5 iser: iser_cq_tasklet_fn:comp w. error op 0 status 5 iser: iser_cq_tasklet_fn:comp w. 
error op 0 status 5 session4: session recovery timed out after 120 secs From donald.e.wood at intel.com Fri Apr 10 14:31:47 2009 From: donald.e.wood at intel.com (Don Wood) Date: Fri, 10 Apr 2009 16:31:47 -0500 Subject: [ofa-general] [PATCH V2] RDMA/nes: Physical memory registration is incorrect Message-ID: <20090410213147.GA3736@dewood-MOBL> The code incorrectly failed memory registration if the buffer was not page aligned. Also, the length field is mangled, causing the hardware to think the registration is much larger than it really is. The fix is to remove the page alignment restriction as well as the incorrect length adjustment. Signed-off-by: Don Wood --- V2 Change: Check that the virtual address starting offset matches the physical address starting offset. Make sure each subsequent page is page aligned. Check that every page but the last extends to the end of the page.
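The three V2 checks listed above can be sketched in isolation as follows. This is a standalone illustration and not the driver code: the `sketch_` names, the fixed 4 KiB page size and the rejection of an empty list are assumptions made for the example, while the three comparisons mirror the checks the change notes describe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; the kernel's PAGE_SIZE/PAGE_MASK depend on the arch. */
#define SKETCH_PAGE_SIZE 4096ULL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/* Hypothetical stand-in for one entry of the physical buffer list. */
struct sketch_phys_buf {
	uint64_t addr;
	uint64_t size;
};

/* Returns 1 if the buffer list passes the checks, 0 otherwise. */
static int sketch_phys_bufs_ok(const struct sketch_phys_buf *bufs, size_t n,
			       uint64_t iova_start)
{
	size_t i;

	assert(bufs != NULL);
	if (n == 0)
		return 0;	/* assumption: an empty list is rejected */

	/* The virtual start offset within a page must match the physical one. */
	if ((bufs[0].addr ^ iova_start) & ~SKETCH_PAGE_MASK)
		return 0;

	for (i = 0; i < n; i++) {
		/* A zero-length buffer is never valid. */
		if (bufs[i].size == 0)
			return 0;
		/* Every buffer after the first must start on a page boundary. */
		if (i != 0 && (bufs[i].addr & ~SKETCH_PAGE_MASK))
			return 0;
		/* Every buffer before the last must end on a page boundary. */
		if (i != n - 1 &&
		    ((bufs[i].addr + bufs[i].size) & ~SKETCH_PAGE_MASK))
			return 0;
	}
	return 1;
}
```

Only the first buffer may start mid-page and only the last may end mid-page, which is what lets an unaligned region be registered without gaps between pages.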
drivers/infiniband/hw/nes/nes_verbs.c | 27 +++++++++++++-------------- 1 files changed, 13 insertions(+), 14 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c index 9279d05..f04bb1a 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.c +++ b/drivers/infiniband/hw/nes/nes_verbs.c @@ -2122,6 +2122,7 @@ static struct ib_mr *nes_reg_phys_mr(struct ib_pd *ib_pd, struct nes_root_vpbl root_vpbl; u32 stag; u32 i; + unsigned long mask; u32 stag_index = 0; u32 next_stag_index = 0; u32 driver_key = 0; @@ -2150,6 +2151,9 @@ static struct ib_mr *nes_reg_phys_mr(struct ib_pd *ib_pd, return ERR_PTR(-E2BIG); } + if ((buffer_list[0].addr ^ *iova_start) & ~PAGE_MASK) + return ERR_PTR(-EINVAL); + err = nes_alloc_resource(nesadapter, nesadapter->allocated_mrs, nesadapter->max_mr, &stag_index, &next_stag_index); if (err) { @@ -2215,19 +2219,16 @@ static struct ib_mr *nes_reg_phys_mr(struct ib_pd *ib_pd, root_pbl_index++; cur_pbl_index = 0; } - if (buffer_list[i].addr & ~PAGE_MASK) { - /* TODO: Unwind allocated buffers */ - nes_free_resource(nesadapter, nesadapter->allocated_mrs, stag_index); - nes_debug(NES_DBG_MR, "Unaligned Memory Buffer: 0x%x\n", - (unsigned int) buffer_list[i].addr); - ibmr = ERR_PTR(-EINVAL); - kfree(nesmr); - goto reg_phys_err; - } - if (!buffer_list[i].size) { + mask = !buffer_list[i].size; + if (i != 0) + mask |= buffer_list[i].addr; + if (i != num_phys_buf - 1) + mask |= buffer_list[i].addr + buffer_list[i].size; + + if (mask & ~PAGE_MASK) { nes_free_resource(nesadapter, nesadapter->allocated_mrs, stag_index); - nes_debug(NES_DBG_MR, "Invalid Buffer Size\n"); + nes_debug(NES_DBG_MR, "Invalid buffer addr or size\n"); ibmr = ERR_PTR(-EINVAL); kfree(nesmr); goto reg_phys_err; @@ -2238,7 +2239,7 @@ static struct ib_mr *nes_reg_phys_mr(struct ib_pd *ib_pd, if ((buffer_list[i-1].addr+PAGE_SIZE) != buffer_list[i].addr) single_page = 0; } - vpbl.pbl_vbase[cur_pbl_index].pa_low = cpu_to_le32((u32)buffer_list[i].addr); 
+ vpbl.pbl_vbase[cur_pbl_index].pa_low = cpu_to_le32((u32)buffer_list[i].addr & PAGE_MASK); vpbl.pbl_vbase[cur_pbl_index++].pa_high = cpu_to_le32((u32)((((u64)buffer_list[i].addr) >> 32))); } @@ -2251,8 +2252,6 @@ static struct ib_mr *nes_reg_phys_mr(struct ib_pd *ib_pd, " length = 0x%016lX, index = 0x%08X\n", stag, (unsigned long)*iova_start, (unsigned long)region_length, stag_index); - region_length -= (*iova_start)&PAGE_MASK; - /* Make the leaf PBL the root if only one PBL */ if (root_pbl_index == 1) { root_vpbl.pbl_pbase = vpbl.pbl_pbase; -- 1.5.3.3 From worleys at gmail.com Fri Apr 10 14:34:47 2009 From: worleys at gmail.com (Chris Worley) Date: Fri, 10 Apr 2009 15:34:47 -0600 Subject: [ofa-general] Re: iSer tuning guide? In-Reply-To: References: Message-ID: It looks like it's the target side at fault. The target installed is the one pointed to by the OFED wiki: http://www.voltaire.com/ftp/support-products/source/stgt/scsi-target-utils-0.1-20080828.x86_64.rpm ... it's mind-boggling to believe that any OFED component that's 8 months old would be compatible with the latest release... but that directory isn't browsable, so it's hard to tell if there is something newer. Most of the time, if I restart the tgtd and opensm/openibd on the target system, then restart just iscsi on the initiators, I can get iser working again after it hangs... but sometimes the tgtd gets into a state where it segfaults at launch, and I have to reboot the target machine before things work again. It looks like it works great with one initiator and one target... any more than that and the connections hang and the target needs to be restarted. I can run "fio" with one thread against two targets, and it's okay... more than one thread, and it hangs. Multiple initiators (separate systems) work as long as the access is serial... access two different targets, or the same target, from two different initiators simultaneously, and it hangs.
I have seen it crash the initiator also, but no console to see what the crash was. So, I guess I'll just start trying recent OFED versions until I find a stable iser. If anybody can give me a hint as to what might be both stable and perform well, it would be appreciated. Thanks, Chris From vlad at lists.openfabrics.org Sat Apr 11 03:21:09 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 11 Apr 2009 03:21:09 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090411-0200 daily build status Message-ID: <20090411102110.1EB97E61387@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.27 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with
linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From bs_lists at aakef.fastmail.fm Sat Apr 11 13:05:11 2009 From: bs_lists at aakef.fastmail.fm (Bernd Schubert) Date: Sat, 11 Apr 2009 22:05:11 +0200 Subject: [ofa-general] ib_mthca 0000:0d:00.0: Async event 16 for bogus QP 00da0407 In-Reply-To: <20090406105424.GC6165@cefeid.wcss.wroc.pl> References: <200904022007.20630.bs_lists@aakef.fastmail.fm> <20090406105424.GC6165@cefeid.wcss.wroc.pl> Message-ID: <200904112205.12254.bs_lists@aakef.fastmail.fm> Hello Pawel, sorry for my late reply. On Monday 06 April 2009, Pawel Dziekonski wrote: > On Thu, 02 Apr 2009 at 08:07:20PM +0200, Bernd Schubert wrote: > > Hello, > > > > I'm fighting (as usual) with some Lustre problems and I think this time > > it is IB related. 
In the logs of some systems I see messages like these: > > > > ib_mthca 0000:0d:00.0: Async event 16 for bogus QP 00da0407 > > > > Does anyone know what that means? The kernel modules are from > > OFED-1.3.1. > > Hi Bernd, > > we are also using 1.3.1 and Lustre, as you have seen recently at our > site ;-) > > I'm getting messages like these only when large computing jobs are > running using IPoIB. I believe that this is an issue with send/receive > buffers, because I see dropped packets on the IPoIB iface. Those jobs > usually work fine (usually because this app is buggy itself) so I find > those messages rather harmless. Just out of interest, which applications are using IPoIB? Cheers, Bernd From bs_lists at aakef.fastmail.fm Sat Apr 11 13:16:00 2009 From: bs_lists at aakef.fastmail.fm (Bernd Schubert) Date: Sat, 11 Apr 2009 22:16:00 +0200 Subject: [ofa-general] ib_mthca 0000:0d:00.0: Async event 16 for bogus QP 00da0407 In-Reply-To: <49D9E1DB.5050502@mellanox.co.il> References: <200904022007.20630.bs_lists@aakef.fastmail.fm> <49D9E1DB.5050502@mellanox.co.il> Message-ID: <200904112216.00725.bs_lists@aakef.fastmail.fm> On Monday 06 April 2009, Tziporet Koren wrote: > Bernd Schubert wrote: > > Hello, > > > > I'm fighting (as usual) with some Lustre problems and I think this time > > it is IB related. In the logs of some systems I see messages like these: > > > > ib_mthca 0000:0d:00.0: Async event 16 for bogus QP 00da0407 > > This message means the driver got an asynchronous event from the HW for > a QP that was already closed. Sorry for my late reply, I had been busy with too many issues. Somewhere I have the IB specs Erez gave me some time ago; I really need to find the time to read them (the main issue is that they have too many pages to print, and the pdf looks horrible on my ebook reader, so they are not really suitable for trains or airplanes...).
So at http://www.oreillynet.com/pub/a/network/2002/02/04/windows.html?page=2 I see a QP is a queue pair, which is used for communication between two systems. So this message means the application closed the connection, but there was still something queued (e.g. on the other side), and the hardware received it after the connection had already been closed on this host? Btw, in the meantime I already figured out the reason for our IB problem - clients ran out of memory and that somehow caused IB issues, I'm going to send another mail about that. Thanks, Bernd From bs_lists at aakef.fastmail.fm Sat Apr 11 13:33:50 2009 From: bs_lists at aakef.fastmail.fm (Bernd Schubert) Date: Sat, 11 Apr 2009 22:33:50 +0200 Subject: [ofa-general] mlx4: errors and failures on OOM Message-ID: <200904112233.51105.bs_lists@aakef.fastmail.fm> Hello, last week I had issues with Lustre failures, which turned out to be failures of many clients that ran out of memory due to bad user space jobs (and no protection against that by the queuing system). Anyway, I don't think IB is supposed to fail when the oom killer activates. Errors for 0x001b0d0000008ede "Cisco Switch" 5: [XmtDiscards == 270] Link info: 38 5[ ] ==( 4X 5.0 Gbps)==> 0x00188b9097fe2a81 1[ ] "eul0605 HCA-1" 16: [XmtDiscards == 132] Link info: 38 16[ ] ==( 4X 5.0 Gbps)==> 0x00188b9097fe2a01 1[ ] "eul0616 HCA-1" I used a script to monitor the fabric for failures every 5 min, and just when the oom killer activated on the clients the messages above came up. Below are syslogs from one of these clients: Apr 4 08:50:38 eul0605 kernel: Lustre: Request x50173 sent from MGC172.17.31.247 at o2ib to NID 172.17.31.247 at o2ib 51s ago has timed out (limit 300s). Apr 4 08:50:38 eul0605 kernel: Lustre: Skipped 30 previous similar messages Apr 4 08:50:38 eul0605 kernel: LustreError: 166-1: MGC172.17.31.247 at o2ib: Connection to service MGS via nid 172.17.31.247 at o2ib was lost; in progress operations using this service will fail.
Apr 4 08:50:38 eul0605 kernel: Lustre: home1-MDT0000-mdc-0000010430fa0800: Connection to service home1-MDT0000 via nid 172.17.31.247 at o2ib was lost; in progress operations using this service will wait for recovery to complete. Apr 4 08:50:38 eul0605 kernel: Lustre: Skipped 7 previous similar messages Apr 4 08:50:38 eul0605 kernel: Lustre: tmp-OST0003-osc-0000010423750000: Connection to service tmp-OST0003 via nid 172.17.31.231 at o2ib was lost; in progress operations using this service will wait for recovery to complete. Apr 4 08:50:38 eul0605 kernel: Lustre: Skipped 29 previous similar messages Apr 4 08:50:38 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 req at 000001041bcbb800 x50205/t0 o250->MGS at 172.17.31.247@o2ib:26/25 lens 304/456 e 0 to 1 dl 1238828031 ref 2 fl Rpc:N/0/0 rc 0/0 Apr 4 08:50:38 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) Skipped 31 previous similar messages Apr 4 08:50:38 eul0605 kernel: Lustre: Request x50205 sent from MGC172.17.31.247 at o2ib to NID 172.17.31.247 at o2ib 51s ago has timed out (limit 300s). ===> So somehow lustre lost the network connection. On the server side the logs simply show this node didn't answer to pings anymore. Apr 4 08:52:58 eul0605 kernel: Lustre: Skipped 31 previous similar messages Apr 4 08:52:59 eul0605 kernel: Lustre: Changing connection for MGC172.17.31.247 at o2ib to MGC172.17.31.247 at o2ib_1/172.17.31.246 at o2ib Apr 4 08:52:59 eul0605 kernel: Lustre: Skipped 61 previous similar messages Apr 4 08:53:00 eul0605 kernel: oom-killer: gfp_mask=0xd2 [...] Apr 4 08:53:05 eul0605 kernel: Out of Memory: Killed process 10612 (gamos). Apr 4 08:53:10 eul0605 kernel: 3212 pages swap cached Apr 4 08:53:10 eul0605 kernel: Out of Memory: Killed process 10292 (tcsh). ===> And here we see, gamos consumed all memory again. 
Apr 4 08:53:10 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 req at 0000010430f8f800 x50237/t0 o250->MGS at MGC172.17.31.247@o2ib_1:26/25 lens 304/456 e 0 to 1 dl 1238828107 ref 2 fl Rpc:N/0/0 rc 0/0 Apr 4 08:53:10 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) Skipped 31 previous similar messages Apr 4 08:53:10 eul0605 kernel: Lustre: Request x50237 sent from MGC172.17.31.247 at o2ib to NID 172.17.31.246 at o2ib 50s ago has timed out (limit 300s). Apr 4 08:53:10 eul0605 kernel: Lustre: Skipped 31 previous similar messages Apr 4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 Apr 4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 Apr 4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 Apr 4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 Apr 4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 Apr 4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 Apr 4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 Apr 4 08:53:11 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 Apr 4 08:53:11 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 ===> So we see the reason why Lustre lost network connection - infiniband is down. In most cases IB recovers from that situation, not always. If it then entirely fails, ibnetdiscover or ibclearerrors will report that can't resolve the route to these nodes. This with drivers from ofed-1.3.1. Any ideas why OOM causes issues with IB? 
Thanks, Bernd From worleys at gmail.com Sat Apr 11 17:55:14 2009 From: worleys at gmail.com (Chris Worley) Date: Sat, 11 Apr 2009 18:55:14 -0600 Subject: Re: [ofa-general] Re: iSer tuning guide? Message-ID: On Fri, Apr 10, 2009 at 3:34 PM, Chris Worley wrote: > It looks like it's the target side at fault. The target installed is > the one pointed to by the OFED wiki: > > http://www.voltaire.com/ftp/support-products/source/stgt/scsi-target-utils-0.1-20080828.x86_64.rpm > > ... it's mind-boggling to believe that any OFED component that's 8 > months old would be compatible with the latest release... but that > directory isn't browsable, so it's hard to tell if there is something > newer. The OFED iSer wiki does say "1.4 includes iSer target support"... so rather than using the rpm shown above, or the install ofed.conf also referred to by the twiki, I configured OFED w/ tgt (somebody w/ permission should fix the wiki). It created two conflicting RPMs: scsi-target-utils-0.1-20080828.x86_64.rpm and tgt-0.1-20080828.x86_64.rpm ... both had the same issues w/ iSer as previously reported (one target max). Note that for discovery I use: iscsi_discovery -t iser -f -l If I use "tcp" instead of "iser", then multiple targets work, but <400MB/s... over Quad IB w/ drives that get >GB/s. In looking around the web and at other mailing lists, it looks like iSer is still in its infancy and there is no reliable IB implementation, which would be exemplified by this one-sided conversation. Chris From sashak at voltaire.com Sun Apr 12 00:23:20 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 12 Apr 2009 10:23:20 +0300 Subject: [ofa-general] Re: [PATCH] Fix further bugs around console closure and clean up code. In-Reply-To: <20090407171350.7582ce84.weiny2@llnl.gov> References: <20090407171350.7582ce84.weiny2@llnl.gov> Message-ID: <20090412072320.GB7664@sk> On 17:13 Tue 07 Apr , Ira Weiny wrote: > > This patch fixes all this by removing console_close, making cio_close close > only the connection, and fixing osm_console_exit to properly clean up from > osm_console_init. > > Signed-off-by: Ira Weiny This patch in its original form breaks the OpenSM build with --disable-console-socket (which is the default now).
So applied with such changes (also clean unused p_log variable warning): diff --git a/opensm/include/opensm/osm_console_io.h b/opensm/include/opensm/osm_console_io.h index d1dbbdd..b51cbf7 100644 --- a/opensm/include/opensm/osm_console_io.h +++ b/opensm/include/opensm/osm_console_io.h @@ -85,6 +85,8 @@ int is_console_enabled(osm_subn_opt_t *p_opt); int cio_open(osm_console_t * p_oct, int new_fd, osm_log_t * p_log); int cio_close(osm_console_t * p_oct, osm_log_t * p_log); int is_authorized(osm_console_t * p_oct); +#else +#define cio_close(c, log) #endif END_C_DECLS diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c index 182c64e..00264e5 100644 --- a/opensm/opensm/osm_console.c +++ b/opensm/opensm/osm_console.c @@ -1385,7 +1385,6 @@ int osm_console(osm_opensm_t * p_osm) struct pollfd *fds; nfds_t nfds; osm_console_t *p_oct = &p_osm->console; - osm_log_t *p_log = &p_osm->log; pollfd[0].fd = p_oct->socket; pollfd[0].events = POLLIN; @@ -1418,7 +1417,7 @@ int osm_console(osm_opensm_t * p_osm) socklen_t len = sizeof(sin); struct hostent *hent; if ((new_fd = accept(p_oct->socket, &sin, &len)) < 0) { - OSM_LOG(p_log, OSM_LOG_ERROR, + OSM_LOG(&p_osm->log, OSM_LOG_ERROR, "ERR 4B04: Failed to accept console socket: %s\n", strerror(errno)); p_oct->in_fd = -1; @@ -1437,9 +1436,9 @@ int osm_console(osm_opensm_t * p_osm) snprintf(p_oct->client_hn, 128, "%s", hent->h_name); } if (is_authorized(p_oct)) { - cio_open(p_oct, new_fd, p_log); + cio_open(p_oct, new_fd, &p_osm->log); } else { - OSM_LOG(p_log, OSM_LOG_ERROR, + OSM_LOG(&p_osm->log, OSM_LOG_ERROR, "ERR 4B05: Console connection denied: %s (%s)\n", p_oct->client_hn, p_oct->client_ip); close(new_fd); @@ -1459,7 +1458,7 @@ int osm_console(osm_opensm_t * p_osm) osm_console_prompt(p_oct->out); } } else - cio_close(p_oct, p_log); + cio_close(p_oct, &p_osm->log); if (p_line) free(p_line); return 0; @@ -1469,7 +1468,7 @@ int osm_console(osm_opensm_t * p_osm) #ifdef ENABLE_OSM_CONSOLE_SOCKET /* If we are using a 
socket, we close the current connection */ if (p_oct->socket >= 0) { - cio_close(p_oct, p_log); + cio_close(p_oct, &p_osm->log); return 0; } #endif Thanks, Sasha From sashak at voltaire.com Sun Apr 12 00:58:07 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 12 Apr 2009 10:58:07 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] infiniband-diags/vendstat: Update man page and examples for PortXmit/RcvDataSL counter support In-Reply-To: <20090408150444.GA24876@comcast.net> References: <20090408150444.GA24876@comcast.net> Message-ID: <20090412075807.GD7664@sk> On 11:04 Wed 08 Apr , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Sun Apr 12 02:25:02 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 12 Apr 2009 12:25:02 +0300 Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH] ibsim: Add SMSL support to PortInfo attribute In-Reply-To: References: <20090324182510.GA18072@comcast.net> <20090324182213.GF20085@sashak.voltaire.com> Message-ID: <20090412092502.GG7664@sk> On 14:39 Tue 24 Mar , Hal Rosenstock wrote: > > > > What is a purpose of this? Do you have any plans to use this field? > > > > If no, I don't see what this patch adds - SMSL is handled already as part > > of PortInfo buffer. > > It's needed when SMSL is not 0 (e.g. Line's recent patch for lash). OK, I see. Actually the problem is that in do_portinfo() the received PortInfo is not copied to the target port's PortInfo (as I thought); only selected fields are updated. Wouldn't it be better to rework it so that we don't need to store PortInfo values that are useless for the simulator as separate port structure fields? Then the incoming PortInfo buffer would simply be copied (taking care of the special fields, of course: states, RO, etc.).
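The "copy the whole buffer, but take care of special fields" idea can be sketched roughly as below. This is only an illustration and not ibsim code: PORTINFO_SIZE, portinfo_apply_set() and the per-bit read-only mask are all hypothetical, and the real do_portinfo() would still need dedicated handling for state transitions.

```c
#include <stddef.h>
#include <stdint.h>

#define PORTINFO_SIZE 64	/* assumed buffer size, for the sketch only */

/*
 * Copy an incoming PortInfo buffer over the stored one, but leave every
 * bit that is set in ro_mask untouched (read-only or specially handled
 * fields). All three buffers are PORTINFO_SIZE bytes long.
 */
static void portinfo_apply_set(uint8_t *stored, const uint8_t *incoming,
			       const uint8_t *ro_mask)
{
	size_t i;

	for (i = 0; i < PORTINFO_SIZE; i++)
		stored[i] = (uint8_t)((stored[i] & ro_mask[i]) |
				      (incoming[i] & ~ro_mask[i]));
}
```

With a mask like this, a field such as SMSL needs no separate member in the port structure: it is stored simply because its bits are not masked out.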
Sasha From vlad at lists.openfabrics.org Sun Apr 12 03:22:24 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 12 Apr 2009 03:22:24 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090412-0200 daily build status Message-ID: <20090412102225.166FEE61154@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.21.1 Passed on 
ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From ishai at mellanox.co.il Sun Apr 12 06:58:22 2009 From: ishai at mellanox.co.il (Ishai Rabinovitz) Date: Sun, 12 Apr 2009 16:58:22 +0300 Subject: [ofa-general] RE: [PATCH] When no DM ports are found return an error In-Reply-To: <20090409170133.7c4a747c.weiny2@llnl.gov> References: <20090409170133.7c4a747c.weiny2@llnl.gov> Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD0250FD25@mtlexch01.mtl.com> Hi Ira, Good catch, thank you. I applied it, but I changed the return value to be 0, because we do not want to end the execution. So the final patch is: commit d89490a22f0b781a505311bfa53f0acca983e097 Author: Ishai Rabinovitz Date: Sun Apr 12 11:54:56 2009 +0300 Fix a bug when there is no target in the fabric Signed-off-by: Ira Weiny Signed-off-by: Ishai Rabinovitz diff --git a/srp_daemon/srp_daemon.c b/srp_daemon/srp_daemon.c index 5e1e198..14087bd 100644 --- a/srp_daemon/srp_daemon.c +++ b/srp_daemon/srp_daemon.c @@ -943,6 +943,13 @@ static int do_dm_port_list(struct resources *res) size = ib_get_attr_size(in_sa_mad->attr_offset); + if (!size) { + if (config->verbose) { + printf("Query did not find any targets\n"); + } + return 0; + } + for (i = 0; (i + 1) * size <= len - MAD_RMPP_HDR_SIZE; ++i) { port_info = (void *) in_sa_mad->data + i * size; Ishai > -----Original Message----- > From: Ira Weiny [mailto:weiny2 at llnl.gov] > Sent: Friday, April 10, 2009 3:02 AM > To: Ishai Rabinovitz > Cc: OpenFabrics; D. Marc Stearman > Subject: [PATCH] When no DM ports are found return an error > > I think I have found a bug in srp_daemon when running on a fabric with > no SRP targets. > > The patch below fixes the issue. This is when running against OpenSM > 3.2.5. 
Without this fix the daemon goes into an endless loop looking > for the node with lid 0! > > Ira > > > From: Ira Weiny > Date: Thu, 9 Apr 2009 16:41:58 -0700 > Subject: [PATCH] When no DM ports are found return an error. > > > Signed-off-by: Ira Weiny > --- > srp_daemon/srp_daemon.c | 3 +++ > 1 files changed, 3 insertions(+), 0 deletions(-) > > diff --git a/srp_daemon/srp_daemon.c b/srp_daemon/srp_daemon.c > index 5e1e198..621678b 100644 > --- a/srp_daemon/srp_daemon.c > +++ b/srp_daemon/srp_daemon.c > @@ -943,6 +943,9 @@ static int do_dm_port_list(struct resources *res) > > size = ib_get_attr_size(in_sa_mad->attr_offset); > > + if (!size) > + return -1; > + > for (i = 0; (i + 1) * size <= len - MAD_RMPP_HDR_SIZE; ++i) { > port_info = (void *) in_sa_mad->data + i * size; > > -- > 1.5.4.5 From helight.xu at gmail.com Sun Apr 12 05:23:17 2009 From: helight.xu at gmail.com (Zhenwen Xu) Date: Sun, 12 Apr 2009 20:23:17 +0800 Subject: [ofa-general] ***SPAM*** [PATCH] fix a warning on drivers/infiniband/hw/nes/nes_cm.c:862: Message-ID: <20090412122317.GA4787@helight> Fix this warning: drivers/infiniband/hw/nes/nes_cm.c:862: warning: unused variable ‘tmp_addr’ the 'tmp_addr' is defined for debug, so it should be defined in CONFIG_INFINIBAND_NES_DEBUG >From 5f67884bcda5450807dcd080378d829628e4db1c Mon Sep 17 00:00:00 2001 From: Zhenwen Xu Date: Sun, 12 Apr 2009 20:12:18 +0800 Subject: [PATCH] fix a warning on drivers/infiniband/hw/nes/nes_cm.c:862: Signed-off-by: Zhenwen Xu --- drivers/infiniband/hw/nes/nes_cm.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c index dbd9a75..1bad93b 100644 --- a/drivers/infiniband/hw/nes/nes_cm.c +++ b/drivers/infiniband/hw/nes/nes_cm.c @@ -854,8 +854,9 @@ static struct nes_cm_listener *find_listener(struct nes_cm_core *cm_core, { unsigned long flags; struct nes_cm_listener *listen_node; +#ifdef CONFIG_INFINIBAND_NES_DEBUG __be32 tmp_addr = 
cpu_to_be32(dst_addr); - +#endif /* walk list and find cm_node associated with this session ID */ spin_lock_irqsave(&cm_core->listen_list_lock, flags); list_for_each_entry(listen_node, &cm_core->listen_list.list, list) { -- 1.5.6.5 -- --------------------------------- Zhenwen Xu - Open and Free Home Page: http://zhwen.org My Studio: http://dim4.cn From kliteyn at dev.mellanox.co.il Sun Apr 12 08:28:38 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 12 Apr 2009 18:28:38 +0300 Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c: fixing bug in indexing Message-ID: <49E208A6.3030206@dev.mellanox.co.il> Hi Sasha, Fixing bug in indexing that was introduced by one of the recent code cleanups (commit 90e3291c040ef36a3c5e1bb5f76c866b049c79d0). Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_ucast_ftree.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c index 58d1c14..fa4a0dd 100644 --- a/opensm/opensm/osm_ucast_ftree.c +++ b/opensm/opensm/osm_ucast_ftree.c @@ -1480,7 +1480,7 @@ static void fabric_make_indexing(IN ftree_fabric_t * p_ftree) new_tuple); /* add the newly discovered switch to the BFS queue */ - cl_list_insert_tail(&bfs_list, p_sw); + cl_list_insert_tail(&bfs_list, p_remote_sw); } /* Done assigning indexes to all the remote switches that are pointed by the downgoing ports. @@ -1513,7 +1513,7 @@ static void fabric_make_indexing(IN ftree_fabric_t * p_ftree) fabric_assign_tuple(p_ftree, p_remote_sw, new_tuple); /* add the newly discovered switch to the BFS queue */ - cl_list_insert_tail(&bfs_list, p_sw); + cl_list_insert_tail(&bfs_list, p_remote_sw); } /* Done assigning indexes to all the remote switches that are pointed by the upgoing ports. 
-- 1.5.1.4 From kliteyn at dev.mellanox.co.il Sun Apr 12 09:09:24 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 12 Apr 2009 19:09:24 +0300 Subject: [ofa-general] [PATCH] opensm/osm_ucast_ftree.c: lids are always handled in host order Message-ID: <49E21234.2060906@dev.mellanox.co.il> Hi Sasha, There's a mess in host vs. network order in lids handling in ftree. In vast majority of the cases lid is required to be in host order, so there are many cl_ntoh16() conversions. Fixing it to be always in host order. Signed-off-by: Yevgeny Kliteynik --- opensm/opensm/osm_ucast_ftree.c | 163 +++++++++++++++++++-------------------- 1 files changed, 78 insertions(+), 85 deletions(-) diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c index 58d1c14..38ade6d 100644 --- a/opensm/opensm/osm_ucast_ftree.c +++ b/opensm/opensm/osm_ucast_ftree.c @@ -138,8 +138,8 @@ typedef union ftree_hca_or_sw_ { typedef struct ftree_port_group_t_ { cl_map_item_t map_item; - ib_net16_t base_lid; /* base lid of the current node */ - ib_net16_t remote_base_lid; /* base lid of the remote node */ + uint16_t base_lid; /* base lid of the current node */ + uint16_t remote_base_lid; /* base lid of the remote node */ ib_net64_t port_guid; /* port guid of this port */ ib_net64_t node_guid; /* this node's guid */ uint8_t node_type; /* this node's type */ @@ -165,7 +165,7 @@ typedef struct ftree_sw_t_ { osm_switch_t *p_osm_sw; uint32_t rank; ftree_tuple_t tuple; - ib_net16_t base_lid; + uint16_t base_lid; ftree_port_group_t **down_port_groups; uint8_t down_port_groups_num; ftree_port_group_t **up_port_groups; @@ -207,7 +207,7 @@ typedef struct ftree_fabric_t_ { ftree_sw_t **leaf_switches; uint32_t leaf_switches_num; uint16_t max_cn_per_leaf; - uint16_t lft_max_lid_ho; + uint16_t lft_max_lid; boolean_t fabric_built; } ftree_fabric_t; @@ -371,8 +371,8 @@ static void port_destroy(IN ftree_port_t * p_port) ** ***************************************************/ 
-static ftree_port_group_t *port_group_create(IN ib_net16_t base_lid, - IN ib_net16_t remote_base_lid, +static ftree_port_group_t *port_group_create(IN uint16_t base_lid, + IN uint16_t remote_base_lid, IN ib_net64_t port_guid, IN ib_net64_t node_guid, IN uint8_t node_type, @@ -490,9 +490,9 @@ static void port_group_dump(IN ftree_fabric_t * p_ftree, "0x%016" PRIx64 " (0x%04x) <--> 0x%016" PRIx64 " (0x%04x)\n", size, buff, (direction == FTREE_DIRECTION_DOWN) ? "DOWN" : "UP", - cl_ntoh64(p_group->port_guid), cl_ntoh16(p_group->base_lid), + cl_ntoh64(p_group->port_guid), p_group->base_lid, cl_ntoh64(p_group->remote_port_guid), - cl_ntoh16(p_group->remote_base_lid)); + p_group->remote_base_lid); } /* port_group_dump() */ @@ -539,7 +539,8 @@ static ftree_sw_t *sw_create(IN ftree_fabric_t * p_ftree, p_sw->rank = 0xFFFFFFFF; tuple_init(p_sw->tuple); - p_sw->base_lid = osm_node_get_base_lid(p_sw->p_osm_sw->p_node, 0); + p_sw->base_lid = cl_ntoh16( + osm_node_get_base_lid(p_sw->p_osm_sw->p_node, 0)); ports_num = osm_node_get_num_physp(p_sw->p_osm_sw->p_node); p_sw->down_port_groups = @@ -631,7 +632,7 @@ static boolean_t sw_ranked(IN ftree_sw_t * p_sw) /***************************************************/ static ftree_port_group_t *sw_get_port_group_by_remote_lid(IN ftree_sw_t * p_sw, - IN ib_net16_t + IN uint16_t remote_base_lid, IN ftree_direction_t direction) @@ -658,8 +659,8 @@ static ftree_port_group_t *sw_get_port_group_by_remote_lid(IN ftree_sw_t * p_sw, /***************************************************/ static void sw_add_port(IN ftree_sw_t * p_sw, IN uint8_t port_num, - IN uint8_t remote_port_num, IN ib_net16_t base_lid, - IN ib_net16_t remote_base_lid, IN ib_net64_t port_guid, + IN uint8_t remote_port_num, IN uint16_t base_lid, + IN uint16_t remote_base_lid, IN ib_net64_t port_guid, IN ib_net64_t remote_port_guid, IN ib_net64_t remote_node_guid, IN uint8_t remote_node_type, @@ -691,17 +692,17 @@ static void sw_add_port(IN ftree_sw_t * p_sw, IN uint8_t port_num, 
/***************************************************/ -static inline cl_status_t sw_set_hops(IN ftree_sw_t * p_sw, IN uint16_t lid_ho, +static inline cl_status_t sw_set_hops(IN ftree_sw_t * p_sw, IN uint16_t lid, IN uint8_t port_num, IN uint8_t hops) { /* set local min hop table(LID) */ - return osm_switch_set_hops(p_sw->p_osm_sw, lid_ho, port_num, hops); + return osm_switch_set_hops(p_sw->p_osm_sw, lid, port_num, hops); } /***************************************************/ static int set_hops_on_remote_sw(IN ftree_port_group_t * p_group, - IN ib_net16_t target_lid, IN uint8_t hops) + IN uint16_t target_lid, IN uint8_t hops) { ftree_port_t *p_port; uint8_t i, ports_num; @@ -711,7 +712,7 @@ static int set_hops_on_remote_sw(IN ftree_port_group_t * p_group, ports_num = (uint8_t) cl_ptr_vector_get_size(&p_group->ports); for (i = 0; i < ports_num; i++) { cl_ptr_vector_at(&p_group->ports, i, (void *)&p_port); - if (sw_set_hops(p_remote_sw, cl_ntoh16(target_lid), + if (sw_set_hops(p_remote_sw, target_lid, p_port->remote_port_num, hops)) return -1; } @@ -800,7 +801,7 @@ static void hca_dump(IN ftree_fabric_t * p_ftree, IN ftree_hca_t * p_hca) static ftree_port_group_t *hca_get_port_group_by_remote_lid(IN ftree_hca_t * p_hca, - IN ib_net16_t + IN uint16_t remote_base_lid) { uint32_t i; @@ -815,8 +816,8 @@ static ftree_port_group_t *hca_get_port_group_by_remote_lid(IN ftree_hca_t * /***************************************************/ static void hca_add_port(IN ftree_hca_t * p_hca, IN uint8_t port_num, - IN uint8_t remote_port_num, IN ib_net16_t base_lid, - IN ib_net16_t remote_base_lid, IN ib_net64_t port_guid, + IN uint8_t remote_port_num, IN uint16_t base_lid, + IN uint16_t remote_base_lid, IN ib_net64_t port_guid, IN ib_net64_t remote_port_guid, IN ib_net64_t remote_node_guid, IN uint8_t remote_node_type, @@ -951,7 +952,7 @@ static void fabric_clear(ftree_fabric_t * p_ftree) p_ftree->leaf_switch_rank = 0; p_ftree->max_switch_rank = 0; p_ftree->max_cn_per_leaf = 0; - 
p_ftree->lft_max_lid_ho = 0; + p_ftree->lft_max_lid = 0; p_ftree->leaf_switches = NULL; p_ftree->fabric_built = FALSE; @@ -998,8 +999,8 @@ static void fabric_add_sw(ftree_fabric_t * p_ftree, osm_switch_t * p_osm_sw) &p_sw->map_item); /* track the max lid (in host order) that exists in the fabric */ - if (cl_ntoh16(p_sw->base_lid) > p_ftree->lft_max_lid_ho) - p_ftree->lft_max_lid_ho = cl_ntoh16(p_sw->base_lid); + if (p_sw->base_lid > p_ftree->lft_max_lid) + p_ftree->lft_max_lid = p_sw->base_lid; } /***************************************************/ @@ -1156,7 +1157,7 @@ static void fabric_dump_general_info(IN ftree_fabric_t * p_ftree) " GUID: 0x%016" PRIx64 ", LID: %u, Index %s\n", sw_get_guid_ho(p_sw), - cl_ntoh16(p_sw->base_lid), + p_sw->base_lid, tuple_to_str(p_sw->tuple)); } @@ -1167,7 +1168,7 @@ static void fabric_dump_general_info(IN ftree_fabric_t * p_ftree) " GUID: 0x%016" PRIx64 ", LID: %u, Index %s\n", sw_get_guid_ho(p_ftree->leaf_switches[i]), - cl_ntoh16(p_ftree->leaf_switches[i]->base_lid), + p_ftree->leaf_switches[i]->base_lid, tuple_to_str(p_ftree->leaf_switches[i]->tuple)); } } @@ -1222,7 +1223,7 @@ static void fabric_dump_hca_ordering(IN ftree_fabric_t * p_ftree) continue; fprintf(p_hca_ordering_file, "0x%04x\t%s\n", - cl_ntoh16(p_group_on_hca->base_lid), + p_group_on_hca->base_lid, p_hca->p_osm_node->print_desc); printed_hcas_on_leaf++; @@ -1426,7 +1427,7 @@ static void fabric_make_indexing(IN ftree_fabric_t * p_ftree) " - Node LID : %u\n" " - Node GUID : 0x%016" PRIx64 "\n", p_sw->rank, tuple_to_str(p_sw->tuple), - cl_ntoh16(p_sw->base_lid), sw_get_guid_ho(p_sw)); + p_sw->base_lid, sw_get_guid_ho(p_sw)); /* * Now run BFS and assign indexes to all switches @@ -1697,14 +1698,13 @@ static boolean_t fabric_validate_topology(IN ftree_fabric_t * p_ftree) ", LID %u, Index %s - %u groups\n", sw_get_guid_ho (reference_sw_arr[p_sw->rank]), - cl_ntoh16(reference_sw_arr[p_sw->rank]-> - base_lid), + reference_sw_arr[p_sw->rank]->base_lid, tuple_to_str 
(reference_sw_arr[p_sw->rank]->tuple), reference_sw_arr[p_sw->rank]-> up_port_groups_num, sw_get_guid_ho(p_sw), - cl_ntoh16(p_sw->base_lid), + p_sw->base_lid, tuple_to_str(p_sw->tuple), p_sw->up_port_groups_num); res = FALSE; @@ -1724,14 +1724,13 @@ static boolean_t fabric_validate_topology(IN ftree_fabric_t * p_ftree) ", LID %u, Index %s - %u port groups\n", sw_get_guid_ho (reference_sw_arr[p_sw->rank]), - cl_ntoh16(reference_sw_arr[p_sw->rank]-> - base_lid), + reference_sw_arr[p_sw->rank]->base_lid, tuple_to_str (reference_sw_arr[p_sw->rank]->tuple), reference_sw_arr[p_sw->rank]-> down_port_groups_num, sw_get_guid_ho(p_sw), - cl_ntoh16(p_sw->base_lid), + p_sw->base_lid, tuple_to_str(p_sw->tuple), p_sw->down_port_groups_num); res = FALSE; @@ -1761,10 +1760,9 @@ static boolean_t fabric_validate_topology(IN ftree_fabric_t * p_ftree) sw_get_guid_ho (reference_sw_arr [p_sw->rank]), - cl_ntoh16 - (reference_sw_arr + reference_sw_arr [p_sw->rank]-> - base_lid), + base_lid, tuple_to_str (reference_sw_arr [p_sw->rank]->tuple), @@ -1772,8 +1770,7 @@ static boolean_t fabric_validate_topology(IN ftree_fabric_t * p_ftree) (&p_ref_group->ports), sw_get_guid_ho (p_sw), - cl_ntoh16(p_sw-> - base_lid), + p_sw->base_lid, tuple_to_str (p_sw->tuple), cl_ptr_vector_get_size @@ -1808,10 +1805,9 @@ static boolean_t fabric_validate_topology(IN ftree_fabric_t * p_ftree) sw_get_guid_ho (reference_sw_arr [p_sw->rank]), - cl_ntoh16 - (reference_sw_arr + reference_sw_arr [p_sw->rank]-> - base_lid), + base_lid, tuple_to_str (reference_sw_arr [p_sw->rank]->tuple), @@ -1819,8 +1815,7 @@ static boolean_t fabric_validate_topology(IN ftree_fabric_t * p_ftree) (&p_ref_group->ports), sw_get_guid_ho (p_sw), - cl_ntoh16(p_sw-> - base_lid), + p_sw->base_lid, tuple_to_str (p_sw->tuple), cl_ptr_vector_get_size @@ -1854,7 +1849,7 @@ static void set_sw_fwd_table(IN cl_map_item_t * const p_map_item, ftree_sw_t *p_sw = (ftree_sw_t * const)p_map_item; ftree_fabric_t *p_ftree = (ftree_fabric_t *) context; - 
p_sw->p_osm_sw->max_lid_ho = p_ftree->lft_max_lid_ho; + p_sw->p_osm_sw->max_lid_ho = p_ftree->lft_max_lid; osm_ucast_mgr_set_fwd_table(&p_ftree->p_osm->sm.ucast_mgr, p_sw->p_osm_sw); } @@ -1879,7 +1874,7 @@ static boolean_t fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, IN ftree_sw_t * p_sw, IN ftree_sw_t * p_prev_sw, - IN ib_net16_t target_lid, + IN uint16_t target_lid, IN uint8_t target_rank, IN boolean_t is_real_lid, IN boolean_t is_main_path, @@ -1946,8 +1941,7 @@ fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, p_remote_sw = p_group->remote_hca_or_sw.p_sw; if (osm_switch_get_least_hops(p_remote_sw->p_osm_sw, - cl_ntoh16(target_lid)) != - OSM_NO_PATH) { + target_lid) != OSM_NO_PATH) { /* Loop in the fabric - we already routed the remote switch on our way UP, and now we see it again on our way DOWN */ OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, @@ -1955,9 +1949,9 @@ fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, "Switch %s (LID %u) closes loop through switch %s (LID %u)\n", (p_remote_sw->rank - highest_rank_in_route) * 2, tuple_to_str(p_remote_sw->tuple), - cl_ntoh16(p_group->base_lid), + p_group->base_lid, tuple_to_str(p_sw->tuple), - cl_ntoh16(p_group->remote_base_lid)); + p_group->remote_base_lid); continue; } @@ -1991,18 +1985,18 @@ fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, /* second case: skip the port group if the remote (lower) switch has been already configured for this target LID */ if (is_real_lid && !is_main_path && - p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] != + p_remote_sw->p_osm_sw->new_lft[target_lid] != OSM_NO_PATH) continue; /* setting fwd tbl port only if this is real LID */ if (is_real_lid) { - p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] = + p_remote_sw->p_osm_sw->new_lft[target_lid] = p_min_port->remote_port_num; OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "Switch %s: set path to CA LID %u through port %u\n", tuple_to_str(p_remote_sw->tuple), - 
cl_ntoh16(target_lid), + target_lid, p_min_port->remote_port_num); /* On the remote switch that is pointed by the p_group, @@ -2062,7 +2056,7 @@ fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, IN ftree_sw_t * p_sw, IN ftree_sw_t * p_prev_sw, - IN ib_net16_t target_lid, + IN uint16_t target_lid, IN uint8_t target_rank, IN boolean_t is_real_lid, IN boolean_t is_main_path, @@ -2198,7 +2192,7 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, " - Routing MAIN path for %s CA LID %u: %s --> %s\n", (is_real_lid) ? "real" : "DUMMY", - cl_ntoh16(target_lid), + target_lid, tuple_to_str(p_sw->tuple), tuple_to_str(p_remote_sw->tuple)); } @@ -2212,14 +2206,14 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, /* This LID may already be in the LFT in the reverse_hop feature is used */ /* We update the LFT only if this LID isn't already present. 
*/ if (p_remote_sw->p_osm_sw-> - new_lft[cl_ntoh16(target_lid)] == OSM_NO_PATH) { + new_lft[target_lid] == OSM_NO_PATH) { p_remote_sw->p_osm_sw-> - new_lft[cl_ntoh16(target_lid)] = + new_lft[target_lid] = p_min_port->remote_port_num; OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "Switch %s: set path to CA LID %u through port %u\n", tuple_to_str(p_remote_sw->tuple), - cl_ntoh16(target_lid), + target_lid, p_min_port->remote_port_num); } /* On the remote switch that is pointed by the min_group, @@ -2283,14 +2277,14 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, p_remote_sw = p_group->remote_hca_or_sw.p_sw; /* skip if target lid has been already set on remote switch fwd tbl */ - if (p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] != + if (p_remote_sw->p_osm_sw->new_lft[target_lid] != OSM_NO_PATH) continue; if (p_sw->is_leaf) { OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, " - Routing SECONDARY path for LID %u: %s --> %s\n", - cl_ntoh16(target_lid), + target_lid, tuple_to_str(p_sw->tuple), tuple_to_str(p_remote_sw->tuple)); } @@ -2302,7 +2296,7 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, trying to balance these routes - always pick port 0. 
*/ cl_ptr_vector_at(&p_group->ports, 0, (void *)&p_port); - p_remote_sw->p_osm_sw->new_lft[cl_ntoh16(target_lid)] = + p_remote_sw->p_osm_sw->new_lft[target_lid] = p_port->remote_port_num; /* On the remote switch that is pointed by the p_group, @@ -2376,7 +2370,7 @@ static void fabric_route_to_cns(IN ftree_fabric_t * p_ftree) ftree_port_t *p_port; uint32_t i; uint32_t j; - ib_net16_t hca_lid; + uint16_t hca_lid; unsigned routed_targets_on_leaf; OSM_LOG_ENTER(&p_ftree->p_osm->log); @@ -2417,16 +2411,16 @@ static void fabric_route_to_cns(IN ftree_fabric_t * p_ftree) /* set local LFT(LID) to the port that is connected to HCA */ cl_ptr_vector_at(&p_leaf_port_group->ports, 0, (void *)&p_port); - p_sw->p_osm_sw->new_lft[cl_ntoh16(hca_lid)] = + p_sw->p_osm_sw->new_lft[hca_lid] = p_port->port_num; OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "Switch %s: set path to CN LID %u through port %u\n", tuple_to_str(p_sw->tuple), - cl_ntoh16(hca_lid), p_port->port_num); + hca_lid, p_port->port_num); /* set local min hop table(LID) to route to the CA */ - sw_set_hops(p_sw, cl_ntoh16(hca_lid), + sw_set_hops(p_sw, hca_lid, p_port->port_num, 1); /* Assign downgoing ports by stepping up. 
@@ -2500,7 +2494,7 @@ static void fabric_route_to_non_cns(IN ftree_fabric_t * p_ftree) ftree_hca_t *p_next_hca; ftree_port_t *p_hca_port; ftree_port_group_t *p_hca_port_group; - ib_net16_t hca_lid; + uint16_t hca_lid; unsigned port_num_on_switch; unsigned i; @@ -2530,16 +2524,15 @@ static void fabric_route_to_non_cns(IN ftree_fabric_t * p_ftree) cl_ptr_vector_at(&p_hca_port_group->ports, 0, (void *)&p_hca_port); port_num_on_switch = p_hca_port->remote_port_num; - p_sw->p_osm_sw->new_lft[cl_ntoh16(hca_lid)] = - port_num_on_switch; + p_sw->p_osm_sw->new_lft[hca_lid] = port_num_on_switch; OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "Switch %s: set path to non-CN HCA LID %u through port %u\n", tuple_to_str(p_sw->tuple), - cl_ntoh16(hca_lid), port_num_on_switch); + hca_lid, port_num_on_switch); /* set local min hop table(LID) to route to the CA */ - sw_set_hops(p_sw, cl_ntoh16(hca_lid), port_num_on_switch, /* port num */ + sw_set_hops(p_sw, hca_lid, port_num_on_switch, /* port num */ 1); /* hops */ /* Assign downgoing ports by stepping up. @@ -2588,14 +2581,14 @@ static void fabric_route_to_switches(IN ftree_fabric_t * p_ftree) p_next_sw = (ftree_sw_t *) cl_qmap_next(&p_sw->map_item); /* set local LFT(LID) to 0 (route to itself) */ - p_sw->p_osm_sw->new_lft[cl_ntoh16(p_sw->base_lid)] = 0; + p_sw->p_osm_sw->new_lft[p_sw->base_lid] = 0; OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, "Switch %s (LID %u): routing switch-to-switch paths\n", - tuple_to_str(p_sw->tuple), cl_ntoh16(p_sw->base_lid)); + tuple_to_str(p_sw->tuple), p_sw->base_lid); /* set min hop table of the switch to itself */ - sw_set_hops(p_sw, cl_ntoh16(p_sw->base_lid), 0, /* port_num */ + sw_set_hops(p_sw, p_sw->base_lid, 0, /* port_num */ 0); /* hops */ fabric_route_downgoing_by_going_up(p_ftree, p_sw, /* local switch - used as a route-downgoing alg. 
start point */ @@ -2797,7 +2790,7 @@ static int rank_leaf_switches(IN ftree_fabric_t * p_ftree, PRIx64 "\n" " - Switch LID : %u\n", hca_get_guid_ho(p_hca), - sw_get_guid_ho(p_sw), cl_ntoh16(p_sw->base_lid)); + sw_get_guid_ho(p_sw), p_sw->base_lid); cl_list_insert_tail(p_ranking_bfs_list, p_sw); } @@ -2937,8 +2930,8 @@ fabric_construct_hca_ports(IN ftree_fabric_t * p_ftree, IN ftree_hca_t * p_hca) hca_add_port(p_hca, /* local ftree_hca object */ i, /* local port number */ remote_port_num, /* remote port number */ - osm_node_get_base_lid(p_node, i), /* local lid */ - osm_node_get_base_lid(p_remote_node, 0), /* remote lid */ + cl_ntoh16(osm_node_get_base_lid(p_node, i)), /* local lid */ + cl_ntoh16(osm_node_get_base_lid(p_remote_node, 0)), /* remote lid */ osm_physp_get_port_guid(p_osm_port), /* local port guid */ osm_physp_get_port_guid(p_remote_osm_port), /* remote port guid */ remote_node_guid, /* remote node guid */ @@ -2961,7 +2954,7 @@ static int fabric_construct_sw_ports(IN ftree_fabric_t * p_ftree, ftree_sw_t *p_remote_sw; osm_node_t *p_node = p_sw->p_osm_sw->p_node; osm_node_t *p_remote_node; - ib_net16_t remote_base_lid; + uint16_t remote_base_lid; uint8_t remote_node_type; ib_net64_t remote_node_guid; osm_physp_t *p_remote_osm_port; @@ -2991,7 +2984,7 @@ static int fabric_construct_sw_ports(IN ftree_fabric_t * p_ftree, "Ignoring loopback on switch GUID 0x%016" PRIx64 ", LID %u, rank %u\n", sw_get_guid_ho(p_sw), - cl_ntoh16(p_sw->base_lid), p_sw->rank); + p_sw->base_lid, p_sw->rank); continue; } @@ -3013,8 +3006,8 @@ static int fabric_construct_sw_ports(IN ftree_fabric_t * p_ftree, p_remote_hca_or_sw = (void *)p_remote_hca; direction = FTREE_DIRECTION_DOWN; - remote_base_lid = - osm_physp_get_base_lid(p_remote_osm_port); + remote_base_lid = cl_ntoh16( + osm_physp_get_base_lid(p_remote_osm_port)); break; case IB_NODE_TYPE_SWITCH: @@ -3036,9 +3029,9 @@ static int fabric_construct_sw_ports(IN ftree_fabric_t * p_ftree, ", LID %u, rank %u\n", p_sw->rank, 
p_remote_sw->rank, sw_get_guid_ho(p_sw), - cl_ntoh16(p_sw->base_lid), p_sw->rank, + p_sw->base_lid, p_sw->rank, sw_get_guid_ho(p_remote_sw), - cl_ntoh16(p_remote_sw->base_lid), + p_remote_sw->base_lid, p_remote_sw->rank); res = -1; goto Exit; @@ -3050,8 +3043,8 @@ static int fabric_construct_sw_ports(IN ftree_fabric_t * p_ftree, direction = FTREE_DIRECTION_DOWN; /* switch LID is only in port 0 port_info structure */ - remote_base_lid = - osm_node_get_base_lid(p_remote_node, 0); + remote_base_lid = cl_ntoh16( + osm_node_get_base_lid(p_remote_node, 0)); break; @@ -3077,8 +3070,8 @@ static int fabric_construct_sw_ports(IN ftree_fabric_t * p_ftree, direction); /* port direction (up or down) */ /* Track the max lid (in host order) that exists in the fabric */ - if (cl_ntoh16(remote_base_lid) > p_ftree->lft_max_lid_ho) - p_ftree->lft_max_lid_ho = cl_ntoh16(remote_base_lid); + if (remote_base_lid > p_ftree->lft_max_lid) + p_ftree->lft_max_lid = remote_base_lid; } Exit: @@ -3610,7 +3603,7 @@ static int construct_fabric(IN void *context) } OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_VERBOSE, - "Max LID in switch LFTs: %u\n", p_ftree->lft_max_lid_ho); + "Max LID in switch LFTs: %u\n", p_ftree->lft_max_lid); Exit: if (status != 0) { -- 1.5.1.4 From worleys at gmail.com Sun Apr 12 20:01:27 2009 From: worleys at gmail.com (Chris Worley) Date: Sun, 12 Apr 2009 21:01:27 -0600 Subject: [ofa-general] Any easy way to specify to the SM to route/zone? Message-ID: I have a system w/ multiple IB cards, two ports each... I'm only getting ~1.6GB/s out per port, so I need to use multiple ports. I can't use IB bonding, as the scst package that I'm using that works reliably isn't compatible w/ OFED 1.4, but it performs very well w/ the RHEL5.2 built-in drivers... but bonding isn't supported in RHEL5.2 (afaik). I figured I could use IPoIB subnets and zone specific ports to specific clients/initiators...
but the SM doesn't respect IPoIB routes (bring down a subnet's interface on the target, and the client can still ping one of the target's other interfaces, even though the client isn't configured on the same subnet). So I need to tell the SM to route specific ports on the server/target to specific clients/initiators. Is there any way to do this? Chris From bart.vanassche at gmail.com Sun Apr 12 23:32:34 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Mon, 13 Apr 2009 08:32:34 +0200 Subject: ***SPAM*** Re: [ofa-general] Any easy way to specify to the SM to route/zone? In-Reply-To: References: Message-ID: On Mon, Apr 13, 2009 at 5:01 AM, Chris Worley wrote: > I have a system w/ multipe IB cards, two ports each... I'm only > getting ~1.6GB/s out per port, so I need to use multiple ports. Is this on a QDR network ? Which tool did you use to measure throughput ? Bart. From vlad at lists.openfabrics.org Mon Apr 13 03:22:34 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 13 Apr 2009 03:22:34 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090413-0200 daily build status Message-ID: <20090413102234.C91AEE61214@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.27 Passed on i686 with linux-2.6.26 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 
Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From hal.rosenstock at gmail.com Mon Apr 13 04:39:33 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 13 Apr 2009 07:39:33 -0400 Subject: ***SPAM*** Re: [ofa-general] Any easy way to specify to the SM to route/zone? In-Reply-To: References: Message-ID: On Sun, Apr 12, 2009 at 11:01 PM, Chris Worley wrote: > I have a system w/ multipe IB cards, two ports each... I'm only > getting ~1.6GB/s out per port, so I need to use multiple ports. > > I can't use IB bonding, as the scst package that I'm using that works > reliably isn't compatible w/ OFED 1.4, but it performs very well w/ > the RHEL5.2 built-in drivers... but bonding isn't supported in RHEL5.2 > (afaik). > > I figured I could use IPoIB subnets and zone specific ports to > specific clients/initiators... 
but the SM doesn't respect IPoIB routes > (bring down a subnet's interface on the target, and the client can > still ping one of the target's other interfaces, even though the > client isn't configured on the same subnet). > > So I need to tell the SM to route specific ports on the server/target > to specific clients/initiators. > > Is there any way to do this? Do you mean restrict access between certain clients/servers ? If so, you can do this with partitioning (which will also affect your IPoIB subnets). I'm not sure what the bonding implications are of partitioning though. -- Hal > Chris From hal.rosenstock at gmail.com Mon Apr 13 04:40:33 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 13 Apr 2009 07:40:33 -0400 Subject: ***SPAM*** Re: [ofa-general] mlx4: errors and failures on OOM In-Reply-To: <200904112233.51105.bs_lists@aakef.fastmail.fm> References: <200904112233.51105.bs_lists@aakef.fastmail.fm> Message-ID: On Sat, Apr 11, 2009 at 4:33 PM, Bernd Schubert wrote: > Hello, > > last week I had issues with Lustre failures, which turned out to be > failures of many clients, which run into out-of-memory due to bad user space jobs > (and no protection again that by the queuing system). > > Anyway, I don't think IB is supposed to fail, when the oom killer activates. > > Errors for 0x001b0d0000008ede "Cisco Switch" >   5: [XmtDiscards == 270] >         Link info:     38    5[  ]  ==( 4X 5.0 Gbps)==>  0x00188b9097fe2a81    1[  ] "eul0605 HCA-1" >   16: [XmtDiscards == 132] >         Link info:     38   16[  ]  ==( 4X 5.0 Gbps)==>  0x00188b9097fe2a01    1[  ] "eul0616 HCA-1" > > I used a script to monitor the fabric for failures every 5 min and just when the oom > killer activated on the clients the messages above came up. XmtDiscards are the total number of outbound packets discarded by the port because the port is down or congested. 
Reasons for this include: • Output port is not in the active state • Packet length exceeded NeighborMTU • Switch Lifetime Limit exceeded • Switch HOQ Lifetime Limit exceeded This may also include packets discarded while in VLStalled State. > Below are syslogs from one of these clients > > Apr  4 08:50:38 eul0605 kernel: Lustre: Request x50173 sent from MGC172.17.31.247 at o2ib to NID 172.17.31.247 at o2ib 51s ago has timed out (limit > 300s). > Apr  4 08:50:38 eul0605 kernel: Lustre: Skipped 30 previous similar messages > Apr  4 08:50:38 eul0605 kernel: LustreError: 166-1: MGC172.17.31.247 at o2ib: Connection to service MGS via nid 172.17.31.247 at o2ib was lost; in > progress operations using this service will fail. > Apr  4 08:50:38 eul0605 kernel: Lustre: home1-MDT0000-mdc-0000010430fa0800: Connection to service home1-MDT0000 via nid 172.17.31.247 at o2ib was > lost; in progress operations using this service will wait for recovery to complete. > Apr  4 08:50:38 eul0605 kernel: Lustre: Skipped 7 previous similar messages > Apr  4 08:50:38 eul0605 kernel: Lustre: tmp-OST0003-osc-0000010423750000: Connection to service tmp-OST0003 via nid 172.17.31.231 at o2ib was lost; in > progress operations using this service will wait for recovery to complete. > Apr  4 08:50:38 eul0605 kernel: Lustre: Skipped 29 previous similar messages > Apr  4 08:50:38 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  req at 000001041bcbb800 x50205/t0 > o250->MGS at 172.17.31.247@o2ib:26/25 lens 304/456 e 0 to 1 dl 1238828031 ref 2 fl Rpc:N/0/0 rc 0/0 > Apr  4 08:50:38 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) Skipped 31 previous similar messages > Apr  4 08:50:38 eul0605 kernel: Lustre: Request x50205 sent from MGC172.17.31.247 at o2ib to NID 172.17.31.247 at o2ib 51s ago has timed out (limit > 300s). > > ===> So somehow lustre lost the network connection. 
On the server side the > logs simply show this node didn't answer to pings anymore. > > > Apr  4 08:52:58 eul0605 kernel: Lustre: Skipped 31 previous similar messages > Apr  4 08:52:59 eul0605 kernel: Lustre: Changing connection for MGC172.17.31.247 at o2ib to MGC172.17.31.247 at o2ib_1/172.17.31.246 at o2ib > Apr  4 08:52:59 eul0605 kernel: Lustre: Skipped 61 previous similar messages > Apr  4 08:53:00 eul0605 kernel: oom-killer: gfp_mask=0xd2 > > [...] > > Apr  4 08:53:05 eul0605 kernel: Out of Memory: Killed process 10612 (gamos). > Apr  4 08:53:10 eul0605 kernel: 3212 pages swap cached > Apr  4 08:53:10 eul0605 kernel: Out of Memory: Killed process 10292 (tcsh). > > ===> And here we see, gamos consumed all memory again. > > Apr  4 08:53:10 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  req at 0000010430f8f800 x50237/t0 > o250->MGS at MGC172.17.31.247@o2ib_1:26/25 lens 304/456 e 0 to 1 dl 1238828107 ref 2 fl Rpc:N/0/0 rc 0/0 > Apr  4 08:53:10 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) Skipped 31 previous similar messages > Apr  4 08:53:10 eul0605 kernel: Lustre: Request x50237 sent from MGC172.17.31.247 at o2ib to NID 172.17.31.246 at o2ib 50s ago has timed out (limit > 300s). 
> Apr  4 08:53:10 eul0605 kernel: Lustre: Skipped 31 previous similar messages > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > Apr  4 08:53:11 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > Apr  4 08:53:11 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 That multicast group looks like the IPv4 broadcast group; -11 is EAGAIN. I'm not sure what's causing IPoIB to indicate this but I wonder if this is a second level failure due to the previous (Lustre) error detected. -- Hal > ===> So we see the reason why Lustre lost network connection - infiniband is down. > > > In most cases IB recovers from that situation, not always. If it then entirely > fails, ibnetdiscover or ibclearerrors will report that can't resolve the route > to these nodes. > > > This with drivers from ofed-1.3.1. Any ideas why OOM causes issues with IB? 
> > > Thanks, > Bernd > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From worleys at gmail.com Mon Apr 13 06:37:42 2009 From: worleys at gmail.com (Chris Worley) Date: Mon, 13 Apr 2009 07:37:42 -0600 Subject: ***SPAM*** Re: [ofa-general] Any easy way to specify to the SM to route/zone? In-Reply-To: References: Message-ID: On Mon, Apr 13, 2009 at 5:39 AM, Hal Rosenstock wrote: > On Sun, Apr 12, 2009 at 11:01 PM, Chris Worley wrote: >> I have a system w/ multipe IB cards, two ports each... I'm only >> getting ~1.6GB/s out per port, so I need to use multiple ports. >> >> I can't use IB bonding, as the scst package that I'm using that works >> reliably isn't compatible w/ OFED 1.4, but it performs very well w/ >> the RHEL5.2 built-in drivers... but bonding isn't supported in RHEL5.2 >> (afaik). >> >> I figured I could use IPoIB subnets and zone specific ports to >> specific clients/initiators... but the SM doesn't respect IPoIB routes >> (bring down a subnet's interface on the target, and the client can >> still ping one of the target's other interfaces, even though the >> client isn't configured on the same subnet). >> >> So I need to tell the SM to route specific ports on the server/target >> to specific clients/initiators. >> >> Is there any way to do this? > > Do you mean restrict access between certain clients/servers ? One server w/ 4QDR boards, 16 clients with one QDR board. I want each port on the server routed/zoned to two clients. > If so, > you can do this with partitioning What is partitioning? > (which will also affect your IPoIB > subnets). I don't need the subnets... I was trying to use them to effect routing, which didn't work. > I'm not sure what the bonding implications are of > partitioning though. I can't use bonding w/ the RHEL5.2 IB drivers. 
Thanks, Chris > -- Hal > >> Chris > From hal.rosenstock at gmail.com Mon Apr 13 06:43:36 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 13 Apr 2009 09:43:36 -0400 Subject: ***SPAM*** Re: [ofa-general] Any easy way to specify to the SM to route/zone? In-Reply-To: References: Message-ID: On Mon, Apr 13, 2009 at 9:37 AM, Chris Worley wrote: > On Mon, Apr 13, 2009 at 5:39 AM, Hal Rosenstock > wrote: >> On Sun, Apr 12, 2009 at 11:01 PM, Chris Worley wrote: >>> I have a system w/ multipe IB cards, two ports each... I'm only >>> getting ~1.6GB/s out per port, so I need to use multiple ports. >>> >>> I can't use IB bonding, as the scst package that I'm using that works >>> reliably isn't compatible w/ OFED 1.4, but it performs very well w/ >>> the RHEL5.2 built-in drivers... but bonding isn't supported in RHEL5.2 >>> (afaik). >>> >>> I figured I could use IPoIB subnets and zone specific ports to >>> specific clients/initiators... but the SM doesn't respect IPoIB routes >>> (bring down a subnet's interface on the target, and the client can >>> still ping one of the target's other interfaces, even though the >>> client isn't configured on the same subnet). >>> >>> So I need to tell the SM to route specific ports on the server/target >>> to specific clients/initiators. >>> >>> Is there any way to do this? >> >> Do you mean restrict access between certain clients/servers ? > > One server w/ 4QDR boards, 16 clients with one QDR board.  I want each > port on the server routed/zoned to two clients. > >> If so, >> you can do this with partitioning > > What is partitioning? A partition is a collection of ports which are allowed to communicate together. There are two forms of members: full members which can talk to any other member (useful for servers) and limited members which can only talk to full members (useful for clients). See the opensm man page or partition-config.txt on setting this up for OpenSM. -- Hal >> (which will also affect your IPoIB >> subnets). 
> > I don't need the subnets... I was trying to use them to effect > routing, which didn't work. > >> I'm not sure what the bonding implications are of >> partitioning though. > > I can't use bonding w/ the RHEL5.2 IB drivers. > > Thanks, > > Chris >> -- Hal >> >>> Chris >> > From sashak at voltaire.com Mon Apr 13 06:54:34 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 13 Apr 2009 16:54:34 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] opensm/osm_ucast_ftree.c: fixing bug in indexing In-Reply-To: <49E208A6.3030206@dev.mellanox.co.il> References: <49E208A6.3030206@dev.mellanox.co.il> Message-ID: <20090413135434.GB5521@sk> On 18:28 Sun 12 Apr , Yevgeny Kliteynik wrote: > Hi Sasha, > > Fixing bug in indexing that was introduced by one of the > recent code cleanups (commit 90e3291c040ef36a3c5e1bb5f76c866b049c79d0). > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. Sasha From subbukl at gmail.com Mon Apr 13 07:04:57 2009 From: subbukl at gmail.com (subbu kl) Date: Mon, 13 Apr 2009 19:34:57 +0530 Subject: [ofa-general] ***SPAM*** softIB on KVM Message-ID: Hi, the following presentation talks about a faster interconnect based on virtual Infiniband device between VM in se of Xen I am wondering whether is was released for testing ever, I am doing some experiments with KVM and would be interesting to see how it performs on KVM. www.openfabrics.org/archives/spring2007sonoma/Monday%20April%2030/Xiong%20OFA-Sonoma-2007-04-30- *SoftIB*.ppt -- ~subbu -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From chien.tin.tung at intel.com Mon Apr 13 08:28:41 2009 From: chien.tin.tung at intel.com (Chien Tung) Date: Mon, 13 Apr 2009 10:28:41 -0500 Subject: [ofa-general] Subject: [PATCH] RDMA/nes: Update iw_nes version Message-ID: <20090413152841.GA3648@ctung-MOBL> Update version number to 1.5.0.0 Signed-off-by: Chien Tung --- drivers/infiniband/hw/nes/nes.h | 4 +--- 1 files changed, 1 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes.h b/drivers/infiniband/hw/nes/nes.h index 17621de..bf1720f 100644 --- a/drivers/infiniband/hw/nes/nes.h +++ b/drivers/infiniband/hw/nes/nes.h @@ -56,10 +56,8 @@ #define QUEUE_DISCONNECTS -#define DRV_BUILD "1" - #define DRV_NAME "iw_nes" -#define DRV_VERSION "1.0 KO Build " DRV_BUILD +#define DRV_VERSION "1.5.0.0" #define PFX DRV_NAME ": " /* -- 1.5.3.3 From worleys at gmail.com Mon Apr 13 09:02:28 2009 From: worleys at gmail.com (Chris Worley) Date: Mon, 13 Apr 2009 10:02:28 -0600 Subject: ***SPAM*** Re: [ofa-general] Any easy way to specify to the SM to route/zone? In-Reply-To: References: Message-ID: On Mon, Apr 13, 2009 at 7:43 AM, Hal Rosenstock wrote: > On Mon, Apr 13, 2009 at 9:37 AM, Chris Worley wrote: >> On Mon, Apr 13, 2009 at 5:39 AM, Hal Rosenstock >> wrote: >>> On Sun, Apr 12, 2009 at 11:01 PM, Chris Worley wrote: >>>> >>>> So I need to tell the SM to route specific ports on the server/target >>>> to specific clients/initiators. >>>> >>>> Is there any way to do this? >>> >>> Do you mean restrict access between certain clients/servers ? >> >> One server w/ 4QDR boards, 16 clients with one QDR board.  I want each >> port on the server routed/zoned to two clients. >> >>> If so, >>> you can do this with partitioning >> >> What is partitioning? > > A partition is a collection of ports which are allowed to communicate > together. 
There are two forms of members: full members which can talk
> to any other member (useful for servers) and limited members which can
> only talk to full members (useful for clients). See the opensm man
> page or partition-config.txt on setting this up for OpenSM.
>

Let me see if I understand this with a simple example... my port GUIDs
(as reported by ibstat) are for one server (4 QDR ports) and four
clients (one QDR port each):

Server A: Port GUID: 0x0024717124000029
Server B: Port GUID: 0x002471712400002a
Server C: Port GUID: 0x0024717127000035
Server D: Port GUID: 0x0024717127000036
Client 1: Port GUID: 0x0002c90300028c01
Client 2: Port GUID: 0x0002c90300026047
Client 3: Port GUID: 0x0002c90300026053
Client 4: Port GUID: 0x0002c9030002603b

Assuming I want a 1:1 (one server port to one client) partitioning, I
would put the following in /etc/ofed/partitions.conf:

part1=0x1, ipoib, defmember=full : 0x0024717124000029, 0x0002c90300028c01;
part2=0x2, ipoib, defmember=full : 0x002471712400002a, 0x0002c90300026047;
part3=0x3, ipoib, defmember=full : 0x0024717127000035, 0x0002c90300026053;
part4=0x4, ipoib, defmember=full : 0x0024717127000036, 0x0002c9030002603b;

... and run w/:

opensm -r -B -P/etc/ofed/partitions.conf

Does that sound correct?

It doesn't work (I restarted ib on the clients), although ibstat shows
the links up. What am I getting wrong?

The opensmd is running on the server.

Thanks,

Chris

From jsquyres at cisco.com Mon Apr 13 09:07:17 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Mon, 13 Apr 2009 12:07:17 -0400
Subject: [ofa-general] New proposal for memory management
Message-ID:

The following is a proposal from several MPI implementations to the
OpenFabrics community (various MPI implementation representatives
CC'ed). The basic concept was introduced in the MPI Panel at Sonoma
(see
http://www.openfabrics.org/archives/spring2009sonoma/tuesday/panel3/panel3.zip)
; it was further refined in discussions after Sonoma.

Introduction:
=============

MPI has long had a problem maintaining its own verbs memory
registration cache in userspace. The main issue is that user
applications are responsible for allocating/freeing their own data
buffers -- the MPI layer does not (usually) have visibility when
application buffers are allocated or freed. Hence, MPI has had to
intercept deallocation calls in order to know when its registration
cache entries have potentially become invalid. Horrible and dangerous
tricks are used to intercept the various flavors of free, sbrk,
munmap, etc.

Here's the classic scenario we're trying to handle better:

1. MPI application allocs buffer A and MPI_SENDs it
2. MPI library registers buffer A and caches it (in user space)
3. MPI application frees buffer A
4. page containing buffer A is returned to the OS
5. MPI application allocs buffer B
   5a. B is at the same virtual address as A, but different physical
       address
6. MPI application MPI_SENDs buffer B
7. MPI library thinks B is already registered and sends it
   --> the physical address may well still be registered, so the send
       does not fail -- but it's the wrong data

Note that the above scenario occurs because before Linux kernel
v2.6.27, the OF kernel drivers are not notified when pages are
returned to the OS -- we're leaking registered memory, and therefore
the OF driver/hardware have the wrong virtual/physical mapping. It
*may* not segv at step 7 because the OF driver/hardware can still
access the memory and it is still registered. But it will definitely
be accessing the wrong physical memory.

In discussions before the Sonoma OpenFabrics event this year, several
MPI implementations got together and concluded that userspace
"notifier" functions might solve this issue for MPI (as proposed by
Pete Wyckoff quite a while ago).
Specifically, when memory is unregistered down in the kernel, a flag
is set in userspace that allows the userspace to know that it needs to
make a [potentially expensive] downcall to find out exactly what
happened. In this way, MPI can know when to update its registration
cache safely.

After further post-Sonoma discussion, it became evident that the
so-called userspace "notifier" functions may not solve the problem --
there seem to be unavoidable race conditions, particularly in
multi-threaded applications (more on this below). We concluded that
what could be useful is to move the registration cache from the
userspace/MPI down into the kernel and maintain it on a per-protection
domain (PD) basis.

Short version:
==============

Here's a short version of our proposal:

1. A new enum value is added to ibv_access_flags: IBV_ACCESS_CACHE.
If this flag is set in the call to ibv_reg_mr(), the following occurs
down in the kernel:

   - look for the memory to be registered in the PD-specific cache
   - if found
     - increment its refcount
   - else
     - try to register the memory
     - if the registration fails because no more memory is available
       - traverse all PD registration caches in this process,
         evicting/unregistering each entry with a refcount <= 0
       - try to register the memory again
     - if the registration succeeds (either the 1st or the 2nd time),
       put it in the PD cache with a refcount of 1

If this flag is *not* set in the call to ibv_reg_mr(), then the
following occurs:

   - try to register the memory
   - if the registration fails because no more registered memory is
     available
     - traverse all PD registration caches in this process,
       evicting/unregistering each entry with a refcount <= 0
     - try to register the memory again

If an application never uses IBV_ACCESS_CACHE, registration
performance should be no different. Registration costs may increase
slightly in some cases if there is a non-empty registration cache.

2.
The kernel side of the ibv_dereg_mr() deregistration call now does
the following:

   - look for the memory to be deregistered in the PD's cache
   - if it's in the cache
     - decrement the refcount (leaving the memory registered)
   - else
     - unregister the memory

3. A new verb, ibv_is_reg(), is created to query if the entire buffer
X is already registered. If it is, increase its refcount in the reg
cache. If it is not, just return an error (and do not register any of
the buffer).

   --> An alternate proposal for this idea is to add another
       ibv_access_flags value (e.g., IBV_ACCESS_IS_CACHED) instead of
       a new verb. But that might be a little odd in that we don't
       want the memory registered if it's not already registered.

This verb is useful for pipelined protocols to offset the cost of
registration of long buffers (e.g., if the buffer is already
registered, just send it -- otherwise let the ULP potentially do
something else). See below for a more detailed explanation / use
case.

4. A new verb, ibv_reg_mr_limits(), is created to specify some
configuration information about the registration cache. Configuration
specifics TBD here, but one obvious possibility here would be to
specify the maximum number of pages that can be registered by this
process (which must be <= the value specified in limits.conf, or it
will fail).

5. A new verb, ibv_reg_mr_clean(), is created to traverse the internal
registration cache and actually de-register any item with a refcount
<= 0. The intent is to give applications the ability to forcibly
deregister any still-existing memory that has been ibv_reg_mr(...,
IBV_ACCESS_CACHE)'ed and later ibv_dereg_mr()'ed.

These proposals assume that the new IOMMU notify system in >=2.6.27
kernels will be used to catch when memory is returned from a process
to the kernel, and will both unregister the memory and remove it from
the kernel PD reg caches, if relevant.

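To make the refcount rules in points 1, 2, 3 and 5 concrete, here is a
minimal user-space sketch of how such a cache could behave. It is only
an illustration under stated assumptions: it does not call libibverbs
or the kernel, every toy_* name and the max_pinned limit are
hypothetical, lookups match exact (address, length) pairs only, and a
single PD stands in for "all PD caches in this process".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CACHE_SLOTS 8            /* hypothetical per-PD cache size */

/* One cached registration: an address range plus a reference count. */
struct toy_entry {
    uintptr_t addr;
    size_t    len;
    int       refcount;
    int       in_use;                /* slot holds a live registration */
};

/* Toy stand-in for a protection domain and its registration cache. */
struct toy_pd {
    struct toy_entry cache[TOY_CACHE_SLOTS];
    int pinned;                      /* registrations currently held */
    int max_pinned;                  /* stand-in for "no more memory" */
};

/* Exact-range lookup; a real cache would also handle sub-ranges. */
struct toy_entry *toy_find(struct toy_pd *pd, uintptr_t addr, size_t len)
{
    assert(pd != NULL);
    for (int i = 0; i < TOY_CACHE_SLOTS; i++) {
        struct toy_entry *e = &pd->cache[i];
        if (e->in_use && e->addr == addr && e->len == len)
            return e;
    }
    return NULL;
}

/* ibv_reg_mr_clean() analogue: evict entries with refcount <= 0. */
int toy_reg_mr_clean(struct toy_pd *pd)
{
    int evicted = 0;
    for (int i = 0; i < TOY_CACHE_SLOTS; i++) {
        struct toy_entry *e = &pd->cache[i];
        if (e->in_use && e->refcount <= 0) {
            e->in_use = 0;           /* "unregister" the memory */
            pd->pinned--;
            evicted++;
        }
    }
    return evicted;
}

/* ibv_reg_mr(..., IBV_ACCESS_CACHE) analogue: 0 on success, -1 on failure. */
int toy_reg_mr_cached(struct toy_pd *pd, uintptr_t addr, size_t len)
{
    struct toy_entry *e = toy_find(pd, addr, len);
    if (e) {                         /* cache hit: just take a reference */
        e->refcount++;
        return 0;
    }
    if (pd->pinned >= pd->max_pinned) {
        toy_reg_mr_clean(pd);        /* first try "failed": evict, then retry */
        if (pd->pinned >= pd->max_pinned)
            return -1;
    }
    for (int i = 0; i < TOY_CACHE_SLOTS; i++) {
        e = &pd->cache[i];
        if (!e->in_use) {
            e->addr = addr;
            e->len = len;
            e->refcount = 1;
            e->in_use = 1;
            pd->pinned++;
            return 0;
        }
    }
    return -1;                       /* the toy cache itself is full */
}

/* ibv_dereg_mr() analogue: drop a reference, leave the memory registered. */
int toy_dereg_mr(struct toy_pd *pd, uintptr_t addr, size_t len)
{
    struct toy_entry *e = toy_find(pd, addr, len);
    if (!e)
        return -1;                   /* not cached: a real call would unregister */
    e->refcount--;
    return 0;
}

/* ibv_is_reg() analogue: 1 (and a new reference) if cached, else 0. */
int toy_is_reg(struct toy_pd *pd, uintptr_t addr, size_t len)
{
    struct toy_entry *e = toy_find(pd, addr, len);
    if (!e)
        return 0;
    e->refcount++;
    return 1;
}
```

The point of the sketch is that a dereg only drops a reference, so a
later reg or is_reg of the same buffer is cheap, and memory is really
released only when a new registration would otherwise fail or when the
"clean" call is made explicitly.
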
More details:
=============

Starting with Linux kernel v2.6.27, the OF kernel drivers can be
notified when pages are returned to the OS (I don't know if they yet
take advantage of this feature). However, we can still run into
pretty much the same scenario -- the MPI userspace registration cache
can become invalid even though the kernel is no longer leaking
registered memory. The situation is *slightly* better because the
ibv_post_send() may fail because the memory will (in a single threaded
application) likely be unregistered.

Pete Wyckoff's solution several years ago was to add two steps into
the scenario listed above; my understanding is this is now possible
with the IOMMU notifiers in 2.6.27 (new steps 4a and 4b):

1. MPI application allocs buffer A and MPI_SENDs it
2. MPI library registers buffer A and caches it (in user space)
3. MPI application frees buffer A
4. page containing buffer A is returned to the OS
   4a. OF kernel driver is notified and can unregister the page
   4b. OF kernel driver can twiddle a bit in userspace indicating that
       something has changed
...etc.

The thought here is that the MPI can register a global variable during
MPI_INIT that can be modified during step 4b. Hence, you can add a
cheap "if" statement in MPI's send path like this:

    if (variable_has_changed_indicating_step_4b_executed) {
        ibv_expensive_downcall_to_find_out_what_happened(..., &output);
        if (need_to_register(buffer, mpi_reg_cache, output)) {
            ibv_reg_mr(buffer, ...);
        }
    }
    ibv_post_send(...);

You get the idea -- check the global variable before invoking
ibv_post_send() or ibv_post_recv(), and if necessary, register the
memory that MPI thought was already registered.

But whacky situations might occur in a multithreaded application where
one thread calls free() while another thread calls malloc(), gets the
same virtual address that was just free()d but has not yet been
unregistered in the kernel, so a subsequent ibv_post_send() may
succeed but be sending the wrong data.

Put simply: in a multi-threaded application, there's always the chance
that the notify won't get to the user-level process until after the
global notifier variable has been checked, right? Or, putting it the
other way: is there any kind of notify system that could be used that
*can't* create a potential race condition in a multi-threaded user
application?

NOTE: There's actually some debate about whether this "bad" scenario
      could actually happen -- I admit that I'm not entirely sure.
      But if this race condition *can* happen, then I cannot think of
      a kernel notifier system that would not have this race
      condition.

So a few of us hashed this around and came up with an alternate
proposal:

1. Move the entire registration cache down into the kernel.
Supporting rationale:

   1a. If all ULPs (MPIs, in this case) have to implement registration
       caches, why not implement it *once*, not N times?

   1b. Putting the reg cache in the kernel means that with the IOMMU
       notifier system introduced in 2.6.27, the kernel can call back
       to the device driver when the mapping changes so that a) the
       memory can be deregistered, and b) the corresponding item can
       be removed from the registration cache. Specifically: the race
       condition described above can be fixed because it's all located
       in one place in the kernel.

2. This means that the userspace process must *always* call
ibv_reg_mr() and ibv_dereg_mr() to increment / decrement the reference
counts on the kernel reg cache. But in practice, on-demand
registration/de-registration is only done for long messages (short
messages typically use copy-to-pre-registered-buffers schemes). So
the additional ibv_reg_mr() before calling ibv_post_send() /
ibv_post_recv() for long messages shouldn't matter.

3. The registration cache in the kernel can lazily deregister cached
memory, as described in the "short version" discussion, above (quite
similar to what MPI's do today).

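As a rough illustration of rationale 1b, the toy below models a cache
keyed by virtual address and shows how a send can silently use a stale
physical page unless an unmap callback drops the entry first. This is
a sketch only, not driver code: toy_on_unmap() merely stands in for
whatever the kernel notifier would do, all toy_* names are
hypothetical, and "physical pages" are plain integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SLOTS 4                  /* hypothetical cache size */

/* One cache entry: which physical page a virtual address was pinned to. */
struct toy_map {
    uintptr_t vaddr;
    int       phys_page;
    int       valid;
};

struct toy_reg_cache {
    struct toy_map slot[TOY_SLOTS];
};

/* Return the cached physical page for vaddr, or -1 if it is not cached. */
int toy_lookup(const struct toy_reg_cache *c, uintptr_t vaddr)
{
    assert(c != NULL);
    for (int i = 0; i < TOY_SLOTS; i++)
        if (c->slot[i].valid && c->slot[i].vaddr == vaddr)
            return c->slot[i].phys_page;
    return -1;
}

/* Record a new registration; returns 0, or -1 if the toy cache is full. */
int toy_insert(struct toy_reg_cache *c, uintptr_t vaddr, int phys_page)
{
    for (int i = 0; i < TOY_SLOTS; i++) {
        if (!c->slot[i].valid) {
            c->slot[i].vaddr = vaddr;
            c->slot[i].phys_page = phys_page;
            c->slot[i].valid = 1;
            return 0;
        }
    }
    return -1;
}

/* Notifier stand-in: the mapping for vaddr went away, so drop its entry. */
void toy_on_unmap(struct toy_reg_cache *c, uintptr_t vaddr)
{
    for (int i = 0; i < TOY_SLOTS; i++)
        if (c->slot[i].valid && c->slot[i].vaddr == vaddr)
            c->slot[i].valid = 0;
}

/*
 * "Send" from vaddr and return the physical page the hardware would read.
 * current_phys is what vaddr maps to right now; it is only consulted on a
 * cache miss, which is exactly why a stale hit sends the wrong data.
 */
int toy_send(struct toy_reg_cache *c, uintptr_t vaddr, int current_phys)
{
    int page = toy_lookup(c, vaddr);
    if (page < 0) {
        if (toy_insert(c, vaddr, current_phys) != 0)
            return -1;
        page = current_phys;
    }
    return page;
}
```

Sending buffer A (page 7) and then buffer B at the same virtual
address (page 9) returns page 7 the second time if nothing invalidates
the entry, and page 9 if toy_on_unmap() runs in between. With the
cache and the callback in the same place there is no window in which
one is updated and the other is not, which is the argument for keeping
both in the kernel.
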
To offset the cost of large memory registrations (because registration
is linearly proportional to the size of the buffer being registered),
pipelined protocols are sometimes used. As such, it seems useful to
have a "is this memory already registered?" verb -- a ULP can check to
see if an entire long message is already registered, and if so, do a
single large RDMA action. If not, the ULP can use a pipelined
protocol to loop over registering a portion of the buffer and then
RDMA'ing it. Possible pipelined pseudocode can look like this:

    if (ibv_is_reg(pd, buffer, len)) {
        ibv_post_send();
        // will still need to ibv_dereg_mr() after completion
    } else {
        // pipeline loop
        for (i = 0; ...) {
            ibv_reg_mr(pd, buffer + i*pipeline_size,
                       pipeline_size, IBV_ACCESS_CACHE);
            ibv_post_send(...);
        }
    }

The rationale here is that these verbs allow the flexibility of doing
something like the above scenario or just registering the whole long
buffer and sending it immediately:

    ibv_reg_mr(pd, buffer, len, IBV_ACCESS_CACHE);
    ibv_post_send(...);

It may also be useful to programmatically enforce some limits on a
given PD's registration cache. A per-process limit is already
enforced via /etc/security/limits.conf, but it may be useful to
specify per-PD limits in the ULP (MPI) itself. Note that most MPI's
have controls like this already; it's consistent with moving the
registration cache down to the kernel. A proposal for the verb could
be:

    ibv_reg_mr_cache_limits(pd, max_num_pages)

Another userspace-accessible verb that may be useful is one that
traverses a PD's reg cache and actually deregisters any item with a
refcount <= 0. This allows a ULP to "clean out" any lingering
registrations, thereby freeing up registered memory for other uses
(e.g., being registered by another PD).
This verb can have a simplistic interface:

    ibv_reg_mr_clean(pd)

It's not 100% clear that we need this "clean" verb -- if ibv_reg_mr()
will evict entries with <= 0 refcounts from any PD's registration
cache in this process, that might be enough. However, using verbs
registered memory with other (non-verbs) pinned memory in the same
process may make this verb necessary.

-----

Finally, it should be noted that with 2.6.27's IOMMU notify system,
full on-demand paging / registering seems possible. On-demand paging
would be a full, complete solution -- the ULP wouldn't have to worry
about registering / de-registering memory at all (the existing
de/registration verbs could become no-ops for backwards
compatibility). I assume that a proposal along these lines would be a
[much] larger debate in the OpenFabrics community, and further assume
that the proposal above would be a smaller debate and actually have a
chance of being implemented in the not-distant future.

(/me puts on fire suit)

Thoughts?

-- 
Jeff Squyres
Cisco Systems

From faisal.latif at intel.com Mon Apr 13 09:09:47 2009
From: faisal.latif at intel.com (Faisal Latif)
Date: Mon, 13 Apr 2009 11:09:47 -0500
Subject: [ofa-general] [PATCH] RDMA/nes: application hang during large cluster test
Message-ID: <20090413160947.GA4260@flatif-MOBL>

Running large cluster setup, sometimes tests are hanging during long
testing cycle. Fixing required following changes in nes_cm.[ch] code.

* Under heavy load, sometimes it takes longer to receive the response
  from application to the MPA request. The rexmit timeout value is too
  low.
* in handle_fin_pkt(), we are calling cleanup_retrans_entry() for all
  conditions, even if the packets needs to be dropped.
* check_seq(), does not check for condition if the seq# is wrapped.
* handle_ack_pkt() need to return error value, so in case of error,
  handle_fin() is not called.
* handle_rst_pkt(), handling of cm_node's NES_CM_STATE_LAST_ACK is
  missing.
* process_packet(), in case of FIN only packet is received, call check_seq() before processing. * nes_connect() is not to set apbvt bit if it is a loopback connection. apbvt bit is only set for non-loopback connections as for loopback, all the connection setup is done from the driver. Signed-off-by: Faisal Latif --- drivers/infiniband/hw/nes/nes_cm.c | 74 ++++++++++++++++++------------------ drivers/infiniband/hw/nes/nes_cm.h | 1 + 2 files changed, 38 insertions(+), 37 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c index dbd9a75..61da9d3 100644 --- a/drivers/infiniband/hw/nes/nes_cm.c +++ b/drivers/infiniband/hw/nes/nes_cm.c @@ -56,6 +56,7 @@ #include #include #include +#include #include "nes.h" @@ -540,6 +541,7 @@ static void nes_cm_timer_tick(unsigned long pass) struct list_head *list_node; struct nes_cm_core *cm_core = g_cm_core; u32 settimer = 0; + unsigned long timetosend; int ret = NETDEV_TX_OK; struct list_head timer_list; @@ -644,8 +646,10 @@ static void nes_cm_timer_tick(unsigned long pass) send_entry->retrycount); if (send_entry->send_retrans) { send_entry->retranscount--; + timetosend = (NES_RETRY_TIMEOUT << + (NES_DEFAULT_RETRANS - send_entry->retranscount)); send_entry->timetosend = jiffies + - NES_RETRY_TIMEOUT; + min(timetosend, NES_MAX_TIMEOUT); if (nexttimeout > send_entry->timetosend || !settimer) { nexttimeout = send_entry->timetosend; @@ -1325,18 +1329,20 @@ static void handle_fin_pkt(struct nes_cm_node *cm_node) nes_debug(NES_DBG_CM, "Received FIN, cm_node = %p, state = %u. 
" "refcnt=%d\n", cm_node, cm_node->state, atomic_read(&cm_node->ref_count)); - cm_node->tcp_cntxt.rcv_nxt++; - cleanup_retrans_entry(cm_node); switch (cm_node->state) { case NES_CM_STATE_SYN_RCVD: case NES_CM_STATE_SYN_SENT: case NES_CM_STATE_ESTABLISHED: case NES_CM_STATE_MPAREQ_SENT: case NES_CM_STATE_MPAREJ_RCVD: + cm_node->tcp_cntxt.rcv_nxt++; + cleanup_retrans_entry(cm_node); cm_node->state = NES_CM_STATE_LAST_ACK; send_fin(cm_node, NULL); break; case NES_CM_STATE_FIN_WAIT1: + cm_node->tcp_cntxt.rcv_nxt++; + cleanup_retrans_entry(cm_node); cm_node->state = NES_CM_STATE_CLOSING; send_ack(cm_node, NULL); /* Wait for ACK as this is simultanous close.. @@ -1344,11 +1350,15 @@ static void handle_fin_pkt(struct nes_cm_node *cm_node) * Just rm the node.. Done.. */ break; case NES_CM_STATE_FIN_WAIT2: + cm_node->tcp_cntxt.rcv_nxt++; + cleanup_retrans_entry(cm_node); cm_node->state = NES_CM_STATE_TIME_WAIT; send_ack(cm_node, NULL); schedule_nes_timer(cm_node, NULL, NES_TIMER_TYPE_CLOSE, 1, 0); break; case NES_CM_STATE_TIME_WAIT: + cm_node->tcp_cntxt.rcv_nxt++; + cleanup_retrans_entry(cm_node); cm_node->state = NES_CM_STATE_CLOSED; rem_ref_cm_node(cm_node->cm_core, cm_node); break; @@ -1384,7 +1394,6 @@ static void handle_rst_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb, passive_state = atomic_add_return(1, &cm_node->passive_state); if (passive_state == NES_SEND_RESET_EVENT) create_event(cm_node, NES_CM_EVENT_RESET); - cleanup_retrans_entry(cm_node); cm_node->state = NES_CM_STATE_CLOSED; dev_kfree_skb_any(skb); break; @@ -1398,17 +1407,16 @@ static void handle_rst_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb, active_open_err(cm_node, skb, reset); break; case NES_CM_STATE_CLOSED: - cleanup_retrans_entry(cm_node); drop_packet(skb); break; + case NES_CM_STATE_LAST_ACK: + cm_node->cm_id->rem_ref(cm_node->cm_id); case NES_CM_STATE_TIME_WAIT: - cleanup_retrans_entry(cm_node); cm_node->state = NES_CM_STATE_CLOSED; rem_ref_cm_node(cm_node->cm_core, cm_node); 
drop_packet(skb); break; case NES_CM_STATE_FIN_WAIT1: - cleanup_retrans_entry(cm_node); nes_debug(NES_DBG_CM, "Bad state %s[%u]\n", __func__, __LINE__); default: drop_packet(skb); @@ -1455,6 +1463,7 @@ static void handle_rcv_mpa(struct nes_cm_node *cm_node, struct sk_buff *skb) NES_PASSIVE_STATE_INDICATED); break; case NES_CM_STATE_MPAREQ_SENT: + cleanup_retrans_entry(cm_node); if (res_type == NES_MPA_REQUEST_REJECT) { type = NES_CM_EVENT_MPA_REJECT; cm_node->state = NES_CM_STATE_MPAREJ_RCVD; @@ -1518,7 +1527,7 @@ static int check_seq(struct nes_cm_node *cm_node, struct tcphdr *tcph, rcv_wnd = cm_node->tcp_cntxt.rcv_wnd; if (ack_seq != loc_seq_num) err = 1; - else if ((seq + rcv_wnd) < rcv_nxt) + else if (!between(seq, rcv_nxt, (rcv_nxt+rcv_wnd))) err = 1; if (err) { nes_debug(NES_DBG_CM, "%s[%u] create abort for cm_node=%p " @@ -1652,49 +1661,39 @@ static void handle_synack_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb, } } -static void handle_ack_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb, +static int handle_ack_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb, struct tcphdr *tcph) { int datasize = 0; u32 inc_sequence; u32 rem_seq_ack; u32 rem_seq; - int ret; + int ret = 0; int optionsize; optionsize = (tcph->doff << 2) - sizeof(struct tcphdr); if (check_seq(cm_node, tcph, skb)) - return; + return -EINVAL; skb_pull(skb, tcph->doff << 2); inc_sequence = ntohl(tcph->seq); rem_seq = ntohl(tcph->seq); rem_seq_ack = ntohl(tcph->ack_seq); datasize = skb->len; - cleanup_retrans_entry(cm_node); switch (cm_node->state) { case NES_CM_STATE_SYN_RCVD: /* Passive OPEN */ + cleanup_retrans_entry(cm_node); ret = handle_tcp_options(cm_node, tcph, skb, optionsize, 1); if (ret) break; cm_node->tcp_cntxt.rem_ack_num = ntohl(tcph->ack_seq); - if (cm_node->tcp_cntxt.rem_ack_num != - cm_node->tcp_cntxt.loc_seq_num) { - nes_debug(NES_DBG_CM, "rem_ack_num != loc_seq_num\n"); - cleanup_retrans_entry(cm_node); - send_reset(cm_node, skb); - return; - } cm_node->state = 
NES_CM_STATE_ESTABLISHED; - cleanup_retrans_entry(cm_node); if (datasize) { cm_node->tcp_cntxt.rcv_nxt = inc_sequence + datasize; handle_rcv_mpa(cm_node, skb); - } else { /* rcvd ACK only */ + } else /* rcvd ACK only */ dev_kfree_skb_any(skb); - cleanup_retrans_entry(cm_node); - } break; case NES_CM_STATE_ESTABLISHED: /* Passive OPEN */ @@ -1706,15 +1705,12 @@ static void handle_ack_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb, drop_packet(skb); break; case NES_CM_STATE_MPAREQ_SENT: - cleanup_retrans_entry(cm_node); cm_node->tcp_cntxt.rem_ack_num = ntohl(tcph->ack_seq); if (datasize) { cm_node->tcp_cntxt.rcv_nxt = inc_sequence + datasize; handle_rcv_mpa(cm_node, skb); - } else { /* Could be just an ack pkt.. */ - cleanup_retrans_entry(cm_node); + } else /* Could be just an ack pkt.. */ dev_kfree_skb_any(skb); - } break; case NES_CM_STATE_LISTENING: case NES_CM_STATE_CLOSED: @@ -1722,11 +1718,10 @@ static void handle_ack_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb, send_reset(cm_node, skb); break; case NES_CM_STATE_LAST_ACK: + case NES_CM_STATE_CLOSING: cleanup_retrans_entry(cm_node); cm_node->state = NES_CM_STATE_CLOSED; cm_node->cm_id->rem_ref(cm_node->cm_id); - case NES_CM_STATE_CLOSING: - cleanup_retrans_entry(cm_node); rem_ref_cm_node(cm_node->cm_core, cm_node); drop_packet(skb); break; @@ -1741,9 +1736,11 @@ static void handle_ack_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb, case NES_CM_STATE_MPAREQ_RCVD: case NES_CM_STATE_UNKNOWN: default: + cleanup_retrans_entry(cm_node); drop_packet(skb); break; } + return ret; } @@ -1849,6 +1846,7 @@ static void process_packet(struct nes_cm_node *cm_node, struct sk_buff *skb, enum nes_tcpip_pkt_type pkt_type = NES_PKT_TYPE_UNKNOWN; struct tcphdr *tcph = tcp_hdr(skb); u32 fin_set = 0; + int ret = 0; skb_pull(skb, ip_hdr(skb)->ihl << 2); nes_debug(NES_DBG_CM, "process_packet: cm_node=%p state =%d syn=%d " @@ -1874,17 +1872,17 @@ static void process_packet(struct nes_cm_node *cm_node, struct sk_buff 
*skb, handle_synack_pkt(cm_node, skb, tcph); break; case NES_PKT_TYPE_ACK: - handle_ack_pkt(cm_node, skb, tcph); - if (fin_set) + ret = handle_ack_pkt(cm_node, skb, tcph); + if (fin_set && !ret) handle_fin_pkt(cm_node); break; case NES_PKT_TYPE_RST: handle_rst_pkt(cm_node, skb, tcph); break; default: - drop_packet(skb); - if (fin_set) + if ((fin_set) && (!check_seq(cm_node, tcph, skb))) handle_fin_pkt(cm_node); + drop_packet(skb); break; } } @@ -2959,6 +2957,7 @@ int nes_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) struct nes_device *nesdev; struct nes_cm_node *cm_node; struct nes_cm_info cm_info; + int apbvt_set = 0; ibqp = nes_get_qp(cm_id->device, conn_param->qpn); if (!ibqp) @@ -2996,9 +2995,11 @@ int nes_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) conn_param->private_data_len); if (cm_id->local_addr.sin_addr.s_addr != - cm_id->remote_addr.sin_addr.s_addr) + cm_id->remote_addr.sin_addr.s_addr) { nes_manage_apbvt(nesvnic, ntohs(cm_id->local_addr.sin_port), PCI_FUNC(nesdev->pcidev->devfn), NES_MANAGE_APBVT_ADD); + apbvt_set = 1; + } /* set up the connection params for the node */ cm_info.loc_addr = htonl(cm_id->local_addr.sin_addr.s_addr); @@ -3015,8 +3016,7 @@ int nes_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) conn_param->private_data_len, (void *)conn_param->private_data, &cm_info); if (!cm_node) { - if (cm_id->local_addr.sin_addr.s_addr != - cm_id->remote_addr.sin_addr.s_addr) + if (apbvt_set) nes_manage_apbvt(nesvnic, ntohs(cm_id->local_addr.sin_port), PCI_FUNC(nesdev->pcidev->devfn), NES_MANAGE_APBVT_DEL); @@ -3025,7 +3025,7 @@ int nes_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) return -ENOMEM; } - cm_node->apbvt_set = 1; + cm_node->apbvt_set = apbvt_set; nesqp->cm_node = cm_node; cm_node->nesqp = nesqp; nes_add_ref(&nesqp->ibqp); diff --git a/drivers/infiniband/hw/nes/nes_cm.h b/drivers/infiniband/hw/nes/nes_cm.h index 80bba18..8b7e7c0 100644 --- 
a/drivers/infiniband/hw/nes/nes_cm.h +++ b/drivers/infiniband/hw/nes/nes_cm.h @@ -149,6 +149,7 @@ struct nes_timer_entry { #endif #define NES_SHORT_TIME (10) #define NES_LONG_TIME (2000*HZ/1000) +#define NES_MAX_TIMEOUT ((unsigned long) (12*HZ)) #define NES_CM_HASHTABLE_SIZE 1024 #define NES_CM_TCP_TIMER_INTERVAL 3000 -- 1.5.3.3 From hal.rosenstock at gmail.com Mon Apr 13 10:53:42 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 13 Apr 2009 13:53:42 -0400 Subject: ***SPAM*** Re: [ofa-general] Any easy way to specify to the SM to route/zone? In-Reply-To: References: Message-ID: On Mon, Apr 13, 2009 at 12:02 PM, Chris Worley wrote: > On Mon, Apr 13, 2009 at 7:43 AM, Hal Rosenstock > wrote: >> On Mon, Apr 13, 2009 at 9:37 AM, Chris Worley wrote: >>> On Mon, Apr 13, 2009 at 5:39 AM, Hal Rosenstock >>> wrote: >>>> On Sun, Apr 12, 2009 at 11:01 PM, Chris Worley wrote: >>>>> >>>>> So I need to tell the SM to route specific ports on the server/target >>>>> to specific clients/initiators. >>>>> >>>>> Is there any way to do this? >>>> >>>> Do you mean restrict access between certain clients/servers ? >>> >>> One server w/ 4QDR boards, 16 clients with one QDR board.  I want each >>> port on the server routed/zoned to two clients. >>> >>>> If so, >>>> you can do this with partitioning >>> >>> What is partitioning? >> >> A partition is a collection of ports which are allowed to communicate >> together. There are two forms of members: full members which can talk >> to any other member (useful for servers) and limited members which can >> only talk to full members (useful for clients). See the opensm man >> page or partition-config.txt on setting this up for OpenSM. >> > > Let me see if I understand this with a simple example... 
my port GUIDs > (as reported by ibstat) are for one server (4 QDR ports) and four > clients (one QDR port each): > > > Server A:           Port GUID: 0x0024717124000029 > Server B:           Port GUID: 0x002471712400002a > Server C:           Port GUID: 0x0024717127000035 > Server D:           Port GUID: 0x0024717127000036 > > Client 1:                Port GUID: 0x0002c90300028c01 > Client 2:                Port GUID: 0x0002c90300026047 > Client 3:                Port GUID: 0x0002c90300026053 > Client 4:                Port GUID: 0x0002c9030002603b > > Assuming I want a 1:1 (one server port to one client) partitioning, I > would put the following in /etc/ofed/partitions.conf: > > part1=0x1, ipoib, defmember=full : 0x0024717124000029, 0x0002c90300028c01; > part2=0x2, ipoib, defmember=full : 0x002471712400002a, 0x0002c90300026047; > part3=0x3, ipoib, defmember=full : 0x0024717127000035, 0x0002c90300026053; > part4=0x4, ipoib, defmember=full : 0x0024717127000036, 0x0002c9030002603b; So you want IPoIB. > ... and run w/: > > opensm -r -B -P/etc/ofed/partitions.conf > > Does that sound correct?  It doesn't work What application(s) aren't working ? Any SM error messages ? Any end node messages pertaining to IB ? > (I restarted ib on the > clients), although ibstat shows the links up.  What am I getting > wrong?  The opensmd is running on the server. Which server ? You still need the default partition with the SM node being full and the others being limited there (so it's also best to run SM on separate node if possible otherwise you have the potential of any client connecting to it on default partition). -- Hal > Thanks, > > Chris > From worleys at gmail.com Mon Apr 13 11:26:45 2009 From: worleys at gmail.com (Chris Worley) Date: Mon, 13 Apr 2009 12:26:45 -0600 Subject: ***SPAM*** Re: [ofa-general] Any easy way to specify to the SM to route/zone? 
In-Reply-To: References: Message-ID: On Mon, Apr 13, 2009 at 11:53 AM, Hal Rosenstock wrote: > On Mon, Apr 13, 2009 at 12:02 PM, Chris Worley wrote: >> On Mon, Apr 13, 2009 at 7:43 AM, Hal Rosenstock wrote: >>> On Mon, Apr 13, 2009 at 9:37 AM, Chris Worley wrote: >>>> On Mon, Apr 13, 2009 at 5:39 AM, Hal Rosenstock >>>> wrote: >>>>> On Sun, Apr 12, 2009 at 11:01 PM, Chris Worley wrote: >>>>>> >>>>>> So I need to tell the SM to route specific ports on the server/target >>>>>> to specific clients/initiators. >>>>>> >>>>>> Is there any way to do this? >>>>> >>>>> Do you mean restrict access between certain clients/servers ? >>>> >>>> One server w/ 4QDR boards, 16 clients with one QDR board.  I want each >>>> port on the server routed/zoned to two clients. >>>> >>>>> If so, >>>>> you can do this with partitioning >>>> >>>> What is partitioning? >>> >>> A partition is a collection of ports which are allowed to communicate >>> together. There are two forms of members: full members which can talk >>> to any other member (useful for servers) and limited members which can >>> only talk to full members (useful for clients). See the opensm man >>> page or partition-config.txt on setting this up for OpenSM. >>> >> >> Let me see if I understand this with a simple example... 
my port GUIDs >> (as reported by ibstat) are for one server (4 QDR ports) and four >> clients (one QDR port each): >> >> >> Server A:           Port GUID: 0x0024717124000029 >> Server B:           Port GUID: 0x002471712400002a >> Server C:           Port GUID: 0x0024717127000035 >> Server D:           Port GUID: 0x0024717127000036 >> >> Client 1:                Port GUID: 0x0002c90300028c01 >> Client 2:                Port GUID: 0x0002c90300026047 >> Client 3:                Port GUID: 0x0002c90300026053 >> Client 4:                Port GUID: 0x0002c9030002603b >> >> Assuming I want a 1:1 (one server port to one client) partitioning, I >> would put the following in /etc/ofed/partitions.conf: >> >> part1=0x1, ipoib, defmember=full : 0x0024717124000029, 0x0002c90300028c01; >> part2=0x2, ipoib, defmember=full : 0x002471712400002a, 0x0002c90300026047; >> part3=0x3, ipoib, defmember=full : 0x0024717127000035, 0x0002c90300026053; >> part4=0x4, ipoib, defmember=full : 0x0024717127000036, 0x0002c9030002603b; > > So you want IPoIB. I'm doing SRP, so I need IPoIB working. > >> ... and run w/: >> >> opensm -r -B -P/etc/ofed/partitions.conf >> >> Does that sound correct?  It doesn't work > > What application(s) aren't working ? ping over IPoIB, for example. I am seeing the test node in an "initializing" state right now... I thought it was "up" before. > Any SM error messages ? 
The server has one klogd error coming out continuously: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22 OpenSM is seeing "lid out of range", "send completed with error", "Failed to find source physical port for trap" Opensm's log looks like: Apr 13 12:03:43 556996 [21085350] 0x03 -> OpenSM 3.2.2 Apr 13 12:03:43 557061 [21085350] 0x80 -> OpenSM 3.2.2 Apr 13 12:03:43 557556 [21085350] 0x02 -> osm_vendor_init: 1000 pending umads specified Apr 13 12:03:43 557659 [21085350] 0x80 -> Entering DISCOVERING state Apr 13 12:03:43 605573 [21085350] 0x02 -> osm_vendor_bind: Binding to port 0x24717124000029 Apr 13 12:03:43 636142 [21085350] 0x02 -> osm_vendor_bind: Binding to port 0x24717124000029 Apr 13 12:03:44 437076 [4863C940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x520000123b) -- dropping Apr 13 12:03:44 437104 [4863C940] 0x01 -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0 Apr 13 12:03:44 437126 [4863C940] 0x01 -> Received SMP on a 1 hop path: Initial path = 0,0 Return path = 0,0 Apr 13 12:03:44 437135 [4863C940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Apr 13 12:03:44 437179 [4863C940] 0x01 -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x123b attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................65535 dr_dlid.................65535 Initial path: 0,1 Return path: 0,0 Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Apr 13 12:03:44 437218 
[47C3B940] 0x80 -> Entering MASTER state Apr 13 12:03:44 437409 [47C3B940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0 GID:0xfe80000000000000,0x0024717124000029 Apr 13 12:03:44 437458 [47C3B940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0 GID:0xfe80000000000000,0x0024717124000029 Apr 13 12:03:44 437514 [47C3B940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0 GID:0xfe80000000000000,0x0024717124000029 Apr 13 12:03:44 437558 [47C3B940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0 GID:0xfe80000000000000,0x0024717124000029 Apr 13 12:03:44 437612 [47C3B940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0 GID:0xfe80000000000000,0x0024717124000029 Apr 13 12:03:44 437653 [47C3B940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0 GID:0xfe80000000000000,0x0024717124000029 Apr 13 12:03:44 437707 [47C3B940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0 GID:0xfe80000000000000,0x0024717124000029 Apr 13 12:03:44 437748 [47C3B940] 0x02 -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0 GID:0xfe80000000000000,0x0024717124000029 Apr 13 12:03:44 443077 [47C3B940] 0x80 -> SUBNET UP Apr 13 12:03:44 891932 [42232940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:0x02 num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f Apr 13 12:03:44 891951 [42232940] 0x01 -> osm_get_physp_by_mad_addr: ERR 7503: Lid is out of range: 10 Apr 13 12:03:44 891959 [42232940] 0x01 -> __osm_trap_rcv_process_request: ERR 3809: Failed to find source physical port for trap Apr 13 12:03:45 184124 [44035940] 0x01 -> __osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 0xff12401bffff0000 : 0x00000000ffffffff from port 0x0 24717124000029 
(MT25408) ... Apr 13 12:04:04 852289 [43634940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:0x02 num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f Apr 13 12:04:04 852306 [43634940] 0x01 -> osm_get_physp_by_mad_addr: ERR 7503: Lid is out of range: 10 Apr 13 12:04:04 852314 [43634940] 0x01 -> __osm_trap_rcv_process_request: ERR 3809: Failed to find source physical port for trap Apr 13 12:04:04 852363 [43634940] 0x01 -> __osm_trap_rcv_process_request: ERR 3804: Received trap 20 times consecutively Apr 13 12:04:05 850307 [44035940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:0x02 num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f Apr 13 12:04:05 850327 [44035940] 0x01 -> osm_get_physp_by_mad_addr: ERR 7503: Lid is out of range: 10 Apr 13 12:04:05 850334 [44035940] 0x01 -> __osm_trap_rcv_process_request: ERR 3809: Failed to find source physical port for trap Apr 13 12:04:06 848327 [44A36940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:0x02 num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f Apr 13 12:04:06 848340 [44A36940] 0x01 -> osm_get_physp_by_mad_addr: ERR 7503: Lid is out of range: 10 Apr 13 12:04:06 848348 [44A36940] 0x01 -> __osm_trap_rcv_process_request: ERR 3809: Failed to find source physical port for trap Apr 13 12:04:07 846349 [45437940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:0x02 num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f Apr 13 12:04:07 846365 [45437940] 0x01 -> osm_get_physp_by_mad_addr: ERR 7503: Lid is out of range: 10 Apr 13 12:04:07 846373 [45437940] 0x01 -> __osm_trap_rcv_process_request: ERR 3809: Failed to find source physical port for trap Apr 13 12:04:08 844372 [45E38940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:0x02 num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f Apr 13 12:04:08 844391 [45E38940] 0x01 -> osm_get_physp_by_mad_addr: ERR 7503: 
Lid is out of range: 10 Apr 13 12:04:08 844398 [45E38940] 0x01 -> __osm_trap_rcv_process_request: ERR 3809: Failed to find source physical port for trap Apr 13 12:04:09 842394 [46839940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:0x02 num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f Apr 13 12:04:09 842414 [46839940] 0x01 -> osm_get_physp_by_mad_addr: ERR 7503: Lid is out of range: 10 Apr 13 12:04:09 842421 [46839940] 0x01 -> __osm_trap_rcv_process_request: ERR 3809: Failed to find source physical port for trap Apr 13 12:04:10 840400 [42232940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:0x02 num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f Apr 13 12:04:10 840414 [42232940] 0x01 -> osm_get_physp_by_mad_addr: ERR 7503: Lid is out of range: 10 Apr 13 12:04:10 840421 [42232940] 0x01 -> __osm_trap_rcv_process_request: ERR 3809: Failed to find source physical port for trap Apr 13 12:04:11 838419 [42C33940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:0x02 num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f Apr 13 12:04:11 838432 [42C33940] 0x01 -> osm_get_physp_by_mad_addr: ERR 7503: Lid is out of range: 10 Apr 13 12:04:11 838440 [42C33940] 0x01 -> __osm_trap_rcv_process_request: ERR 3809: Failed to find source physical port for trap Apr 13 12:04:12 836435 [43634940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:0x02 num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f Apr 13 12:04:12 836467 [43634940] 0x01 -> osm_get_physp_by_mad_addr: ERR 7503: Lid is out of range: 10 Apr 13 12:04:12 836476 [43634940] 0x01 -> __osm_trap_rcv_process_request: ERR 3809: Failed to find source physical port for trap Apr 13 12:04:13 834459 [45437940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:0x02 num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f Apr 13 12:04:13 834479 [45437940] 0x01 -> 
osm_get_physp_by_mad_addr: ERR 7503: Lid is out of range: 10 Apr 13 12:04:13 834487 [45437940] 0x01 -> __osm_trap_rcv_process_request: ERR 3809: Failed to find source physical port for trap Apr 13 12:04:14 364185 [4863C940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x5200001266) -- dropping Apr 13 12:04:14 364211 [4863C940] 0x01 -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0 ... Apr 13 12:19:51 971642 [453B6940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:0x02 num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f Apr 13 12:19:51 971658 [453B6940] 0x01 -> osm_get_physp_by_mad_addr: ERR 7503: Lid is out of range: 10 Apr 13 12:19:51 971666 [453B6940] 0x01 -> __osm_trap_rcv_process_request: ERR 3809: Failed to find source physical port for trap Apr 13 12:19:52 969658 [45DB7940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:0x02 num:128 Producer:2 (Switch) from LID:10 TID:0x0000000000000190 Apr 13 12:19:52 969671 [45DB7940] 0x01 -> osm_get_physp_by_mad_addr: ERR 7503: Lid is out of range: 10 Apr 13 12:19:52 969679 [45DB7940] 0x01 -> __osm_trap_rcv_process_request: ERR 3809: Failed to find source physical port for trap Apr 13 12:19:53 967681 [467B8940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:0x02 num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f Apr 13 12:19:53 967696 [467B8940] 0x01 -> osm_get_physp_by_mad_addr: ERR 7503: Lid is out of range: 10 Apr 13 12:19:53 967704 [467B8940] 0x01 -> __osm_trap_rcv_process_request: ERR 3809: Failed to find source physical port for trap Apr 13 12:19:54 965697 [471B9940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:0x02 num:128 Producer:2 (Switch) from LID:10 TID:0x0000000000000190 Apr 13 12:19:54 965710 [471B9940] 0x01 -> osm_get_physp_by_mad_addr: ERR 7503: Lid is out of range: 10 Apr 13 12:19:54 965717 [471B9940] 0x01 -> __osm_trap_rcv_process_request: ERR 3809: 
Failed to find source physical port for trap Apr 13 12:19:55 963717 [42BB2940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:0x02 num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f Apr 13 12:19:55 963735 [42BB2940] 0x01 -> osm_get_physp_by_mad_addr: ERR 7503: Lid is out of range: 10 Apr 13 12:19:55 963743 [42BB2940] 0x01 -> __osm_trap_rcv_process_request: ERR 3809: Failed to find source physical port for trap Apr 13 12:19:56 961736 [435B3940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:0x02 num:128 Producer:2 (Switch) from LID:10 TID:0x0000000000000190 Apr 13 12:19:56 961749 [435B3940] 0x01 -> osm_get_physp_by_mad_addr: ERR 7503: Lid is out of range: 10 Apr 13 12:19:56 961779 [435B3940] 0x01 -> __osm_trap_rcv_process_request: ERR 3809: Failed to find source physical port for trap Apr 13 12:19:57 959748 [43FB4940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:0x02 num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f Apr 13 12:19:57 959771 [43FB4940] 0x01 -> osm_get_physp_by_mad_addr: ERR 7503: Lid is out of range: 10 Apr 13 12:19:57 959779 [43FB4940] 0x01 -> __osm_trap_rcv_process_request: ERR 3809: Failed to find source physical port for trap Apr 13 12:19:58 957770 [449B5940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:0x02 num:128 Producer:2 (Switch) from LID:10 TID:0x0000000000000190 Apr 13 12:19:58 957788 [449B5940] 0x01 -> osm_get_physp_by_mad_addr: ERR 7503: Lid is out of range: 10 Apr 13 12:19:58 957795 [449B5940] 0x01 -> __osm_trap_rcv_process_request: ERR 3809: Failed to find source physical port for trap Apr 13 12:19:59 955793 [453B6940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:0x02 num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f Apr 13 12:19:59 955806 [453B6940] 0x01 -> osm_get_physp_by_mad_addr: ERR 7503: Lid is out of range: 10 Apr 13 12:19:59 955813 [453B6940] 0x01 -> 
__osm_trap_rcv_process_request: ERR 3809: Failed to find source physical port for trap Apr 13 12:20:00 491524 [45DB7940] 0x01 -> __osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet, scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 0xff12401bffff0000 : 0x00000000ffffffff from port 0x0 24717124000029 (MT25408 IOSAN Fusion-IO) Apr 13 12:20:00 953808 [42BB2940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:0x02 num:128 Producer:2 (Switch) from LID:10 TID:0x0000000000000190 Apr 13 12:20:00 953822 [42BB2940] 0x01 -> osm_get_physp_by_mad_addr: ERR 7503: Lid is out of range: 10 Apr 13 12:20:00 953830 [42BB2940] 0x01 -> __osm_trap_rcv_process_request: ERR 3809: Failed to find source physical port for trap Apr 13 12:20:01 424318 [48FBC940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x5500001311) -- dropping Apr 13 12:20:01 424345 [48FBC940] 0x01 -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0 Apr 13 12:20:01 424356 [48FBC940] 0x01 -> Received SMP on a 1 hop path: Initial path = 0,0 Return path = 0,0 Apr 13 12:20:01 424366 [48FBC940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Apr 13 12:20:01 424410 [48FBC940] 0x01 -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x1311 attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................65535 dr_dlid.................65535 Initial path: 0,1 Return path: 0,0 Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00

> Any end
> node messages pertaining to IB ?

Nothing I can see.

>
>> (I restarted ib on the
>> clients), although ibstat shows the links up.  What am I getting
>> wrong?  The opensmd is running on the server.
>
> Which server ?

There's only one server... it has many ports, which I'm trying to partition to different clients. So, in the above, when I say "Server A", I mean server port "A".

>
> You still need the default partition with the SM node being full and
> the others being limited there (so it's also best to run SM on
> separate node if possible otherwise you have the potential of any
> client connecting to it on default partition).

Are you saying to change the partitions.conf file to:

part1=0x1, ipoib: 0x0024717124000029=full, 0x0002c90300028c01;
part2=0x2, ipoib: 0x002471712400002a=full, 0x0002c90300026047;
part3=0x3, ipoib: 0x0024717127000035=full, 0x0002c90300026053;
part4=0x4, ipoib: 0x0024717127000036=full, 0x0002c9030002603b;

... (which still doesn't work) in which case I set all the server's ports to "full", or should just one be "full" (which didn't work either)?

I did have a difficult time understanding the difference between "full" and "limited" in the man page.

I've got a captive network, so I don't want any paths I've not specified to be allowed. If that makes any sense. So, I didn't want to put a statement in like:

Default=0x7fff,ipoib:ALL=full;

... that would let a rogue node slip through the cracks.
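The full/limited distinction quoted earlier in the thread can be sketched in a few lines of C. This is only an illustration, assuming the usual InfiniBand convention that the top bit of the 16-bit P_Key is the membership bit (set means full, clear means limited) and the low 15 bits identify the partition. The helper names `pkey_base`, `pkey_is_full`, and `pkeys_can_talk` are hypothetical and are not part of OpenSM or the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bit 15 = membership (1 = full, 0 = limited),
 * bits 14..0 = partition number ("only low 15 bits will be used"). */
#define PKEY_FULL_BIT  0x8000u
#define PKEY_BASE_MASK 0x7fffu

/* Partition number with the membership bit stripped (hypothetical helper). */
static uint16_t pkey_base(uint16_t pkey)
{
    return (uint16_t)(pkey & PKEY_BASE_MASK);
}

/* True if this P_Key carries full membership (hypothetical helper). */
static bool pkey_is_full(uint16_t pkey)
{
    return (pkey & PKEY_FULL_BIT) != 0;
}

/* Two ports can communicate only if they are in the same partition
 * and at least one of them is a full member; two limited members of
 * the same partition cannot talk to each other. */
static bool pkeys_can_talk(uint16_t a, uint16_t b)
{
    if (pkey_base(a) != pkey_base(b))
        return false;
    return pkey_is_full(a) || pkey_is_full(b);
}
```

Under this rule, client ports that are limited members of partition 0x1 (P_Key 0x0001) can reach a server port that is a full member (P_Key 0x8001) but cannot reach each other, which matches the "full for servers, limited for clients" description quoted above.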
Thanks,

Chris

>
> -- Hal
>
>> Thanks,
>>
>> Chris
>>
>

From hnrose at comcast.net Mon Apr 13 11:30:26 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Mon, 13 Apr 2009 14:30:26 -0400
Subject: [ofa-general] ***SPAM*** [PATCH] opensm/partition-config.txt: Update for defmember feature
Message-ID: <20090413183026.GA12839@comcast.net>

Signed-off-by: Hal Rosenstock
---
diff --git a/opensm/doc/partition-config.txt b/opensm/doc/partition-config.txt
index 6bb8e9b..ead3f76 100644
--- a/opensm/doc/partition-config.txt
+++ b/opensm/doc/partition-config.txt
@@ -32,13 +32,15 @@ General file format:
 Partition Definition:
 --------------------

-[PartitionName][=PKey][,flag[=value]]
+[PartitionName][=PKey][,flag[=value]][,defmember=full|limited]

 PartitionName - string, to be used with logging. When omitted
                 empty string will be used.
 PKey          - P_Key value for this partition. Only low 15 bits will
                 be used. When omitted will be autogenerated.
 flag          - used to indicate IPoIB capability of this partition.
+defmember=full|limited - specifies default membership for port guid
+                list. Default is limited.

 Currently recognized flags are:

@@ -103,6 +105,12 @@ NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;

 YetAnotherOne = 0x300 : SELF=full ;
 YetAnotherOne = 0x300 : ALL=limited ;

+ShareIO = 0x80 , defmember=full : 0x123451, 0x123452; # 0x123453, 0x123454 will be limited
+ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full; # 0x123456, 0x123457 will be limited
+ShareIO = 0x80 : defmember=limited : 0x123456, 0x123457, 0x123458=full;
+ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
+ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;
+
 Note:
 ----

From hal.rosenstock at gmail.com Mon Apr 13 11:52:08 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Mon, 13 Apr 2009 14:52:08 -0400
Subject: ***SPAM*** Re: [ofa-general] Any easy way to specify to the SM to route/zone?
In-Reply-To: References: Message-ID: On Mon, Apr 13, 2009 at 2:26 PM, Chris Worley wrote: > On Mon, Apr 13, 2009 at 11:53 AM, Hal Rosenstock > wrote: >> On Mon, Apr 13, 2009 at 12:02 PM, Chris Worley wrote: >>> On Mon, Apr 13, 2009 at 7:43 AM, Hal Rosenstock wrote: >>>> On Mon, Apr 13, 2009 at 9:37 AM, Chris Worley wrote: >>>>> On Mon, Apr 13, 2009 at 5:39 AM, Hal Rosenstock >>>>> wrote: >>>>>> On Sun, Apr 12, 2009 at 11:01 PM, Chris Worley wrote: >>>>>>> >>>>>>> So I need to tell the SM to route specific ports on the server/target >>>>>>> to specific clients/initiators. >>>>>>> >>>>>>> Is there any way to do this? >>>>>> >>>>>> Do you mean restrict access between certain clients/servers ? >>>>> >>>>> One server w/ 4QDR boards, 16 clients with one QDR board.  I want each >>>>> port on the server routed/zoned to two clients. >>>>> >>>>>> If so, >>>>>> you can do this with partitioning >>>>> >>>>> What is partitioning? >>>> >>>> A partition is a collection of ports which are allowed to communicate >>>> together. There are two forms of members: full members which can talk >>>> to any other member (useful for servers) and limited members which can >>>> only talk to full members (useful for clients). See the opensm man >>>> page or partition-config.txt on setting this up for OpenSM. >>>> >>> >>> Let me see if I understand this with a simple example... 
my port GUIDs >>> (as reported by ibstat) are for one server (4 QDR ports) and four >>> clients (one QDR port each): >>> >>> >>> Server A:           Port GUID: 0x0024717124000029 >>> Server B:           Port GUID: 0x002471712400002a >>> Server C:           Port GUID: 0x0024717127000035 >>> Server D:           Port GUID: 0x0024717127000036 >>> >>> Client 1:                Port GUID: 0x0002c90300028c01 >>> Client 2:                Port GUID: 0x0002c90300026047 >>> Client 3:                Port GUID: 0x0002c90300026053 >>> Client 4:                Port GUID: 0x0002c9030002603b >>> >>> Assuming I want a 1:1 (one server port to one client) partitioning, I >>> would put the following in /etc/ofed/partitions.conf: >>> >>> part1=0x1, ipoib, defmember=full : 0x0024717124000029, 0x0002c90300028c01; >>> part2=0x2, ipoib, defmember=full : 0x002471712400002a, 0x0002c90300026047; >>> part3=0x3, ipoib, defmember=full : 0x0024717127000035, 0x0002c90300026053; >>> part4=0x4, ipoib, defmember=full : 0x0024717127000036, 0x0002c9030002603b; >> >> So you want IPoIB. > > I'm doing SRP, so I need IPoIB working. SRP needs to query PathRecord with the correct PKey and use the correct Pkey index for that partition. I'm not sure how that is done in SRP but first IPoIB needs to be made to work (again). >> >>> ... and run w/: >>> >>> opensm -r -B -P/etc/ofed/partitions.conf Also, do you need to use -r ? It's better not to (reassign LIDs). >>> Does that sound correct?  It doesn't work >> >> What application(s) aren't working ? > > ping over IPoIB, for example. > > I am seeing the test node in an "initializing" state right now... I > thought it was "up" before. Yes, this has gone "backwards" (not as far along yet...) >> Any SM error messages ? 
> > The server has one klogd error coming out continuously: > > ib0: multicast join failed for > ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22 IPoIB broadcast group (on the default partition) can't be joined (I'm presuming due to the current partition setup (e.g. it worked prior to this, right ?)). You need to do some IPoIB configuration relative to partitions as well. See kernel Documentation/infiniband/ipoib.txt for help with this. > OpenSM is seeing "lid out of range", "send completed with error", > "Failed to find source physical port for trap" > Opensm's log looks like: > > Apr 13 12:03:43 556996 [21085350] 0x03 -> OpenSM 3.2.2 > Apr 13 12:03:43 557061 [21085350] 0x80 -> OpenSM 3.2.2 > Apr 13 12:03:43 557556 [21085350] 0x02 -> osm_vendor_init: 1000 > pending umads specified > Apr 13 12:03:43 557659 [21085350] 0x80 -> Entering DISCOVERING state > Apr 13 12:03:43 605573 [21085350] 0x02 -> osm_vendor_bind: Binding to > port 0x24717124000029 > Apr 13 12:03:43 636142 [21085350] 0x02 -> osm_vendor_bind: Binding to > port 0x24717124000029 > Apr 13 12:03:44 437076 [4863C940] 0x01 -> umad_receiver: ERR 5409: > send completed with error (method=0x1 attr=0x11 trans_id=0x520000123b) > -- dropping > Apr 13 12:03:44 437104 [4863C940] 0x01 -> umad_receiver: ERR 5411: DR > SMP Hop Ptr: 0x0 > Apr 13 12:03:44 437126 [4863C940] 0x01 -> Received SMP on a 1 hop path: >                                Initial path = 0,0 >                                Return path  = 0,0 > Apr 13 12:03:44 437135 [4863C940] 0x01 -> > __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error > (IB_TIMEOUT) > Apr 13 12:03:44 437179 [4863C940] 0x01 -> SMP dump: >                                base_ver................0x1 >                                mgmt_class..............0x81 >                                class_ver...............0x1 >                                method..................0x1 (SubnGet) >                                D bit...................0x0 >                    
            status..................0x0 >                                hop_ptr.................0x0 >                                hop_count...............0x1 >                                trans_id................0x123b >                                attr_id.................0x11 (NodeInfo) >                                resv....................0x0 >                                attr_mod................0x0 >                                m_key...................0x0000000000000000 >                                dr_slid.................65535 >                                dr_dlid.................65535 > >                                Initial path: 0,1 >                                Return path:  0,0 >                                Reserved:     [0][0][0][0][0][0][0] > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > > Apr 13 12:03:44 437218 [47C3B940] 0x80 -> Entering MASTER state > Apr 13 12:03:44 437409 [47C3B940] 0x02 -> osm_report_notice: Reporting > Generic Notice type:3 num:66 from LID:0 > GID:0xfe80000000000000,0x0024717124000029 > Apr 13 12:03:44 437458 [47C3B940] 0x02 -> osm_report_notice: Reporting > Generic Notice type:3 num:66 from LID:0 > GID:0xfe80000000000000,0x0024717124000029 > Apr 13 12:03:44 437514 [47C3B940] 0x02 -> osm_report_notice: Reporting > Generic Notice type:3 num:66 from LID:0 > GID:0xfe80000000000000,0x0024717124000029 > Apr 13 12:03:44 437558 [47C3B940] 0x02 -> osm_report_notice: Reporting > Generic Notice type:3 num:66 from LID:0 > GID:0xfe80000000000000,0x0024717124000029 > Apr 13 12:03:44 437612 [47C3B940] 0x02 -> osm_report_notice: Reporting > Generic Notice type:3 num:66 from LID:0 > 
GID:0xfe80000000000000,0x0024717124000029 > Apr 13 12:03:44 437653 [47C3B940] 0x02 -> osm_report_notice: Reporting > Generic Notice type:3 num:66 from LID:0 > GID:0xfe80000000000000,0x0024717124000029 > Apr 13 12:03:44 437707 [47C3B940] 0x02 -> osm_report_notice: Reporting > Generic Notice type:3 num:66 from LID:0 > GID:0xfe80000000000000,0x0024717124000029 > Apr 13 12:03:44 437748 [47C3B940] 0x02 -> osm_report_notice: Reporting > Generic Notice type:3 num:66 from LID:0 > GID:0xfe80000000000000,0x0024717124000029 > Apr 13 12:03:44 443077 [47C3B940] 0x80 -> SUBNET UP > Apr 13 12:03:44 891932 [42232940] 0x01 -> > __osm_trap_rcv_process_request: Received Generic Notice type:0x02 > num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f > Apr 13 12:03:44 891951 [42232940] 0x01 -> osm_get_physp_by_mad_addr: > ERR 7503: Lid is out of range: 10 > Apr 13 12:03:44 891959 [42232940] 0x01 -> > __osm_trap_rcv_process_request: ERR 3809: Failed to find source > physical port for trap > Apr 13 12:03:45 184124 [44035940] 0x01 -> __osm_mcmr_rcv_join_mgrp: > ERR 1B11: method = SubnAdmSet, scope_state = 0x1, component mask = > 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: > 0xff12401bffff0000 : 0x00000000ffffffff from port 0x0 > 24717124000029 (MT25408) > > ... 
> > Apr 13 12:04:04 852289 [43634940] 0x01 -> > __osm_trap_rcv_process_request: Received Generic Notice type:0x02 > num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f > Apr 13 12:04:04 852306 [43634940] 0x01 -> osm_get_physp_by_mad_addr: > ERR 7503: Lid is out of range: 10 > Apr 13 12:04:04 852314 [43634940] 0x01 -> > __osm_trap_rcv_process_request: ERR 3809: Failed to find source > physical port for trap > Apr 13 12:04:04 852363 [43634940] 0x01 -> > __osm_trap_rcv_process_request: ERR 3804: Received trap 20 times > consecutively > Apr 13 12:04:05 850307 [44035940] 0x01 -> > __osm_trap_rcv_process_request: Received Generic Notice type:0x02 > num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f > Apr 13 12:04:05 850327 [44035940] 0x01 -> osm_get_physp_by_mad_addr: > ERR 7503: Lid is out of range: 10 > Apr 13 12:04:05 850334 [44035940] 0x01 -> > __osm_trap_rcv_process_request: ERR 3809: Failed to find source > physical port for trap > Apr 13 12:04:06 848327 [44A36940] 0x01 -> > __osm_trap_rcv_process_request: Received Generic Notice type:0x02 > num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f > Apr 13 12:04:06 848340 [44A36940] 0x01 -> osm_get_physp_by_mad_addr: > ERR 7503: Lid is out of range: 10 > Apr 13 12:04:06 848348 [44A36940] 0x01 -> > __osm_trap_rcv_process_request: ERR 3809: Failed to find source > physical port for trap > Apr 13 12:04:07 846349 [45437940] 0x01 -> > __osm_trap_rcv_process_request: Received Generic Notice type:0x02 > num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f > Apr 13 12:04:07 846365 [45437940] 0x01 -> osm_get_physp_by_mad_addr: > ERR 7503: Lid is out of range: 10 > Apr 13 12:04:07 846373 [45437940] 0x01 -> > __osm_trap_rcv_process_request: ERR 3809: Failed to find source > physical port for trap > Apr 13 12:04:08 844372 [45E38940] 0x01 -> > __osm_trap_rcv_process_request: Received Generic Notice type:0x02 > num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f > Apr 13 12:04:08 
844391 [45E38940] 0x01 -> osm_get_physp_by_mad_addr: > ERR 7503: Lid is out of range: 10 > Apr 13 12:04:08 844398 [45E38940] 0x01 -> > __osm_trap_rcv_process_request: ERR 3809: Failed to find source > physical port for trap > Apr 13 12:04:09 842394 [46839940] 0x01 -> > __osm_trap_rcv_process_request: Received Generic Notice type:0x02 > num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f > Apr 13 12:04:09 842414 [46839940] 0x01 -> osm_get_physp_by_mad_addr: > ERR 7503: Lid is out of range: 10 > Apr 13 12:04:09 842421 [46839940] 0x01 -> > __osm_trap_rcv_process_request: ERR 3809: Failed to find source > physical port for trap > Apr 13 12:04:10 840400 [42232940] 0x01 -> > __osm_trap_rcv_process_request: Received Generic Notice type:0x02 > num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f > Apr 13 12:04:10 840414 [42232940] 0x01 -> osm_get_physp_by_mad_addr: > ERR 7503: Lid is out of range: 10 > Apr 13 12:04:10 840421 [42232940] 0x01 -> > __osm_trap_rcv_process_request: ERR 3809: Failed to find source > physical port for trap > Apr 13 12:04:11 838419 [42C33940] 0x01 -> > __osm_trap_rcv_process_request: Received Generic Notice type:0x02 > num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f > Apr 13 12:04:11 838432 [42C33940] 0x01 -> osm_get_physp_by_mad_addr: > ERR 7503: Lid is out of range: 10 > Apr 13 12:04:11 838440 [42C33940] 0x01 -> > __osm_trap_rcv_process_request: ERR 3809: Failed to find source > physical port for trap > Apr 13 12:04:12 836435 [43634940] 0x01 -> > __osm_trap_rcv_process_request: Received Generic Notice type:0x02 > num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f > Apr 13 12:04:12 836467 [43634940] 0x01 -> osm_get_physp_by_mad_addr: > ERR 7503: Lid is out of range: 10 > Apr 13 12:04:12 836476 [43634940] 0x01 -> > __osm_trap_rcv_process_request: ERR 3809: Failed to find source > physical port for trap > Apr 13 12:04:13 834459 [45437940] 0x01 -> > __osm_trap_rcv_process_request: Received Generic 
Notice type:0x02 > num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f > Apr 13 12:04:13 834479 [45437940] 0x01 -> osm_get_physp_by_mad_addr: > ERR 7503: Lid is out of range: 10 > Apr 13 12:04:13 834487 [45437940] 0x01 -> > __osm_trap_rcv_process_request: ERR 3809: Failed to find source > physical port for trap > Apr 13 12:04:14 364185 [4863C940] 0x01 -> umad_receiver: ERR 5409: > send completed with error (method=0x1 attr=0x11 trans_id=0x5200001266) > -- dropping > Apr 13 12:04:14 364211 [4863C940] 0x01 -> umad_receiver: ERR 5411: DR > SMP Hop Ptr: 0x0 > > ... > > Apr 13 12:19:51 971642 [453B6940] 0x01 -> > __osm_trap_rcv_process_request: Received Generic Notice type:0x02 > num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f > Apr 13 12:19:51 971658 [453B6940] 0x01 -> osm_get_physp_by_mad_addr: > ERR 7503: Lid is out of range: 10 > Apr 13 12:19:51 971666 [453B6940] 0x01 -> > __osm_trap_rcv_process_request: ERR 3809: Failed to find source > physical port for trap > Apr 13 12:19:52 969658 [45DB7940] 0x01 -> > __osm_trap_rcv_process_request: Received Generic Notice type:0x02 > num:128 Producer:2 (Switch) from LID:10 TID:0x0000000000000190 > Apr 13 12:19:52 969671 [45DB7940] 0x01 -> osm_get_physp_by_mad_addr: > ERR 7503: Lid is out of range: 10 > Apr 13 12:19:52 969679 [45DB7940] 0x01 -> > __osm_trap_rcv_process_request: ERR 3809: Failed to find source > physical port for trap > Apr 13 12:19:53 967681 [467B8940] 0x01 -> > __osm_trap_rcv_process_request: Received Generic Notice type:0x02 > num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f > Apr 13 12:19:53 967696 [467B8940] 0x01 -> osm_get_physp_by_mad_addr: > ERR 7503: Lid is out of range: 10 > Apr 13 12:19:53 967704 [467B8940] 0x01 -> > __osm_trap_rcv_process_request: ERR 3809: Failed to find source > physical port for trap > Apr 13 12:19:54 965697 [471B9940] 0x01 -> > __osm_trap_rcv_process_request: Received Generic Notice type:0x02 > num:128 Producer:2 (Switch) from LID:10 
TID:0x0000000000000190 > Apr 13 12:19:54 965710 [471B9940] 0x01 -> osm_get_physp_by_mad_addr: > ERR 7503: Lid is out of range: 10 > Apr 13 12:19:54 965717 [471B9940] 0x01 -> > __osm_trap_rcv_process_request: ERR 3809: Failed to find source > physical port for trap > Apr 13 12:19:55 963717 [42BB2940] 0x01 -> > __osm_trap_rcv_process_request: Received Generic Notice type:0x02 > num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f > Apr 13 12:19:55 963735 [42BB2940] 0x01 -> osm_get_physp_by_mad_addr: > ERR 7503: Lid is out of range: 10 > Apr 13 12:19:55 963743 [42BB2940] 0x01 -> > __osm_trap_rcv_process_request: ERR 3809: Failed to find source > physical port for trap > Apr 13 12:19:56 961736 [435B3940] 0x01 -> > __osm_trap_rcv_process_request: Received Generic Notice type:0x02 > num:128 Producer:2 (Switch) from LID:10 TID:0x0000000000000190 > Apr 13 12:19:56 961749 [435B3940] 0x01 -> osm_get_physp_by_mad_addr: > ERR 7503: Lid is out of range: 10 > Apr 13 12:19:56 961779 [435B3940] 0x01 -> > __osm_trap_rcv_process_request: ERR 3809: Failed to find source > physical port for trap > Apr 13 12:19:57 959748 [43FB4940] 0x01 -> > __osm_trap_rcv_process_request: Received Generic Notice type:0x02 > num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f > Apr 13 12:19:57 959771 [43FB4940] 0x01 -> osm_get_physp_by_mad_addr: > ERR 7503: Lid is out of range: 10 > Apr 13 12:19:57 959779 [43FB4940] 0x01 -> > __osm_trap_rcv_process_request: ERR 3809: Failed to find source > physical port for trap > Apr 13 12:19:58 957770 [449B5940] 0x01 -> > __osm_trap_rcv_process_request: Received Generic Notice type:0x02 > num:128 Producer:2 (Switch) from LID:10 TID:0x0000000000000190 > Apr 13 12:19:58 957788 [449B5940] 0x01 -> osm_get_physp_by_mad_addr: > ERR 7503: Lid is out of range: 10 > Apr 13 12:19:58 957795 [449B5940] 0x01 -> > __osm_trap_rcv_process_request: ERR 3809: Failed to find source > physical port for trap > Apr 13 12:19:59 955793 [453B6940] 0x01 -> > 
__osm_trap_rcv_process_request: Received Generic Notice type:0x02 > num:259 Producer:2 (Switch) from LID:10 TID:0x000000000000018f > Apr 13 12:19:59 955806 [453B6940] 0x01 -> osm_get_physp_by_mad_addr: > ERR 7503: Lid is out of range: 10 > Apr 13 12:19:59 955813 [453B6940] 0x01 -> > __osm_trap_rcv_process_request: ERR 3809: Failed to find source > physical port for trap > Apr 13 12:20:00 491524 [45DB7940] 0x01 -> __osm_mcmr_rcv_join_mgrp: > ERR 1B11: method = SubnAdmSet, scope_state = 0x1, component mask = > 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: > 0xff12401bffff0000 : 0x00000000ffffffff from port 0x0 > 24717124000029 (MT25408 IOSAN Fusion-IO) > Apr 13 12:20:00 953808 [42BB2940] 0x01 -> > __osm_trap_rcv_process_request: Received Generic Notice type:0x02 > num:128 Producer:2 (Switch) from LID:10 TID:0x0000000000000190 > Apr 13 12:20:00 953822 [42BB2940] 0x01 -> osm_get_physp_by_mad_addr: > ERR 7503: Lid is out of range: 10 > Apr 13 12:20:00 953830 [42BB2940] 0x01 -> > __osm_trap_rcv_process_request: ERR 3809: Failed to find source > physical port for trap > Apr 13 12:20:01 424318 [48FBC940] 0x01 -> umad_receiver: ERR 5409: > send completed with error (method=0x1 attr=0x11 trans_id=0x5500001311) > -- dropping > Apr 13 12:20:01 424345 [48FBC940] 0x01 -> umad_receiver: ERR 5411: DR > SMP Hop Ptr: 0x0 > Apr 13 12:20:01 424356 [48FBC940] 0x01 -> Received SMP on a 1 hop path: >                                Initial path = 0,0 >                                Return path  = 0,0 > Apr 13 12:20:01 424366 [48FBC940] 0x01 -> > __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error > (IB_TIMEOUT) > Apr 13 12:20:01 424410 [48FBC940] 0x01 -> SMP dump: >                                base_ver................0x1 >                                mgmt_class..............0x81 >                                class_ver...............0x1 >                                method..................0x1 (SubnGet) >                                D 
bit...................0x0 >                                status..................0x0 >                                hop_ptr.................0x0 >                                hop_count...............0x1 >                                trans_id................0x1311 >                                attr_id.................0x11 (NodeInfo) >                                resv....................0x0 >                                attr_mod................0x0 >                                m_key...................0x0000000000000000 >                                dr_slid.................65535 >                                dr_dlid.................65535 > >                                Initial path: 0,1 >                                Return path:  0,0 >                                Reserved:     [0][0][0][0][0][0][0] > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >> Any end >> node messages pertaining to IB ? > > Nothing I can see. > >> >>> (I restarted ib on the >>> clients), although ibstat shows the links up.  What am I getting >>> wrong?  The opensmd is running on the server. >> >> Which server ? > > There's only one server... it has many ports for which I'm trying to > partition do different clients.  So, in the above, when I say "Server > A", I mean server port "A". I meant which server port is running OpenSM (which GUID is being used). I see above it is 0x24717124000029 >> You still need the default partition with the SM node being full and >> the others being limited there (so it's also best to run SM on >> separate node if possible otherwise you have the potential of any >> client connecting to it on default partition). 
> > Are you saying to change the partitions.conf file to: > > part1=0x1, ipoib: 0x0024717124000029=full, 0x0002c90300028c01; > part2=0x2, ipoib: 0x002471712400002a=full, 0x0002c90300026047; > part3=0x3, ipoib: 0x0024717127000035=full, 0x0002c90300026053; > part4=0x4, ipoib: 0x0024717127000036=full, 0x0002c9030002603b; That's part of it. > ... (which still doesn't work) in which case I set all the server's > ports to "full", or should just one be "full" (which didn't work > either)? You also need: Default=0x7fff: ALL, SELF=FULL; I would put that first. > I did have a difficult time understanding the difference between > "full" and "limited" in the man page. On a given partition, full can talk with all other members whereas a limited member can only talk with full members (not other limited members). > I've got a captive network, so I don't want any paths I've not > specified to be allowed.  If that makes any sense.  So, I didn't want > to put a statement in like: > > Default=0x7fff,ipoib:ALL=full; > > ... that would let a rogue node slip through the cracks. The only one they can talk with is the SM (the way I'm proposing) so it's best if the SM node could be separate. In order for SA portion of SM to work, SM node must be a full member of the default partition and other nodes must be at least limited members (so their queries will be responded to). IPoIB is not needed on that partition. 
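As a rough illustration of the full/limited rule discussed above, the check can be modelled on two 16-bit P_Key values. This is only a sketch: it assumes the usual InfiniBand encoding, in which the top bit (0x8000) marks full membership and the low 15 bits identify the partition. The helper name may_communicate is made up for illustration and is not an OpenSM or kernel API.

```python
FULL_MEMBER_BIT = 0x8000  # assumed encoding: top bit set means full member
PKEY_BASE_MASK = 0x7FFF   # low 15 bits identify the partition


def may_communicate(pkey_a: int, pkey_b: int) -> bool:
    """Return True if two ports holding these P_Keys may talk to each other.

    Hypothetical helper: the ports must be in the same partition, and at
    least one of the two must be a full member.
    """
    same_partition = (pkey_a & PKEY_BASE_MASK) == (pkey_b & PKEY_BASE_MASK)
    a_full = bool(pkey_a & FULL_MEMBER_BIT)
    b_full = bool(pkey_b & FULL_MEMBER_BIT)
    return same_partition and (a_full or b_full)


# part1 from the thread: server port full (0x8001), client limited (0x0001)
print(may_communicate(0x8001, 0x0001))  # True: full <-> limited
print(may_communicate(0x0001, 0x0001))  # False: limited <-> limited
print(may_communicate(0x8001, 0x8002))  # False: different partitions
```

Under the same assumption, "Default=0x7fff: ALL, SELF=FULL;" gives the SM port 0xffff and every other port 0x7fff, so may_communicate(0xffff, 0x7fff) holds while two non-SM ports (0x7fff, 0x7fff) cannot reach each other on the default partition.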
-- Hal > Thanks, > > Chris >> >> -- Hal >> >>> Thanks, >>> >>> Chris >>> >> > From yosefe at voltaire.com Mon Apr 13 12:19:08 2009 From: yosefe at voltaire.com (Yossi Etigin) Date: Mon, 13 Apr 2009 22:19:08 +0300 Subject: [ofa-general] Re: [PATCH v3] mlx4_ib: Optimize hugetlab pages support In-Reply-To: <20090407065955.GA2308@mtls03> References: <20090129152725.GA26284@mtls03> <49D0EA11.2040409@Voltaire.COM> <20090402084406.GB21370@mtls03> <20090405065047.GA567@mtls03> <49DA40B7.3040004@voltaire.com> <20090407065955.GA2308@mtls03> Message-ID: <49E3902C.30704@voltaire.com> Eli Cohen wrote: > On Mon, Apr 06, 2009 at 08:49:43PM +0300, Yossi Etigin wrote: >> I don't understand - if all area is huge pages, it does not mean that >> it fills full huge pages - I can have just 4096 bytes in huge page memory >> and umem->hugetlb will remain 1, right? > > You may call ib_umem_get() with a fraction of a huge page but I expect > the number of pages returned from get_user_pages() will fill up a huge > page. Can you check that with the mckey test you were using? The number of pages is 1. 
I got this in dmesg with the modified mckey (see the last line):

umem: addr=508000 size=1024 hugetlb=0 npages=1
umem: addr=50a000 size=4096 hugetlb=0 npages=1
umem: addr=50c000 size=4352 hugetlb=0 npages=2
umem: addr=50f000 size=4096 hugetlb=0 npages=1
umem: addr=2aaaaac00000 size=140 hugetlb=1 npages=1

After applying this to umem.c:

--- ofa_kernel-1.4.1/drivers/infiniband/core/umem.c 2009-04-13 22:15:19.000000000 +0300
+++ ofa_kernel-1.4.1.patched/drivers/infiniband/core/umem.c 2009-04-13 22:09:36.000000000 +0300
@@ -137,6 +137,7 @@
 	int ret;
 	int off;
 	int i;
+	int ntotalpages;
 	DEFINE_DMA_ATTRS(attrs);
 
 	if (dmasync)
@@ -196,6 +197,7 @@
 	cur_base = addr & PAGE_MASK;
 
 	ret = 0;
+	ntotalpages = 0;
 	while (npages) {
 		ret = get_user_pages(current, current->mm, cur_base,
 				     min_t(unsigned long, npages,
@@ -226,6 +228,7 @@
 				    !is_vm_hugetlb_page(vma_list[i + off]))
 					umem->hugetlb = 0;
 				sg_set_page(&chunk->page_list[i], page_list[i + off], PAGE_SIZE, 0);
+				ntotalpages++;
 			}
 
 			chunk->nmap = ib_dma_map_sg_attrs(context->device,
@@ -254,8 +257,11 @@
 	if (ret < 0) {
 		__ib_umem_release(context->device, umem, 0);
 		kfree(umem);
-	} else
+	} else {
 		current->mm->locked_vm = locked;
+		printk(KERN_DEBUG "umem: addr=%lx size=%ld hugetlb=%d npages=%d\n",
+		       addr, size, umem->hugetlb, ntotalpages);
+	}
 
 	up_write(&current->mm->mmap_sem);
 	if (vma_list)

From akepner at sgi.com Mon Apr 13 11:46:57 2009
From: akepner at sgi.com (akepner at sgi.com)
Date: Mon, 13 Apr 2009 11:46:57 -0700
Subject: [ofa-general] [PATCH] mthca: increase INIT_HCA timeout
Message-ID: <20090413184657.GE22355@sgi.com>

Here's a little patch we've been carrying along for a while.

If the num_qp module parameter is set higher than 2^19 or so,
HCA initialization times out with EBUSY, e.g.:

  ib_mthca: probe of 0031:01:00.0 failed with error -16

A 60 second timeout seems to be sufficient for the max number of
QPs that the h/w can accommodate.

Signed-off-by: Arthur Kepner --- mthca_cmd.c | 2 +- 1 files changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c index c33e1c5..6ba8a43 100644 --- a/drivers/infiniband/hw/mthca/mthca_cmd.c +++ b/drivers/infiniband/hw/mthca/mthca_cmd.c @@ -1390,7 +1390,7 @@ int mthca_INIT_HCA(struct mthca_dev *dev, MTHCA_PUT(inbox, param->uarc_base, INIT_HCA_UAR_CTX_BASE_OFFSET); } - err = mthca_cmd(dev, mailbox->dma, 0, 0, CMD_INIT_HCA, HZ, status); + err = mthca_cmd(dev, mailbox->dma, 0, 0, CMD_INIT_HCA, 60*HZ, status); mthca_free_mailbox(dev, mailbox); return err; From worleys at gmail.com Mon Apr 13 13:09:28 2009 From: worleys at gmail.com (Chris Worley) Date: Mon, 13 Apr 2009 14:09:28 -0600 Subject: ***SPAM*** Re: [ofa-general] Any easy way to specify to the SM to route/zone? In-Reply-To: References: Message-ID: On Mon, Apr 13, 2009 at 12:52 PM, Hal Rosenstock wrote: > On Mon, Apr 13, 2009 at 2:26 PM, Chris Worley wrote: >> On Mon, Apr 13, 2009 at 11:53 AM, Hal Rosenstock >> wrote: >>> On Mon, Apr 13, 2009 at 12:02 PM, Chris Worley wrote: >>>> On Mon, Apr 13, 2009 at 7:43 AM, Hal Rosenstock wrote: >>>>> On Mon, Apr 13, 2009 at 9:37 AM, Chris Worley wrote: >>>>>> On Mon, Apr 13, 2009 at 5:39 AM, Hal Rosenstock >>>>>> wrote: >>>>>>> On Sun, Apr 12, 2009 at 11:01 PM, Chris Worley wrote: >>>>>>>> >>>>>>>> So I need to tell the SM to route specific ports on the server/target >>>>>>>> to specific clients/initiators. >>>>>>>> >>>>>>>> Is there any way to do this? >>>>>>> >>>>>>> Do you mean restrict access between certain clients/servers ? >>>>>> >>>>>> One server w/ 4QDR boards, 16 clients with one QDR board.  I want each >>>>>> port on the server routed/zoned to two clients. >>>>>> >>>>>>> If so, >>>>>>> you can do this with partitioning >>>>>> >>>>>> What is partitioning? >>>>> >>>>> A partition is a collection of ports which are allowed to communicate >>>>> together. 
There are two forms of members: full members which can talk >>>>> to any other member (useful for servers) and limited members which can >>>>> only talk to full members (useful for clients). See the opensm man >>>>> page or partition-config.txt on setting this up for OpenSM. >>>>> >>>> >>>> Let me see if I understand this with a simple example... my port GUIDs >>>> (as reported by ibstat) are for one server (4 QDR ports) and four >>>> clients (one QDR port each): >>>> >>>> >>>> Server A:           Port GUID: 0x0024717124000029 >>>> Server B:           Port GUID: 0x002471712400002a >>>> Server C:           Port GUID: 0x0024717127000035 >>>> Server D:           Port GUID: 0x0024717127000036 >>>> >>>> Client 1:                Port GUID: 0x0002c90300028c01 >>>> Client 2:                Port GUID: 0x0002c90300026047 >>>> Client 3:                Port GUID: 0x0002c90300026053 >>>> Client 4:                Port GUID: 0x0002c9030002603b >>>> >>>> Assuming I want a 1:1 (one server port to one client) partitioning, I >>>> would put the following in /etc/ofed/partitions.conf: >>>> >>>> part1=0x1, ipoib, defmember=full : 0x0024717124000029, 0x0002c90300028c01; >>>> part2=0x2, ipoib, defmember=full : 0x002471712400002a, 0x0002c90300026047; >>>> part3=0x3, ipoib, defmember=full : 0x0024717127000035, 0x0002c90300026053; >>>> part4=0x4, ipoib, defmember=full : 0x0024717127000036, 0x0002c9030002603b; >>> >>> So you want IPoIB. >> >> I'm doing SRP, so I need IPoIB working. > > SRP needs to query PathRecord with the correct PKey and use the > correct Pkey index for that partition. I'm not sure how that is done > in SRP but first IPoIB needs to be made to work (again). > Okay... I'll setup the IPoIB as the ipoib.txt suggests, i.e.: echo 0x1 > /sys/class/net/ib0/create_child ... but for now, I'm still not seeing the state go to "up"... I think that's the first problem. >>> >>>> ... and run w/: >>>> >>>> opensm -r -B -P/etc/ofed/partitions.conf > > Also, do you need to use -r ? 
It's better not to (reassign LIDs). I'm using it to assure that it just doesn't hang on to the old state, especially since I'm not getting the SM working... I don't want it to assume anything is right about the previous state. I have tried w/ and w/o and don't see a difference. The plan is, once I get it working, to remove the "-r". Or, are you suggesting I not use it? > >>>> Does that sound correct?  It doesn't work >>> >>> What application(s) aren't working ? >> >> ping over IPoIB, for example. >> >> I am seeing the test node in an "initializing" state right now... I >> thought it was "up" before. > > Yes, this has gone "backwards" (not as far along yet...) > I think getting to an "up" state is the first step. >>> Any SM error messages ? >> >> The server has one klogd error coming out continuously: >> >> ib0: multicast join failed for >> ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22 > > IPoIB broadcast group (on the default partition) can't be joined (I'm > presuming due to the current partition setup (e.g. it worked prior to > this, right ?)). > > You need to do some IPoIB configuration relative to partitions as well. > See kernel Documentation/infiniband/ipoib.txt for help with this. > Will do. As you say, the trick will be getting SRP to use the right P_Key's... but I need to get the IB in an "up" state first. >>> Which server ? >> >> There's only one server... it has many ports for which I'm trying to >> partition do different clients.  So, in the above, when I say "Server >> A", I mean server port "A". > > I meant which server port is running OpenSM (which GUID is being > used). I see above it is 0x24717124000029 That was it. I've switched to a client as the SM now, as you suggest a stand-alone SM. 
> >>> You still need the default partition with the SM node being full and >>> the others being limited there (so it's also best to run SM on >>> separate node if possible otherwise you have the potential of any >>> client connecting to it on default partition). >> >> Are you saying to change the partitions.conf file to: >> >> part1=0x1, ipoib: 0x0024717124000029=full, 0x0002c90300028c01; >> part2=0x2, ipoib: 0x002471712400002a=full, 0x0002c90300026047; >> part3=0x3, ipoib: 0x0024717127000035=full, 0x0002c90300026053; >> part4=0x4, ipoib: 0x0024717127000036=full, 0x0002c9030002603b; > > That's part of it. > >> ... (which still doesn't work) in which case I set all the server's >> ports to "full", or should just one be "full" (which didn't work >> either)? > > You also need: > Default=0x7fff: ALL, SELF=FULL; > I would put that first. So, now my /etc/ofed/partitions.conf file looks like: Default=0x7fff: ALL, SELF=FULL; part1=0x1, ipoib: 0x0002c903000292af=full, 0x0002c90300028c01; part2=0x2, ipoib: 0x0002c903000292b0=full, 0x0002c90300026047; part4=0x4, ipoib: 0x0024717124000029=full, 0x0002c9030002603b; ... I pulled out the node on partition 3 to use as an SM exclusive node, I also changed the server ports to some of the other IB ports on that machine (port GUIDs as shown by ibstat). I set the server port GUID's to "full", as I want the client GUIDs to talk to it, but not necessarily each other (as there is only one client GUID on each partition now, it's a moot point). Note that I made-up the partition P_Key's of 1, 2, and 4. Note that it still doesn't work. 
On the stand-alone SM, ibstat looks like: # ibstat CA 'mlx4_0' CA type: MT26428 Number of ports: 2 Firmware version: 2.6.0 Hardware version: a0 Node GUID: 0x0002c90300026052 System image GUID: 0x0002c90300026055 Port 1: State: Armed Physical state: LinkUp Rate: 10 Base lid: 1 LMC: 0 SM lid: 1 Capability mask: 0x0251086a Port GUID: 0x0002c90300026053 Port 2: State: Down Physical state: Polling Rate: 10 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x02510868 Port GUID: 0x0002c90300026054 ... On the server, the devices mentioned in the partitions file look like: CA 'mlx4_0' CA type: MT25418 Number of ports: 2 Firmware version: 2.6.0 Hardware version: a0 Node GUID: 0x0024717124000028 System image GUID: 0x002471712400002b Port 1: State: Initializing Physical state: LinkUp Rate: 10 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x02510868 Port GUID: 0x0024717124000029 Port 2: State: Initializing Physical state: LinkUp Rate: 10 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x02510868 Port GUID: 0x002471712400002a CA 'mlx4_1' CA type: MT26428 Number of ports: 2 Firmware version: 2.6.0 Hardware version: a0 Node GUID: 0x0002c903000292ae System image GUID: 0x0002c903000292b1 Port 1: State: Initializing Physical state: LinkUp Rate: 10 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x02510868 Port GUID: 0x0002c903000292af Port 2: State: Initializing Physical state: LinkUp Rate: 10 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x02510868 Port GUID: 0x0002c903000292b0 On one of the clients: # ibstat CA 'mlx4_0' CA type: MT26428 Number of ports: 2 Firmware version: 2.6.0 Hardware version: a0 Node GUID: 0x0002c90300026046 System image GUID: 0x0002c90300026049 Port 1: State: Initializing Physical state: LinkUp Rate: 10 Base lid: 7 LMC: 0 SM lid: 1 Capability mask: 0x02510868 Port GUID: 0x0002c90300026047 Port 2: State: Down Physical state: Polling Rate: 10 Base lid: 0 LMC: 0 SM lid: 0 Capability mask: 0x02510868 Port GUID: 0x0002c90300026048 Partition "part2" with P_Key=2 should 
connect this client's port 0 to the server on port 1 of mlx4_1 > >> I did have a difficult time understanding the difference between >> "full" and "limited" in the man page. > > On a given partition, full can talk with all other members whereas a > limited member can only talk with full members (not other limited > members). > I think I've got that correctly specified in the above partitions file. >> I've got a captive network, so I don't want any paths I've not >> specified to be allowed.  If that makes any sense.  So, I didn't want >> to put a statement in like: >> >> Default=0x7fff,ipoib:ALL=full; >> >> ... that would let a rogue node slip through the cracks. > > The only one they can talk with is the SM (the way I'm proposing) so > it's best if the SM node could be separate. It's separate now. The log looks like (in its entirety at startup): Apr 13 13:41:56 182699 [1D71CA30] 0x03 -> OpenSM 3.2.5_20081207 Apr 13 13:41:56 182764 [1D71CA30] 0x80 -> OpenSM 3.2.5_20081207 Apr 13 13:41:56 183020 [1D71CA30] 0x02 -> osm_vendor_init: 1000 pending umads specified Apr 13 13:41:56 183104 [1D71CA30] 0x80 -> Entering DISCOVERING state Apr 13 13:41:56 193181 [1D71CA30] 0x02 -> osm_vendor_bind: Binding to port 0x2c90300026053 Apr 13 13:41:56 217349 [1D71CA30] 0x02 -> osm_vendor_bind: Binding to port 0x2c90300026053 Apr 13 13:41:57 018570 [47FCE940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x110000123b) -- dropping Apr 13 13:41:57 018586 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0 Apr 13 13:41:57 018603 [47FCE940] 0x01 -> Received SMP on a 1 hop path: Initial path = 0,0 Return path = 0,0 Apr 13 13:41:57 018608 [47FCE940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Apr 13 13:41:57 018626 [47FCE940] 0x01 -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 
status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x123b attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................65535 dr_dlid.................65535 Initial path: 0,1 Return path: 0,0 Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Apr 13 13:41:57 018681 [475CD940] 0x80 -> Entering MASTER state Apr 13 13:41:57 019791 [475CD940] 0x80 -> SUBNET UP Apr 13 13:42:06 986336 [47FCE940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x1100001242) -- dropping Apr 13 13:42:06 986349 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0 Apr 13 13:42:06 986355 [47FCE940] 0x01 -> Received SMP on a 1 hop path: Initial path = 0,0 Return path = 0,0 Apr 13 13:42:06 986360 [47FCE940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Apr 13 13:42:06 986376 [47FCE940] 0x01 -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x1242 attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................65535 dr_dlid.................65535 Initial path: 0,1 Return path: 0,0 Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Apr 13 13:42:06 986708 [475CD940] 0x02 -> SUBNET UP Apr 13 13:42:16 
990103 [47FCE940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x1100001246) -- dropping Apr 13 13:42:16 990114 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0 Apr 13 13:42:16 990120 [47FCE940] 0x01 -> Received SMP on a 1 hop path: Initial path = 0,0 Return path = 0,0 Apr 13 13:42:16 990125 [47FCE940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Apr 13 13:42:16 990141 [47FCE940] 0x01 -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x1246 attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................65535 dr_dlid.................65535 Initial path: 0,1 Return path: 0,0 Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Apr 13 13:42:16 990475 [475CD940] 0x02 -> SUBNET UP Apr 13 13:42:26 990871 [47FCE940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x110000124a) -- dropping Apr 13 13:42:26 990884 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0 Apr 13 13:42:26 990890 [47FCE940] 0x01 -> Received SMP on a 1 hop path: Initial path = 0,0 Return path = 0,0 Apr 13 13:42:26 990895 [47FCE940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Apr 13 13:42:26 990912 [47FCE940] 0x01 -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 
hop_ptr.................0x0 hop_count...............0x1 trans_id................0x124a attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................65535 dr_dlid.................65535 Initial path: 0,1 Return path: 0,0 Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Apr 13 13:42:26 991227 [475CD940] 0x02 -> SUBNET UP Apr 13 13:42:36 993638 [47FCE940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x110000124e) -- dropping Apr 13 13:42:36 993649 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0 Apr 13 13:42:36 993655 [47FCE940] 0x01 -> Received SMP on a 1 hop path: Initial path = 0,0 Return path = 0,0 Apr 13 13:42:36 993660 [47FCE940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Apr 13 13:42:36 993676 [47FCE940] 0x01 -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x124e attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................65535 dr_dlid.................65535 Initial path: 0,1 Return path: 0,0 Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Apr 13 13:42:36 993996 [475CD940] 0x02 -> SUBNET UP Apr 13 13:42:46 996409 [47FCE940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x1 
attr=0x11 trans_id=0x1100001252) -- dropping Apr 13 13:42:46 996420 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0 Apr 13 13:42:46 996426 [47FCE940] 0x01 -> Received SMP on a 1 hop path: Initial path = 0,0 Return path = 0,0 Apr 13 13:42:46 996431 [47FCE940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Apr 13 13:42:46 996449 [47FCE940] 0x01 -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x1252 attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................65535 dr_dlid.................65535 Initial path: 0,1 Return path: 0,0 Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Apr 13 13:42:46 996800 [475CD940] 0x02 -> SUBNET UP Apr 13 13:42:56 999180 [47FCE940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x1100001256) -- dropping Apr 13 13:42:56 999192 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0 Apr 13 13:42:56 999198 [47FCE940] 0x01 -> Received SMP on a 1 hop path: Initial path = 0,0 Return path = 0,0 Apr 13 13:42:56 999203 [47FCE940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Apr 13 13:42:56 999220 [47FCE940] 0x01 -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x1256 attr_id.................0x11 
(NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................65535 dr_dlid.................65535 Initial path: 0,1 Return path: 0,0 Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Apr 13 13:42:56 999553 [475CD940] 0x02 -> SUBNET UP Apr 13 13:43:07 001949 [47FCE940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x110000125a) -- dropping Apr 13 13:43:07 001963 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0 Apr 13 13:43:07 001969 [47FCE940] 0x01 -> Received SMP on a 1 hop path: Initial path = 0,0 Return path = 0,0 Apr 13 13:43:07 001975 [47FCE940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Apr 13 13:43:07 001992 [47FCE940] 0x01 -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x125a attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................65535 dr_dlid.................65535 Initial path: 0,1 Return path: 0,0 Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Apr 13 13:43:07 002384 [475CD940] 0x02 -> SUBNET UP Apr 13 13:43:17 004713 [47FCE940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x110000125e) -- dropping Apr 13 13:43:17 004727 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR SMP 
Hop Ptr: 0x0 Apr 13 13:43:17 004733 [47FCE940] 0x01 -> Received SMP on a 1 hop path: Initial path = 0,0 Return path = 0,0 Apr 13 13:43:17 004738 [47FCE940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Apr 13 13:43:17 004755 [47FCE940] 0x01 -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x125e attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................65535 dr_dlid.................65535 Initial path: 0,1 Return path: 0,0 Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Apr 13 13:43:17 005140 [475CD940] 0x02 -> SUBNET UP Apr 13 13:43:27 007482 [47FCE940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x1100001262) -- dropping Apr 13 13:43:27 007497 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0 Apr 13 13:43:27 007503 [47FCE940] 0x01 -> Received SMP on a 1 hop path: Initial path = 0,0 Return path = 0,0 Apr 13 13:43:27 007508 [47FCE940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Apr 13 13:43:27 007524 [47FCE940] 0x01 -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x1262 attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 
dr_slid.................65535 dr_dlid.................65535 Initial path: 0,1 Return path: 0,0 Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Apr 13 13:43:27 007958 [475CD940] 0x02 -> SUBNET UP Apr 13 13:43:37 010250 [47FCE940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x1100001266) -- dropping Apr 13 13:43:37 010264 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0 Apr 13 13:43:37 010270 [47FCE940] 0x01 -> Received SMP on a 1 hop path: Initial path = 0,0 Return path = 0,0 Apr 13 13:43:37 010275 [47FCE940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Apr 13 13:43:37 010292 [47FCE940] 0x01 -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x1266 attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................65535 dr_dlid.................65535 Initial path: 0,1 Return path: 0,0 Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Apr 13 13:43:37 010716 [475CD940] 0x02 -> SUBNET UP Apr 13 13:43:47 013017 [47FCE940] 0x01 -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x110000126a) -- dropping Apr 13 13:43:47 013029 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0 Apr 13 13:43:47 013035 [47FCE940] 0x01 -> Received SMP on a 1 hop path: Initial path = 0,0 Return 
path = 0,0 Apr 13 13:43:47 013059 [47FCE940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT) Apr 13 13:43:47 013077 [47FCE940] 0x01 -> SMP dump: base_ver................0x1 mgmt_class..............0x81 class_ver...............0x1 method..................0x1 (SubnGet) D bit...................0x0 status..................0x0 hop_ptr.................0x0 hop_count...............0x1 trans_id................0x126a attr_id.................0x11 (NodeInfo) resv....................0x0 attr_mod................0x0 m_key...................0x0000000000000000 dr_slid.................65535 dr_dlid.................65535 Initial path: 0,1 Return path: 0,0 Reserved: [0][0][0][0][0][0][0] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > In order for SA portion of SM to work, SM node must be a full member > of the default partition and other nodes must be at least limited > members (so their queries will be responded to). IPoIB is not needed > on that partition. I think I've got the partition file specified correctly... but then again obviously not, as it doesn't work. Thanks, Chris From sashak at voltaire.com Mon Apr 13 13:11:00 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 13 Apr 2009 23:11:00 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] opensm/osm_ucast_ftree.c: lids are always handled in host order In-Reply-To: <49E21234.2060906@dev.mellanox.co.il> References: <49E21234.2060906@dev.mellanox.co.il> Message-ID: <20090413201100.GE5521@sk> On 19:09 Sun 12 Apr , Yevgeny Kliteynik wrote: > Hi Sasha, > > There's a mess in host vs. network order in lids handling in ftree. > In vast majority of the cases lid is required to be in host order, > so there are many cl_ntoh16() conversions. > Fixing it to be always in host order. > > Signed-off-by: Yevgeny Kliteynik Applied. Thanks. 
Sasha From sashak at voltaire.com Mon Apr 13 13:11:45 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 13 Apr 2009 23:11:45 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] opensm/partition-config.txt: Update for defmember feature In-Reply-To: <20090413183026.GA12839@comcast.net> References: <20090413183026.GA12839@comcast.net> Message-ID: <20090413201145.GF5521@sk> On 14:30 Mon 13 Apr , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From hal.rosenstock at gmail.com Mon Apr 13 14:01:10 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 13 Apr 2009 17:01:10 -0400 Subject: ***SPAM*** Re: [ofa-general] Any easy way to specify to the SM to route/zone? In-Reply-To: References: Message-ID: On Mon, Apr 13, 2009 at 4:09 PM, Chris Worley wrote: > On Mon, Apr 13, 2009 at 12:52 PM, Hal Rosenstock > wrote: >> On Mon, Apr 13, 2009 at 2:26 PM, Chris Worley wrote: >>> On Mon, Apr 13, 2009 at 11:53 AM, Hal Rosenstock >>> wrote: >>>> On Mon, Apr 13, 2009 at 12:02 PM, Chris Worley wrote: >>>>> On Mon, Apr 13, 2009 at 7:43 AM, Hal Rosenstock wrote: >>>>>> On Mon, Apr 13, 2009 at 9:37 AM, Chris Worley wrote: >>>>>>> On Mon, Apr 13, 2009 at 5:39 AM, Hal Rosenstock >>>>>>> wrote: >>>>>>>> On Sun, Apr 12, 2009 at 11:01 PM, Chris Worley wrote: >>>>>>>>> >>>>>>>>> So I need to tell the SM to route specific ports on the server/target >>>>>>>>> to specific clients/initiators. >>>>>>>>> >>>>>>>>> Is there any way to do this? >>>>>>>> >>>>>>>> Do you mean restrict access between certain clients/servers ? >>>>>>> >>>>>>> One server w/ 4QDR boards, 16 clients with one QDR board.  I want each >>>>>>> port on the server routed/zoned to two clients. >>>>>>> >>>>>>>> If so, >>>>>>>> you can do this with partitioning >>>>>>> >>>>>>> What is partitioning? >>>>>> >>>>>> A partition is a collection of ports which are allowed to communicate >>>>>> together. 
There are two forms of members: full members which can talk >>>>>> to any other member (useful for servers) and limited members which can >>>>>> only talk to full members (useful for clients). See the opensm man >>>>>> page or partition-config.txt on setting this up for OpenSM. >>>>>> >>>>> >>>>> Let me see if I understand this with a simple example... my port GUIDs >>>>> (as reported by ibstat) are for one server (4 QDR ports) and four >>>>> clients (one QDR port each): >>>>> >>>>> >>>>> Server A:           Port GUID: 0x0024717124000029 >>>>> Server B:           Port GUID: 0x002471712400002a >>>>> Server C:           Port GUID: 0x0024717127000035 >>>>> Server D:           Port GUID: 0x0024717127000036 >>>>> >>>>> Client 1:                Port GUID: 0x0002c90300028c01 >>>>> Client 2:                Port GUID: 0x0002c90300026047 >>>>> Client 3:                Port GUID: 0x0002c90300026053 >>>>> Client 4:                Port GUID: 0x0002c9030002603b Is there a switch in between or just back to back HCA ports ? >>>>> >>>>> Assuming I want a 1:1 (one server port to one client) partitioning, I >>>>> would put the following in /etc/ofed/partitions.conf: >>>>> >>>>> part1=0x1, ipoib, defmember=full : 0x0024717124000029, 0x0002c90300028c01; >>>>> part2=0x2, ipoib, defmember=full : 0x002471712400002a, 0x0002c90300026047; >>>>> part3=0x3, ipoib, defmember=full : 0x0024717127000035, 0x0002c90300026053; >>>>> part4=0x4, ipoib, defmember=full : 0x0024717127000036, 0x0002c9030002603b; >>>> >>>> So you want IPoIB. >>> >>> I'm doing SRP, so I need IPoIB working. >> >> SRP needs to query PathRecord with the correct PKey and use the >> correct Pkey index for that partition. I'm not sure how that is done >> in SRP but first IPoIB needs to be made to work (again). >> > > Okay... I'll setup the IPoIB as the ipoib.txt suggests, i.e.: > > echo 0x1 > /sys/class/net/ib0/create_child > > ... but for now, I'm still not seeing the state go to "up"... I think > that's the first problem. 
Yes, port state needs to be linkup/active first. I see LinkUp/Armed from below. >>>> >>>>> ... and run w/: >>>>> >>>>> opensm -r -B -P/etc/ofed/partitions.conf >> >> Also, do you need to use -r ? It's better not to (reassign LIDs). > > I'm using it to assure that it just doesn't hang on to the old state, > especially since I'm not getting the SM working... OK. > I don't want it to > assume anything is right about the previous state. > > I have tried w/ and w/o and don't see a difference. > > The plan is, once I get it working, to remove the "-r". That's fine. >  Or, are you suggesting I not use it? > >> >>>>> Does that sound correct?  It doesn't work >>>> >>>> What application(s) aren't working ? >>> >>> ping over IPoIB, for example. >>> >>> I am seeing the test node in an "initializing" state right now... I >>> thought it was "up" before. >> >> Yes, this has gone "backwards" (not as far along yet...) >> > > I think getting to an "up" state is the first step. Were the ports getting to LinkUp/Active before partitions were configured ? >>>> Any SM error messages ? >>> >>> The server has one klogd error coming out continuously: >>> >>> ib0: multicast join failed for >>> ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22 >> >> IPoIB broadcast group (on the default partition) can't be joined (I'm >> presuming due to the current partition setup (e.g. it worked prior to >> this, right ?)). >> >> You need to do some IPoIB configuration relative to partitions as well. >> See kernel Documentation/infiniband/ipoib.txt for help with this. >> > > Will do.  As you say, the trick will be getting SRP to use the right > P_Key's... but I need to get the IB in an "up" state first. > > >>>> Which server ? >>> >>> There's only one server... it has many ports for which I'm trying to >>> partition do different clients.  So, in the above, when I say "Server >>> A", I mean server port "A". >> >> I meant which server port is running OpenSM (which GUID is being >> used). 
I see above it is 0x24717124000029 > > That was it.  I've switched to a client as the SM now, as you suggest > a stand-alone SM. So it's no longer a client in the ULP sense, right ? >> >>>> You still need the default partition with the SM node being full and >>>> the others being limited there (so it's also best to run SM on >>>> separate node if possible otherwise you have the potential of any >>>> client connecting to it on default partition). >>> >>> Are you saying to change the partitions.conf file to: >>> >>> part1=0x1, ipoib: 0x0024717124000029=full, 0x0002c90300028c01; >>> part2=0x2, ipoib: 0x002471712400002a=full, 0x0002c90300026047; >>> part3=0x3, ipoib: 0x0024717127000035=full, 0x0002c90300026053; >>> part4=0x4, ipoib: 0x0024717127000036=full, 0x0002c9030002603b; >> >> That's part of it. >> >>> ... (which still doesn't work) in which case I set all the server's >>> ports to "full", or should just one be "full" (which didn't work >>> either)? >> >> You also need: >> Default=0x7fff: ALL, SELF=FULL; >> I would put that first. > > So, now my /etc/ofed/partitions.conf file looks like: > > Default=0x7fff: ALL, SELF=FULL; > part1=0x1, ipoib: 0x0002c903000292af=full, 0x0002c90300028c01; > part2=0x2, ipoib: 0x0002c903000292b0=full, 0x0002c90300026047; > part4=0x4, ipoib: 0x0024717124000029=full, 0x0002c9030002603b; > ... I pulled out the node on partition 3 to use as an SM exclusive > node, I also changed the server ports to some of the other IB ports on > that machine (port GUIDs as shown by ibstat).  I set the server port > GUID's to "full", as I want the client GUIDs to talk to it, but not > necessarily each other (as there is only one client GUID on each > partition now, it's a moot point). > > Note that I made-up the partition P_Key's of 1, 2, and 4. This all looks/sounds fine to me. > Note that it still doesn't work.  
On the stand-alone SM, ibstat looks like: > > # ibstat > CA 'mlx4_0' >        CA type: MT26428 >        Number of ports: 2 >        Firmware version: 2.6.0 >        Hardware version: a0 >        Node GUID: 0x0002c90300026052 >        System image GUID: 0x0002c90300026055 >        Port 1: >                State: Armed >                Physical state: LinkUp >                Rate: 10 >                Base lid: 1 >                LMC: 0 >                SM lid: 1 >                Capability mask: 0x0251086a >                Port GUID: 0x0002c90300026053 >        Port 2: >                State: Down >                Physical state: Polling >                Rate: 10 >                Base lid: 0 >                LMC: 0 >                SM lid: 0 >                Capability mask: 0x02510868 >                Port GUID: 0x0002c90300026054 What's at the other end of port 1 ? Would you do smpquery portinfo for this HCA port and it's peer port ? > ... On the server, the devices mentioned in the partitions file look like: > > CA 'mlx4_0' >        CA type: MT25418 >        Number of ports: 2 >        Firmware version: 2.6.0 >        Hardware version: a0 >        Node GUID: 0x0024717124000028 >        System image GUID: 0x002471712400002b >        Port 1: >                State: Initializing >                Physical state: LinkUp >                Rate: 10 >                Base lid: 0 >                LMC: 0 >                SM lid: 0 >                Capability mask: 0x02510868 >                Port GUID: 0x0024717124000029 >        Port 2: >                State: Initializing >                Physical state: LinkUp >                Rate: 10 >                Base lid: 0 >                LMC: 0 >                SM lid: 0 >                Capability mask: 0x02510868 >                Port GUID: 0x002471712400002a > CA 'mlx4_1' >        CA type: MT26428 >        Number of ports: 2 >        Firmware version: 2.6.0 >        Hardware version: a0 >        Node GUID: 0x0002c903000292ae > 
       System image GUID: 0x0002c903000292b1 >        Port 1: >                State: Initializing >                Physical state: LinkUp >                Rate: 10 >                Base lid: 0 >                LMC: 0 >                SM lid: 0 >                Capability mask: 0x02510868 >                Port GUID: 0x0002c903000292af >        Port 2: >                State: Initializing >                Physical state: LinkUp >                Rate: 10 >                Base lid: 0 >                LMC: 0 >                SM lid: 0 >                Capability mask: 0x02510868 >                Port GUID: 0x0002c903000292b0 So no SM initialization is occurring there since they are still just in Init. > On one of the clients: > > # ibstat > CA 'mlx4_0' >        CA type: MT26428 >        Number of ports: 2 >        Firmware version: 2.6.0 >        Hardware version: a0 >        Node GUID: 0x0002c90300026046 >        System image GUID: 0x0002c90300026049 >        Port 1: >                State: Initializing >                Physical state: LinkUp >                Rate: 10 >                Base lid: 7 >                LMC: 0 >                SM lid: 1 >                Capability mask: 0x02510868 >                Port GUID: 0x0002c90300026047 >        Port 2: >                State: Down >                Physical state: Polling >                Rate: 10 >                Base lid: 0 >                LMC: 0 >                SM lid: 0 >                Capability mask: 0x02510868 >                Port GUID: 0x0002c90300026048 Ditto. Down means it's likely a port that is not connected. > Partition "part2" with P_Key=2 should connect this client's port 0 to > the sever on port 1 of mlx4_1 Do you really mean port 0 ? >> >>> I did have a difficult time understanding the difference between >>> "full" and "limited" in the man page. 
>> >> On a given partition, full can talk with all other members whereas a >> limited member can only talk with full members (not other limited >> members). >> > > I think I've got that correctly specified in the above partitions file. > >>> I've got a captive network, so I don't want any paths I've not >>> specified to be allowed.  If that makes any sense.  So, I didn't want >>> to put a statement in like: >>> >>> Default=0x7fff,ipoib:ALL=full; >>> >>> ... that would let a rogue node slip through the cracks. >> >> The only one they can talk with is the SM (the way I'm proposing) so >> it's best if the SM node could be separate. > > It's separate now.  The log looks like (in its entirety at statup): > > Apr 13 13:41:56 182699 [1D71CA30] 0x03 -> OpenSM 3.2.5_20081207 > Apr 13 13:41:56 182764 [1D71CA30] 0x80 -> OpenSM 3.2.5_20081207 > Apr 13 13:41:56 183020 [1D71CA30] 0x02 -> osm_vendor_init: 1000 > pending umads specified > Apr 13 13:41:56 183104 [1D71CA30] 0x80 -> Entering DISCOVERING state > Apr 13 13:41:56 193181 [1D71CA30] 0x02 -> osm_vendor_bind: Binding to > port 0x2c90300026053 > Apr 13 13:41:56 217349 [1D71CA30] 0x02 -> osm_vendor_bind: Binding to > port 0x2c90300026053 > Apr 13 13:41:57 018570 [47FCE940] 0x01 -> umad_receiver: ERR 5409: > send completed with error (method=0x1 attr=0x11 trans_id=0x110000123b) > -- dropping > Apr 13 13:41:57 018586 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR > SMP Hop Ptr: 0x0 > Apr 13 13:41:57 018603 [47FCE940] 0x01 -> Received SMP on a 1 hop path: >                                Initial path = 0,0 >                                Return path  = 0,0 > Apr 13 13:41:57 018608 [47FCE940] 0x01 -> > __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error > (IB_TIMEOUT) > Apr 13 13:41:57 018626 [47FCE940] 0x01 -> SMP dump: >                                base_ver................0x1 >                                mgmt_class..............0x81 >                                class_ver...............0x1 >               
                 method..................0x1 (SubnGet) >                                D bit...................0x0 >                                status..................0x0 >                                hop_ptr.................0x0 >                                hop_count...............0x1 >                                trans_id................0x123b >                                attr_id.................0x11 (NodeInfo) >                                resv....................0x0 >                                attr_mod................0x0 >                                m_key...................0x0000000000000000 >                                dr_slid.................65535 >                                dr_dlid.................65535 > >                                Initial path: 0,1 >                                Return path:  0,0 >                                Reserved:     [0][0][0][0][0][0][0] > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 This is the first level problem. Some SMA is not responding to a NodeInfo query from the SM. Whatever is the next hop from the SM port appears not to be responding. You may need to reboot that device or otherwise reset it to see if this clears this issue. 
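To see how often each error recurs in a log like the one quoted above, a small tally can help. The following is a minimal Python sketch under stated assumptions: the helper name `count_errors` is hypothetical, and the regular expression simply assumes that OpenSM error lines contain the text `ERR nnnn:` as they do in the quoted entries.

```python
import re
from collections import Counter

# Assumption: error lines carry "ERR <4 hex digits>:" as in the quoted log.
ERR_RE = re.compile(r"ERR ([0-9A-Fa-f]{4}):")


def count_errors(log_text):
    """Return a Counter mapping each ERR code found in log_text to its count."""
    return Counter(ERR_RE.findall(log_text))


# Three entries copied from the log quoted in this thread.
sample = (
    "Apr 13 13:41:57 018570 [47FCE940] 0x01 -> umad_receiver: ERR 5409: "
    "send completed with error (method=0x1 attr=0x11 trans_id=0x110000123b) -- dropping\n"
    "Apr 13 13:41:57 018586 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR SMP Hop Ptr: 0x0\n"
    "Apr 13 13:41:57 018608 [47FCE940] 0x01 -> __osm_sm_mad_ctrl_send_err_cb: "
    "ERR 3113: MAD completed in error (IB_TIMEOUT)\n"
)

counts = count_errors(sample)
for code, n in sorted(counts.items()):
    print(code, n)
```

A steady count of ERR 3113 timeouts, one per sweep, would be consistent with a single next-hop device that never answers the NodeInfo query.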
-- Hal > Apr 13 13:41:57 018681 [475CD940] 0x80 -> Entering MASTER state > Apr 13 13:41:57 019791 [475CD940] 0x80 -> SUBNET UP > Apr 13 13:42:06 986336 [47FCE940] 0x01 -> umad_receiver: ERR 5409: > send completed with error (method=0x1 attr=0x11 trans_id=0x1100001242) > -- dropping > Apr 13 13:42:06 986349 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR > SMP Hop Ptr: 0x0 > Apr 13 13:42:06 986355 [47FCE940] 0x01 -> Received SMP on a 1 hop path: >                                Initial path = 0,0 >                                Return path  = 0,0 > Apr 13 13:42:06 986360 [47FCE940] 0x01 -> > __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error > (IB_TIMEOUT) > Apr 13 13:42:06 986376 [47FCE940] 0x01 -> SMP dump: >                                base_ver................0x1 >                                mgmt_class..............0x81 >                                class_ver...............0x1 >                                method..................0x1 (SubnGet) >                                D bit...................0x0 >                                status..................0x0 >                                hop_ptr.................0x0 >                                hop_count...............0x1 >                                trans_id................0x1242 >                                attr_id.................0x11 (NodeInfo) >                                resv....................0x0 >                                attr_mod................0x0 >                                m_key...................0x0000000000000000 >                                dr_slid.................65535 >                                dr_dlid.................65535 > >                                Initial path: 0,1 >                                Return path:  0,0 >                                Reserved:     [0][0][0][0][0][0][0] > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 
00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > > Apr 13 13:42:06 986708 [475CD940] 0x02 -> SUBNET UP > Apr 13 13:42:16 990103 [47FCE940] 0x01 -> umad_receiver: ERR 5409: > send completed with error (method=0x1 attr=0x11 trans_id=0x1100001246) > -- dropping > Apr 13 13:42:16 990114 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR > SMP Hop Ptr: 0x0 > Apr 13 13:42:16 990120 [47FCE940] 0x01 -> Received SMP on a 1 hop path: >                                Initial path = 0,0 >                                Return path  = 0,0 > Apr 13 13:42:16 990125 [47FCE940] 0x01 -> > __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error > (IB_TIMEOUT) > Apr 13 13:42:16 990141 [47FCE940] 0x01 -> SMP dump: >                                base_ver................0x1 >                                mgmt_class..............0x81 >                                class_ver...............0x1 >                                method..................0x1 (SubnGet) >                                D bit...................0x0 >                                status..................0x0 >                                hop_ptr.................0x0 >                                hop_count...............0x1 >                                trans_id................0x1246 >                                attr_id.................0x11 (NodeInfo) >                                resv....................0x0 >                                attr_mod................0x0 >                                m_key...................0x0000000000000000 >                                dr_slid.................65535 >                                dr_dlid.................65535 > >                                Initial path: 0,1 >                                Return path:  0,0 >                                Reserved:     
[0][0][0][0][0][0][0] > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > > Apr 13 13:42:16 990475 [475CD940] 0x02 -> SUBNET UP > Apr 13 13:42:26 990871 [47FCE940] 0x01 -> umad_receiver: ERR 5409: > send completed with error (method=0x1 attr=0x11 trans_id=0x110000124a) > -- dropping > Apr 13 13:42:26 990884 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR > SMP Hop Ptr: 0x0 > Apr 13 13:42:26 990890 [47FCE940] 0x01 -> Received SMP on a 1 hop path: >                                Initial path = 0,0 >                                Return path  = 0,0 > Apr 13 13:42:26 990895 [47FCE940] 0x01 -> > __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error > (IB_TIMEOUT) > Apr 13 13:42:26 990912 [47FCE940] 0x01 -> SMP dump: >                                base_ver................0x1 >                                mgmt_class..............0x81 >                                class_ver...............0x1 >                                method..................0x1 (SubnGet) >                                D bit...................0x0 >                                status..................0x0 >                                hop_ptr.................0x0 >                                hop_count...............0x1 >                                trans_id................0x124a >                                attr_id.................0x11 (NodeInfo) >                                resv....................0x0 >                                attr_mod................0x0 >                                m_key...................0x0000000000000000 >                                dr_slid.................65535 >                                dr_dlid.................65535 > >       
                         Initial path: 0,1 >                                Return path:  0,0 >                                Reserved:     [0][0][0][0][0][0][0] > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > > Apr 13 13:42:26 991227 [475CD940] 0x02 -> SUBNET UP > Apr 13 13:42:36 993638 [47FCE940] 0x01 -> umad_receiver: ERR 5409: > send completed with error (method=0x1 attr=0x11 trans_id=0x110000124e) > -- dropping > Apr 13 13:42:36 993649 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR > SMP Hop Ptr: 0x0 > Apr 13 13:42:36 993655 [47FCE940] 0x01 -> Received SMP on a 1 hop path: >                                Initial path = 0,0 >                                Return path  = 0,0 > Apr 13 13:42:36 993660 [47FCE940] 0x01 -> > __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error > (IB_TIMEOUT) > Apr 13 13:42:36 993676 [47FCE940] 0x01 -> SMP dump: >                                base_ver................0x1 >                                mgmt_class..............0x81 >                                class_ver...............0x1 >                                method..................0x1 (SubnGet) >                                D bit...................0x0 >                                status..................0x0 >                                hop_ptr.................0x0 >                                hop_count...............0x1 >                                trans_id................0x124e >                                attr_id.................0x11 (NodeInfo) >                                resv....................0x0 >                                attr_mod................0x0 >                                
m_key...................0x0000000000000000 >                                dr_slid.................65535 >                                dr_dlid.................65535 > >                                Initial path: 0,1 >                                Return path:  0,0 >                                Reserved:     [0][0][0][0][0][0][0] > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > > Apr 13 13:42:36 993996 [475CD940] 0x02 -> SUBNET UP > Apr 13 13:42:46 996409 [47FCE940] 0x01 -> umad_receiver: ERR 5409: > send completed with error (method=0x1 attr=0x11 trans_id=0x1100001252) > -- dropping > Apr 13 13:42:46 996420 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR > SMP Hop Ptr: 0x0 > Apr 13 13:42:46 996426 [47FCE940] 0x01 -> Received SMP on a 1 hop path: >                                Initial path = 0,0 >                                Return path  = 0,0 > Apr 13 13:42:46 996431 [47FCE940] 0x01 -> > __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error > (IB_TIMEOUT) > Apr 13 13:42:46 996449 [47FCE940] 0x01 -> SMP dump: >                                base_ver................0x1 >                                mgmt_class..............0x81 >                                class_ver...............0x1 >                                method..................0x1 (SubnGet) >                                D bit...................0x0 >                                status..................0x0 >                                hop_ptr.................0x0 >                                hop_count...............0x1 >                                trans_id................0x1252 >                                attr_id.................0x11 (NodeInfo) >             
                   resv....................0x0 >                                attr_mod................0x0 >                                m_key...................0x0000000000000000 >                                dr_slid.................65535 >                                dr_dlid.................65535 > >                                Initial path: 0,1 >                                Return path:  0,0 >                                Reserved:     [0][0][0][0][0][0][0] > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > > Apr 13 13:42:46 996800 [475CD940] 0x02 -> SUBNET UP > Apr 13 13:42:56 999180 [47FCE940] 0x01 -> umad_receiver: ERR 5409: > send completed with error (method=0x1 attr=0x11 trans_id=0x1100001256) > -- dropping > Apr 13 13:42:56 999192 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR > SMP Hop Ptr: 0x0 > Apr 13 13:42:56 999198 [47FCE940] 0x01 -> Received SMP on a 1 hop path: >                                Initial path = 0,0 >                                Return path  = 0,0 > Apr 13 13:42:56 999203 [47FCE940] 0x01 -> > __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error > (IB_TIMEOUT) > Apr 13 13:42:56 999220 [47FCE940] 0x01 -> SMP dump: >                                base_ver................0x1 >                                mgmt_class..............0x81 >                                class_ver...............0x1 >                                method..................0x1 (SubnGet) >                                D bit...................0x0 >                                status..................0x0 >                                hop_ptr.................0x0 >                                hop_count...............0x1 >         
                       trans_id................0x1256 >                                attr_id.................0x11 (NodeInfo) >                                resv....................0x0 >                                attr_mod................0x0 >                                m_key...................0x0000000000000000 >                                dr_slid.................65535 >                                dr_dlid.................65535 > >                                Initial path: 0,1 >                                Return path:  0,0 >                                Reserved:     [0][0][0][0][0][0][0] > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > > Apr 13 13:42:56 999553 [475CD940] 0x02 -> SUBNET UP > Apr 13 13:43:07 001949 [47FCE940] 0x01 -> umad_receiver: ERR 5409: > send completed with error (method=0x1 attr=0x11 trans_id=0x110000125a) > -- dropping > Apr 13 13:43:07 001963 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR > SMP Hop Ptr: 0x0 > Apr 13 13:43:07 001969 [47FCE940] 0x01 -> Received SMP on a 1 hop path: >                                Initial path = 0,0 >                                Return path  = 0,0 > Apr 13 13:43:07 001975 [47FCE940] 0x01 -> > __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error > (IB_TIMEOUT) > Apr 13 13:43:07 001992 [47FCE940] 0x01 -> SMP dump: >                                base_ver................0x1 >                                mgmt_class..............0x81 >                                class_ver...............0x1 >                                method..................0x1 (SubnGet) >                                D bit...................0x0 >                                
status..................0x0 >                                hop_ptr.................0x0 >                                hop_count...............0x1 >                                trans_id................0x125a >                                attr_id.................0x11 (NodeInfo) >                                resv....................0x0 >                                attr_mod................0x0 >                                m_key...................0x0000000000000000 >                                dr_slid.................65535 >                                dr_dlid.................65535 > >                                Initial path: 0,1 >                                Return path:  0,0 >                                Reserved:     [0][0][0][0][0][0][0] > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > > Apr 13 13:43:07 002384 [475CD940] 0x02 -> SUBNET UP > Apr 13 13:43:17 004713 [47FCE940] 0x01 -> umad_receiver: ERR 5409: > send completed with error (method=0x1 attr=0x11 trans_id=0x110000125e) > -- dropping > Apr 13 13:43:17 004727 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR > SMP Hop Ptr: 0x0 > Apr 13 13:43:17 004733 [47FCE940] 0x01 -> Received SMP on a 1 hop path: >                                Initial path = 0,0 >                                Return path  = 0,0 > Apr 13 13:43:17 004738 [47FCE940] 0x01 -> > __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error > (IB_TIMEOUT) > Apr 13 13:43:17 004755 [47FCE940] 0x01 -> SMP dump: >                                base_ver................0x1 >                                mgmt_class..............0x81 >                                class_ver...............0x1 >                       
         method..................0x1 (SubnGet) >                                D bit...................0x0 >                                status..................0x0 >                                hop_ptr.................0x0 >                                hop_count...............0x1 >                                trans_id................0x125e >                                attr_id.................0x11 (NodeInfo) >                                resv....................0x0 >                                attr_mod................0x0 >                                m_key...................0x0000000000000000 >                                dr_slid.................65535 >                                dr_dlid.................65535 > >                                Initial path: 0,1 >                                Return path:  0,0 >                                Reserved:     [0][0][0][0][0][0][0] > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > > Apr 13 13:43:17 005140 [475CD940] 0x02 -> SUBNET UP > Apr 13 13:43:27 007482 [47FCE940] 0x01 -> umad_receiver: ERR 5409: > send completed with error (method=0x1 attr=0x11 trans_id=0x1100001262) > -- dropping > Apr 13 13:43:27 007497 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR > SMP Hop Ptr: 0x0 > Apr 13 13:43:27 007503 [47FCE940] 0x01 -> Received SMP on a 1 hop path: >                                Initial path = 0,0 >                                Return path  = 0,0 > Apr 13 13:43:27 007508 [47FCE940] 0x01 -> > __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error > (IB_TIMEOUT) > Apr 13 13:43:27 007524 [47FCE940] 0x01 -> SMP dump: >                                base_ver................0x1 >     
                           mgmt_class..............0x81 >                                class_ver...............0x1 >                                method..................0x1 (SubnGet) >                                D bit...................0x0 >                                status..................0x0 >                                hop_ptr.................0x0 >                                hop_count...............0x1 >                                trans_id................0x1262 >                                attr_id.................0x11 (NodeInfo) >                                resv....................0x0 >                                attr_mod................0x0 >                                m_key...................0x0000000000000000 >                                dr_slid.................65535 >                                dr_dlid.................65535 > >                                Initial path: 0,1 >                                Return path:  0,0 >                                Reserved:     [0][0][0][0][0][0][0] > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > > Apr 13 13:43:27 007958 [475CD940] 0x02 -> SUBNET UP > Apr 13 13:43:37 010250 [47FCE940] 0x01 -> umad_receiver: ERR 5409: > send completed with error (method=0x1 attr=0x11 trans_id=0x1100001266) > -- dropping > Apr 13 13:43:37 010264 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR > SMP Hop Ptr: 0x0 > Apr 13 13:43:37 010270 [47FCE940] 0x01 -> Received SMP on a 1 hop path: >                                Initial path = 0,0 >                                Return path  = 0,0 > Apr 13 13:43:37 010275 [47FCE940] 0x01 -> > __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in 
error > (IB_TIMEOUT) > Apr 13 13:43:37 010292 [47FCE940] 0x01 -> SMP dump: >                                base_ver................0x1 >                                mgmt_class..............0x81 >                                class_ver...............0x1 >                                method..................0x1 (SubnGet) >                                D bit...................0x0 >                                status..................0x0 >                                hop_ptr.................0x0 >                                hop_count...............0x1 >                                trans_id................0x1266 >                                attr_id.................0x11 (NodeInfo) >                                resv....................0x0 >                                attr_mod................0x0 >                                m_key...................0x0000000000000000 >                                dr_slid.................65535 >                                dr_dlid.................65535 > >                                Initial path: 0,1 >                                Return path:  0,0 >                                Reserved:     [0][0][0][0][0][0][0] > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > > Apr 13 13:43:37 010716 [475CD940] 0x02 -> SUBNET UP > Apr 13 13:43:47 013017 [47FCE940] 0x01 -> umad_receiver: ERR 5409: > send completed with error (method=0x1 attr=0x11 trans_id=0x110000126a) > -- dropping > Apr 13 13:43:47 013029 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR > SMP Hop Ptr: 0x0 > Apr 13 13:43:47 013035 [47FCE940] 0x01 -> Received SMP on a 1 hop path: >                                Initial path = 0,0 >              
                  Return path  = 0,0 > Apr 13 13:43:47 013059 [47FCE940] 0x01 -> > __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error > (IB_TIMEOUT) > Apr 13 13:43:47 013077 [47FCE940] 0x01 -> SMP dump: >                                base_ver................0x1 >                                mgmt_class..............0x81 >                                class_ver...............0x1 >                                method..................0x1 (SubnGet) >                                D bit...................0x0 >                                status..................0x0 >                                hop_ptr.................0x0 >                                hop_count...............0x1 >                                trans_id................0x126a >                                attr_id.................0x11 (NodeInfo) >                                resv....................0x0 >                                attr_mod................0x0 >                                m_key...................0x0000000000000000 >                                dr_slid.................65535 >                                dr_dlid.................65535 > >                                Initial path: 0,1 >                                Return path:  0,0 >                                Reserved:     [0][0][0][0][0][0][0] > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >                                00 00 00 00 00 00 00 00   00 00 00 00 > 00 00 00 00 > >> >> In order for SA portion of SM to work, SM node must be a full member >> of the default partition and other nodes must be at least limited >> members (so their queries will be responded to). IPoIB is not needed >> on that partition. > > I think I've got the partition file specified correctly... 
but then > again obviously not, as it doesn't work. > > Thanks, > > Chris > From perkinjo at cse.ohio-state.edu Mon Apr 13 14:39:08 2009 From: perkinjo at cse.ohio-state.edu (Jonathan Perkins) Date: Mon, 13 Apr 2009 17:39:08 -0400 Subject: [ofa-general] non-gcc mpitests seem to compile with gcc compileflags In-Reply-To: <49DB587E.2060508@clustervision.com> References: <49DB587E.2060508@clustervision.com> Message-ID: <20090413213907.GK5186@cse.ohio-state.edu> Guido: Can you try adding -noswitcherror to the CFLAGS? This should keep pgi from choking on these gcc flags. On Tue, Apr 07, 2009 at 03:43:26PM +0200, Guido Passet wrote: > Dear list, > > I could be wrong but it looks like the non-gcc mpitests programs are > using incorrect compile flags. > > > Running rpm -iv > /RPMS/sl-release-5.3-1.x86_64/x86_64/mpitests_mvapich_gcc-3.1-891.x86_64.rpm > Build mpitests_mvapich_pgi RPM > Running LDFLAGS='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 > -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 > -mtune=generic -L/ofed/1.4.1-rc3/lib64 -L/ofed/1.4.1-rc3/lib' > CFLAGS='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions > -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic > -I/ofed/1.4.1-rc3/include' CPPFLAGS='-O2 -g -pipe -Wall > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector > --param=ssp-buffer-size=4 -m64 -mtune=generic -I/ofed/1.4.1-rc3/include' > rpmbuild --rebuild --define '_topdir /var/tmp/OFED_topdir' --define > 'dist %{nil}' --target x86_64 --define '_name mpitests_mvapich_pgi' > --define 'root_path /' --define '_usr /ofed/1.4.1-rc3' --define > 'path_to_mpihome /ofed/1.4.1-rc3/mpi/pgi/mvapich-1.1.0' > /SRPMS/mpitests-3.1-891.src.rpm > > > /ofed/1.4.1-rc3/mpi/pgi/mvapich-1.1.0/bin/mpicc > -I/ofed/1.4.1-rc3/mpi/pgi/mvapich-1.1.0/include -DMPI1 -O3 -c > IMB_cpu_exploit.c > /ofed/1.4.1-rc3/mpi/pgi/mvapich-1.1.0/bin/mpicc -o IMB-MPI1 IMB.o > IMB_declare.o IMB_init.o IMB_mem_manager.o IMB_parse_name_mpi1.o > IMB_benchlist.o 
IMB_strgs.o IMB_err_handler.o IMB_g_info.o > IMB_warm_up.o IMB_output.o IMB_pingpong.o IMB_pingping.o IMB_allreduce.o > IMB_reduce_scatter.o IMB_reduce.o IMB_exchange.o IMB_bcast.o > IMB_barrier.o IMB_allgather.o IMB_allgatherv.o IMB_gather.o > IMB_gatherv.o IMB_scatter.o IMB_scatterv.o IMB_alltoall.o > IMB_alltoallv.o IMB_sendrecv.o IMB_init_transfer.o IMB_chk_diff.o > IMB_cpu_exploit.o > make[2]: Leaving directory > `/var/tmp/OFED_topdir/BUILD/mpitests-3.1/IMB-3.1/src' > make[1]: Leaving directory > `/var/tmp/OFED_topdir/BUILD/mpitests-3.1/IMB-3.1/src' > cd /var/tmp/OFED_topdir/BUILD/mpitests-3.1/osu_benchmarks-3.0 && make > MPIHOME=/ofed/1.4.1-rc3/mpi/pgi/mvapich-1.1.0 > make[1]: Entering directory > `/var/tmp/OFED_topdir/BUILD/mpitests-3.1/osu_benchmarks-3.0' > /ofed/1.4.1-rc3/mpi/pgi/mvapich-1.1.0/bin/mpicc -O2 -g -pipe -Wall > -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector > --param=ssp-buffer-size=4 -m64 -mtune=generic -I/ofed/1.4.1-rc3/include > -c -o osu_bcast.o osu_bcast.c > pgcc-Error-Unknown switch: -pipe > pgcc-Error-Unknown switch: -Wall > pgcc-Error-Unknown switch: -Wp,-D_FORTIFY_SOURCE=2 > pgcc-Error-Unknown switch: -fexceptions > pgcc-Error-Unknown switch: -fstack-protector > pgcc-Error-Unknown switch: --param=ssp-buffer-size=4 > pgcc-Error-Unknown switch: -m64 > pgcc-Error-Unknown switch: -mtune=generic > make[1]: *** [osu_bcast.o] Error 1 > make[1]: Leaving directory > `/var/tmp/OFED_topdir/BUILD/mpitests-3.1/osu_benchmarks-3.0' > make: *** [osu] Error 2 > error: Bad exit status from /var/tmp/rpm-tmp.25349 (%install) > > > > Cheers, > Guido. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 197 bytes Desc: not available URL: From worleys at gmail.com Mon Apr 13 14:50:49 2009 From: worleys at gmail.com (Chris Worley) Date: Mon, 13 Apr 2009 15:50:49 -0600 Subject: ***SPAM*** Re: [ofa-general] Any easy way to specify to the SM to route/zone? In-Reply-To: References: Message-ID: On Mon, Apr 13, 2009 at 3:01 PM, Hal Rosenstock wrote: > On Mon, Apr 13, 2009 at 4:09 PM, Chris Worley wrote: >> On Mon, Apr 13, 2009 at 12:52 PM, Hal Rosenstock >> wrote: >>> On Mon, Apr 13, 2009 at 2:26 PM, Chris Worley wrote: >>>> On Mon, Apr 13, 2009 at 11:53 AM, Hal Rosenstock >>>> wrote: >>>>> On Mon, Apr 13, 2009 at 12:02 PM, Chris Worley wrote: >>>>>> On Mon, Apr 13, 2009 at 7:43 AM, Hal Rosenstock wrote: >>>>>>> On Mon, Apr 13, 2009 at 9:37 AM, Chris Worley wrote: >>>>>>>> On Mon, Apr 13, 2009 at 5:39 AM, Hal Rosenstock >>>>>>>> wrote: >>>>>>>>> On Sun, Apr 12, 2009 at 11:01 PM, Chris Worley wrote: >>>>>>>>>> >>>>>>>>>> So I need to tell the SM to route specific ports on the server/target >>>>>>>>>> to specific clients/initiators. >>>>>>>>>> >>>>>>>>>> Is there any way to do this? >>>>>>>>> >>>>>>>>> Do you mean restrict access between certain clients/servers ? >>>>>>>> >>>>>>>> One server w/ 4QDR boards, 16 clients with one QDR board.  I want each >>>>>>>> port on the server routed/zoned to two clients. >>>>>>>> >>>>>>>>> If so, >>>>>>>>> you can do this with partitioning >>>>>>>> >>>>>>>> What is partitioning? >>>>>>> >>>>>>> A partition is a collection of ports which are allowed to communicate >>>>>>> together. There are two forms of members: full members which can talk >>>>>>> to any other member (useful for servers) and limited members which can >>>>>>> only talk to full members (useful for clients). See the opensm man >>>>>>> page or partition-config.txt on setting this up for OpenSM. >>>>>>> >>>>>> >>>>>> Let me see if I understand this with a simple example... 
my port GUIDs >>>>>> (as reported by ibstat) are for one server (4 QDR ports) and four >>>>>> clients (one QDR port each): >>>>>> >>>>>> >>>>>> Server A:           Port GUID: 0x0024717124000029 >>>>>> Server B:           Port GUID: 0x002471712400002a >>>>>> Server C:           Port GUID: 0x0024717127000035 >>>>>> Server D:           Port GUID: 0x0024717127000036 >>>>>> >>>>>> Client 1:                Port GUID: 0x0002c90300028c01 >>>>>> Client 2:                Port GUID: 0x0002c90300026047 >>>>>> Client 3:                Port GUID: 0x0002c90300026053 >>>>>> Client 4:                Port GUID: 0x0002c9030002603b > > Is there a switch in between or just back to back HCA ports ? Yes, there's a switch; it's not directly connected from port to port. In the end, there will be 2 or 4 clients per server port (this simple configuration is just to get me going), so a switch is needed. > >>>>>> >>>>>> Assuming I want a 1:1 (one server port to one client) partitioning, I >>>>>> would put the following in /etc/ofed/partitions.conf: >>>>>> >>>>>> part1=0x1, ipoib, defmember=full : 0x0024717124000029, 0x0002c90300028c01; >>>>>> part2=0x2, ipoib, defmember=full : 0x002471712400002a, 0x0002c90300026047; >>>>>> part3=0x3, ipoib, defmember=full : 0x0024717127000035, 0x0002c90300026053; >>>>>> part4=0x4, ipoib, defmember=full : 0x0024717127000036, 0x0002c9030002603b; >>>>> >>>>> So you want IPoIB. >>>> >>>> I'm doing SRP, so I need IPoIB working. >>> >>> SRP needs to query PathRecord with the correct PKey and use the >>> correct Pkey index for that partition. I'm not sure how that is done >>> in SRP but first IPoIB needs to be made to work (again). >>> >> >> Okay... I'll setup the IPoIB as the ipoib.txt suggests, i.e.: >> >> echo 0x1 > /sys/class/net/ib0/create_child >> >> ... but for now, I'm still not seeing the state go to "up"... I think >> that's the first problem. > > Yes, port state needs to be linkup/active first. I see LinkUp/Armed from below. > >>>>> >>>>>> ... 
and run w/:
>>>>>>
>>>>>> opensm -r -B -P/etc/ofed/partitions.conf
>>>
>>> Also, do you need to use -r ? It's better not to (reassign LIDs).
>>
>> I'm using it to assure that it just doesn't hang on to the old state,
>> especially since I'm not getting the SM working...
>
> OK.
>
>> I don't want it to
>> assume anything is right about the previous state.
>>
>> I have tried w/ and w/o and don't see a difference.
>>
>> The plan is, once I get it working, to remove the "-r".
>
> That's fine.
>
>>  Or, are you suggesting I not use it?
>>
>>>
>>>>>> Does that sound correct?  It doesn't work
>>>>>
>>>>> What application(s) aren't working ?
>>>>
>>>> ping over IPoIB, for example.
>>>>
>>>> I am seeing the test node in an "initializing" state right now... I
>>>> thought it was "up" before.
>>>
>>> Yes, this has gone "backwards" (not as far along yet...)
>>>
>>
>> I think getting to an "up" state is the first step.
>
> Were the ports getting to LinkUp/Active before partitions were configured ?

Yes, before I started trying to partition, all the nodes could
communicate... except they'd all use just one port on the server and I
couldn't get the throughput I needed.

>
>>>>> Any SM error messages ?
>>>>
>>>> The server has one klogd error coming out continuously:
>>>>
>>>> ib0: multicast join failed for
>>>> ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22
>>>
>>> IPoIB broadcast group (on the default partition) can't be joined (I'm
>>> presuming due to the current partition setup (e.g. it worked prior to
>>> this, right ?)).
>>>
>>> You need to do some IPoIB configuration relative to partitions as well.
>>> See kernel Documentation/infiniband/ipoib.txt for help with this.
>>>
>>
>> Will do.  As you say, the trick will be getting SRP to use the right
>> P_Keys... but I need to get the IB in an "up" state first.
>>
>>
>>>>> Which server ?
>>>>
>>>> There's only one server... it has many ports for which I'm trying to
>>>> partition to different clients.
So, in the above, when I say "Server >>>> A", I mean server port "A". >>> >>> I meant which server port is running OpenSM (which GUID is being >>> used). I see above it is 0x24717124000029 >> >> That was it.  I've switched to a client as the SM now, as you suggest >> a stand-alone SM. > > So it's no longer a client in the ULP sense, right ? It is just being used for SM now. > >>> >>>>> You still need the default partition with the SM node being full and >>>>> the others being limited there (so it's also best to run SM on >>>>> separate node if possible otherwise you have the potential of any >>>>> client connecting to it on default partition). >>>> >>>> Are you saying to change the partitions.conf file to: >>>> >>>> part1=0x1, ipoib: 0x0024717124000029=full, 0x0002c90300028c01; >>>> part2=0x2, ipoib: 0x002471712400002a=full, 0x0002c90300026047; >>>> part3=0x3, ipoib: 0x0024717127000035=full, 0x0002c90300026053; >>>> part4=0x4, ipoib: 0x0024717127000036=full, 0x0002c9030002603b; >>> >>> That's part of it. >>> >>>> ... (which still doesn't work) in which case I set all the server's >>>> ports to "full", or should just one be "full" (which didn't work >>>> either)? >>> >>> You also need: >>> Default=0x7fff: ALL, SELF=FULL; >>> I would put that first. >> >> So, now my /etc/ofed/partitions.conf file looks like: >> >> Default=0x7fff: ALL, SELF=FULL; >> part1=0x1, ipoib: 0x0002c903000292af=full, 0x0002c90300028c01; >> part2=0x2, ipoib: 0x0002c903000292b0=full, 0x0002c90300026047; >> part4=0x4, ipoib: 0x0024717124000029=full, 0x0002c9030002603b; > >> ... I pulled out the node on partition 3 to use as an SM exclusive >> node, I also changed the server ports to some of the other IB ports on >> that machine (port GUIDs as shown by ibstat).  I set the server port >> GUID's to "full", as I want the client GUIDs to talk to it, but not >> necessarily each other (as there is only one client GUID on each >> partition now, it's a moot point). 
>> >> Note that I made-up the partition P_Key's of 1, 2, and 4. > > This all looks/sounds fine to me. :( > >> Note that it still doesn't work.  On the stand-alone SM, ibstat looks like: >> >> # ibstat >> CA 'mlx4_0' >>        CA type: MT26428 >>        Number of ports: 2 >>        Firmware version: 2.6.0 >>        Hardware version: a0 >>        Node GUID: 0x0002c90300026052 >>        System image GUID: 0x0002c90300026055 >>        Port 1: >>                State: Armed >>                Physical state: LinkUp >>                Rate: 10 >>                Base lid: 1 >>                LMC: 0 >>                SM lid: 1 >>                Capability mask: 0x0251086a >>                Port GUID: 0x0002c90300026053 >>        Port 2: >>                State: Down >>                Physical state: Polling >>                Rate: 10 >>                Base lid: 0 >>                LMC: 0 >>                SM lid: 0 >>                Capability mask: 0x02510868 >>                Port GUID: 0x0002c90300026054 > > What's at the other end of port 1 ? Would you do smpquery portinfo for > this HCA port and it's peer port ? > >> ... 
On the server, the devices mentioned in the partitions file look like: >> >> CA 'mlx4_0' >>        CA type: MT25418 >>        Number of ports: 2 >>        Firmware version: 2.6.0 >>        Hardware version: a0 >>        Node GUID: 0x0024717124000028 >>        System image GUID: 0x002471712400002b >>        Port 1: >>                State: Initializing >>                Physical state: LinkUp >>                Rate: 10 >>                Base lid: 0 >>                LMC: 0 >>                SM lid: 0 >>                Capability mask: 0x02510868 >>                Port GUID: 0x0024717124000029 >>        Port 2: >>                State: Initializing >>                Physical state: LinkUp >>                Rate: 10 >>                Base lid: 0 >>                LMC: 0 >>                SM lid: 0 >>                Capability mask: 0x02510868 >>                Port GUID: 0x002471712400002a >> CA 'mlx4_1' >>        CA type: MT26428 >>        Number of ports: 2 >>        Firmware version: 2.6.0 >>        Hardware version: a0 >>        Node GUID: 0x0002c903000292ae >>        System image GUID: 0x0002c903000292b1 >>        Port 1: >>                State: Initializing >>                Physical state: LinkUp >>                Rate: 10 >>                Base lid: 0 >>                LMC: 0 >>                SM lid: 0 >>                Capability mask: 0x02510868 >>                Port GUID: 0x0002c903000292af >>        Port 2: >>                State: Initializing >>                Physical state: LinkUp >>                Rate: 10 >>                Base lid: 0 >>                LMC: 0 >>                SM lid: 0 >>                Capability mask: 0x02510868 >>                Port GUID: 0x0002c903000292b0 > > So no SM initialization is occurring there since they are still just in Init. Correct. But, the SM is running. 
> >> On one of the clients: >> >> # ibstat >> CA 'mlx4_0' >>        CA type: MT26428 >>        Number of ports: 2 >>        Firmware version: 2.6.0 >>        Hardware version: a0 >>        Node GUID: 0x0002c90300026046 >>        System image GUID: 0x0002c90300026049 >>        Port 1: >>                State: Initializing >>                Physical state: LinkUp >>                Rate: 10 >>                Base lid: 7 >>                LMC: 0 >>                SM lid: 1 >>                Capability mask: 0x02510868 >>                Port GUID: 0x0002c90300026047 >>        Port 2: >>                State: Down >>                Physical state: Polling >>                Rate: 10 >>                Base lid: 0 >>                LMC: 0 >>                SM lid: 0 >>                Capability mask: 0x02510868 >>                Port GUID: 0x0002c90300026048 > > Ditto. Down means it's likely a port that is not connected. > >> Partition "part2" with P_Key=2 should connect this client's port 0 to >> the sever on port 1 of mlx4_1 > > Do you really mean port 0 ? Nope... in this case I have 0x0002c903000292b0 in part2 in my partitions file, which is port 1, the second port of the adapter. I'm hoping to use both ports of all adapters on the server. > >>> >>>> I did have a difficult time understanding the difference between >>>> "full" and "limited" in the man page. >>> >>> On a given partition, full can talk with all other members whereas a >>> limited member can only talk with full members (not other limited >>> members). >>> >> >> I think I've got that correctly specified in the above partitions file. >> >>>> I've got a captive network, so I don't want any paths I've not >>>> specified to be allowed.  If that makes any sense.  So, I didn't want >>>> to put a statement in like: >>>> >>>> Default=0x7fff,ipoib:ALL=full; >>>> >>>> ... that would let a rogue node slip through the cracks. 
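[Editorial sketch] The full/limited rule quoted above can be written down in a few lines of C. This is a minimal illustration, not OpenSM or kernel code: the helper names (pkey_make, pkey_is_full, pkeys_can_talk) are hypothetical. It relies only on the InfiniBand convention that the top bit of the 16-bit P_Key marks full membership and the low 15 bits identify the partition, which is why 0x2 and 0x8002 name the same partition with different membership.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PKEY_FULL_BIT  0x8000u  /* top bit set: full member; clear: limited */
#define PKEY_BASE_MASK 0x7fffu  /* low 15 bits: partition number */

/* Hypothetical helper: build a P_Key from a partition number and membership. */
static uint16_t pkey_make(uint16_t base, bool full)
{
    assert(base != 0 && base <= PKEY_BASE_MASK);
    return (uint16_t)(base | (full ? PKEY_FULL_BIT : 0u));
}

static bool pkey_is_full(uint16_t pkey)
{
    return (pkey & PKEY_FULL_BIT) != 0;
}

/* Two ports can exchange traffic on a partition only if the partition
 * numbers match and at least one side is a full member. */
static bool pkeys_can_talk(uint16_t a, uint16_t b)
{
    if ((a & PKEY_BASE_MASK) != (b & PKEY_BASE_MASK))
        return false;
    return pkey_is_full(a) || pkey_is_full(b);
}
```

Under this rule a limited client (0x0002) can reach a full server port (0x8002), two limited clients on the same partition cannot reach each other, and ports on different partitions never match.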
>>> >>> The only one they can talk with is the SM (the way I'm proposing) so >>> it's best if the SM node could be separate. >> >> It's separate now.  The log looks like (in its entirety at statup): >> >> Apr 13 13:41:56 182699 [1D71CA30] 0x03 -> OpenSM 3.2.5_20081207 >> Apr 13 13:41:56 182764 [1D71CA30] 0x80 -> OpenSM 3.2.5_20081207 >> Apr 13 13:41:56 183020 [1D71CA30] 0x02 -> osm_vendor_init: 1000 >> pending umads specified >> Apr 13 13:41:56 183104 [1D71CA30] 0x80 -> Entering DISCOVERING state >> Apr 13 13:41:56 193181 [1D71CA30] 0x02 -> osm_vendor_bind: Binding to >> port 0x2c90300026053 >> Apr 13 13:41:56 217349 [1D71CA30] 0x02 -> osm_vendor_bind: Binding to >> port 0x2c90300026053 >> Apr 13 13:41:57 018570 [47FCE940] 0x01 -> umad_receiver: ERR 5409: >> send completed with error (method=0x1 attr=0x11 trans_id=0x110000123b) >> -- dropping >> Apr 13 13:41:57 018586 [47FCE940] 0x01 -> umad_receiver: ERR 5411: DR >> SMP Hop Ptr: 0x0 >> Apr 13 13:41:57 018603 [47FCE940] 0x01 -> Received SMP on a 1 hop path: >>                                Initial path = 0,0 >>                                Return path  = 0,0 >> Apr 13 13:41:57 018608 [47FCE940] 0x01 -> >> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error >> (IB_TIMEOUT) >> Apr 13 13:41:57 018626 [47FCE940] 0x01 -> SMP dump: >>                                base_ver................0x1 >>                                mgmt_class..............0x81 >>                                class_ver...............0x1 >>                                method..................0x1 (SubnGet) >>                                D bit...................0x0 >>                                status..................0x0 >>                                hop_ptr.................0x0 >>                                hop_count...............0x1 >>                                trans_id................0x123b >>                                attr_id.................0x11 (NodeInfo) >>                                
resv....................0x0 >>                                attr_mod................0x0 >>                                m_key...................0x0000000000000000 >>                                dr_slid.................65535 >>                                dr_dlid.................65535 >> >>                                Initial path: 0,1 >>                                Return path:  0,0 >>                                Reserved:     [0][0][0][0][0][0][0] >> >>                                00 00 00 00 00 00 00 00   00 00 00 00 >> 00 00 00 00 >> >>                                00 00 00 00 00 00 00 00   00 00 00 00 >> 00 00 00 00 >> >>                                00 00 00 00 00 00 00 00   00 00 00 00 >> 00 00 00 00 >> >>                                00 00 00 00 00 00 00 00   00 00 00 00 >> 00 00 00 00 > > This is the first level problem. Some SMA is not responding to a > NodeInfo query from the SM. Whatever is the next hop from the SM port > appears not to be responding. You may need to reboot that device or > otherwise reset it to see if this clears this issue. After power-cycling the switch, the ports went "active"! Note that I didn't restart the SM... I just left it running. So, on one client... the one corresponding to "part2" in the partitions file, I put the P_Key into the "create child": echo 0x2 > /sys/class/net/ib0/create_child ... and did likewise on the host, for ib3 (the second port on the second adapter): echo 0x2 > /sys/class/net/ib3/create_child Still, no ping (the interfaces are setup correctly). Thanks, Chris From hal.rosenstock at gmail.com Mon Apr 13 15:24:04 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 13 Apr 2009 18:24:04 -0400 Subject: [ofa-general] Any easy way to specify to the SM to route/zone? In-Reply-To: References: Message-ID: On Mon, Apr 13, 2009 at 5:50 PM, Chris Worley wrote: >> Were the ports getting to LinkUp/Active before partitions were configured ? 
> > Yes, before I started trying to partition, all the nodes could > communicate... except they'd all use just one port on the server and I > couldn't get the throughput I needed. I suspect the switch SMA went south sometime after this. >> So no SM initialization is occurring there since they are still just in Init. > > Correct.  But, the SM is running. Nothing the SM can do when a device SMA malfunctions like that. >>> Partition "part2" with P_Key=2 should connect this client's port 0 to >>> the sever on port 1 of mlx4_1 >> >> Do you really mean port 0 ? > > Nope... in this case I have 0x0002c903000292b0 in part2 in my > partitions file, which is port 1, the second port of the adapter.  I'm > hoping to use both ports of all adapters on the server. So you're talking about physical marking on the card rather than actual (logical) port number. > After power-cycling the switch, the ports went "active"!  Note that I > didn't restart the SM... I just left it running. That should be fine. > So, on one client... the one corresponding to "part2" in the > partitions file, I put the P_Key into the "create child": > > echo 0x2 > /sys/class/net/ib0/create_child > > ... and did likewise on the host, for ib3 (the second port on the > second adapter): > > echo 0x2 > /sys/class/net/ib3/create_child I'm not 100% sure but I think you may need the full member PKey on at least one of them (0x800x). > Still, no ping (the interfaces are setup correctly). Are there still join failure messages on the client and/or server ? What do they say now ? -- Hal > Thanks, > > Chris > From hal.rosenstock at gmail.com Mon Apr 13 15:26:59 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 13 Apr 2009 18:26:59 -0400 Subject: [ofa-general] Any easy way to specify to the SM to route/zone? 
In-Reply-To: References: Message-ID: On Mon, Apr 13, 2009 at 6:24 PM, Hal Rosenstock wrote: > On Mon, Apr 13, 2009 at 5:50 PM, Chris Worley wrote: > > > >>> Were the ports getting to LinkUp/Active before partitions were configured ? >> >> Yes, before I started trying to partition, all the nodes could >> communicate... except they'd all use just one port on the server and I >> couldn't get the throughput I needed. > > I suspect the switch SMA went south sometime after this. > > > >>> So no SM initialization is occurring there since they are still just in Init. >> >> Correct.  But, the SM is running. > > Nothing the SM can do when a device SMA malfunctions like that. > > > > >>>> Partition "part2" with P_Key=2 should connect this client's port 0 to >>>> the sever on port 1 of mlx4_1 >>> >>> Do you really mean port 0 ? >> >> Nope... in this case I have 0x0002c903000292b0 in part2 in my >> partitions file, which is port 1, the second port of the adapter.  I'm >> hoping to use both ports of all adapters on the server. > > So you're talking about physical marking on the card rather than > actual (logical) port number. > > > > >> After power-cycling the switch, the ports went "active"!  Note that I >> didn't restart the SM... I just left it running. > > That should be fine. > >> So, on one client... the one corresponding to "part2" in the >> partitions file, I put the P_Key into the "create child": >> >> echo 0x2 > /sys/class/net/ib0/create_child >> >> ... and did likewise on the host, for ib3 (the second port on the >> second adapter): >> >> echo 0x2 > /sys/class/net/ib3/create_child > > I'm not 100% sure but I think you may need the full member PKey on at > least one of them (0x800x). Maybe the simplest thing to do is make them all full members but on separate partitions. There are some other issues with using the limited members I forgot about. Using full members but separate partitions should still isolate the clients. 
All nodes would still only be limited members of the default partition. -- Hal >> Still, no ping (the interfaces are setup correctly). > > Are there still join failure messages on the client and/or server ? > What do they say now ? > > -- Hal > >> Thanks, >> >> Chris >> > From worleys at gmail.com Mon Apr 13 16:39:36 2009 From: worleys at gmail.com (Chris Worley) Date: Mon, 13 Apr 2009 17:39:36 -0600 Subject: ***SPAM*** Re: [ofa-general] Any easy way to specify to the SM to route/zone? In-Reply-To: References: Message-ID: On Mon, Apr 13, 2009 at 4:24 PM, Hal Rosenstock wrote: > On Mon, Apr 13, 2009 at 5:50 PM, Chris Worley wrote: > > > >>> Were the ports getting to LinkUp/Active before partitions were configured ? >> >> Yes, before I started trying to partition, all the nodes could >> communicate... except they'd all use just one port on the server and I >> couldn't get the throughput I needed. > > I suspect the switch SMA went south sometime after this. I'm now power-cycling the switch for each partition change. >>>> Partition "part2" with P_Key=2 should connect this client's port 0 to >>>> the sever on port 1 of mlx4_1 >>> >>> Do you really mean port 0 ? >> >> Nope... in this case I have 0x0002c903000292b0 in part2 in my >> partitions file, which is port 1, the second port of the adapter.  I'm >> hoping to use both ports of all adapters on the server. > > So you're talking about physical marking on the card rather than > actual (logical) port number. I'm not sure about board markings... both ports are attached to the switch, for all IB adapters, so all should work. I'm using the numbers provided by ibstat. >> So, on one client... the one corresponding to "part2" in the >> partitions file, I put the P_Key into the "create child": >> >> echo 0x2 > /sys/class/net/ib0/create_child >> >> ... 
and did likewise on the host, for ib3 (the second port on the >> second adapter): >> >> echo 0x2 > /sys/class/net/ib3/create_child > > I'm not 100% sure but I think you may need the full member PKey on at > least one of them (0x800x). I've changed the P_Keys to 0x800x, and set the "create_child" files appropriately. > >> Still, no ping (the interfaces are setup correctly). > > Are there still join failure messages on the client and/or server ? > What do they say now ? Lot's of "bad P_Key" notices: Apr 13 17:32:56 649698 [F59A9A30] 0x03 -> OpenSM 3.2.5_20081207 Apr 13 17:32:56 649737 [F59A9A30] 0x80 -> OpenSM 3.2.5_20081207 Apr 13 17:32:56 650078 [F59A9A30] 0x02 -> osm_vendor_init: 1000 pending umads specified Apr 13 17:32:56 650201 [F59A9A30] 0x80 -> Entering DISCOVERING state Apr 13 17:32:56 660286 [F59A9A30] 0x02 -> osm_vendor_bind: Binding to port 0x2c90300026053 Apr 13 17:32:56 684519 [F59A9A30] 0x02 -> osm_vendor_bind: Binding to port 0x2c90300026053 Apr 13 17:32:56 703826 [470BE940] 0x80 -> Entering MASTER state Apr 13 17:32:56 704953 [470BE940] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches Apr 13 17:32:56 713917 [470BE940] 0x80 -> SUBNET UP Apr 13 17:32:57 112574 [452BB940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:2 num:257 (Bad P_Key) Producer:1 (Channel Adapter) from LID:1 TID:0x0000000000000741 Apr 13 17:32:57 112642 [452BB940] 0x02 -> osm_report_notice: Reporting Generic Notice type:2 num:257 (Bad P_Key) from LID:1 GID:fe80::2:c903:2:6053 Apr 13 17:32:57 282788 [416B5940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:2 num:259 (Bad P_Key (switch external port)) Producer:2 (Switch) from LID:11 TID:0x000000000000018e Apr 13 17:32:57 282817 [416B5940] 0x02 -> osm_report_notice: Reporting Generic Notice type:2 num:259 (Bad P_Key (switch external port)) from LID:11 GID:fe80::2:c902:40:46f8 Apr 13 17:32:58 280801 [42AB7940] 0x01 -> __osm_trap_rcv_process_request: Received Generic 
Notice type:2 num:259 (Bad P_Key (switch external port)) Producer:2 (Switch) from LID:11 TID:0x000000000000018f Apr 13 17:32:58 280828 [42AB7940] 0x02 -> osm_report_notice: Reporting Generic Notice type:2 num:259 (Bad P_Key (switch external port)) from LID:11 GID:fe80::2:c902:40:46f8 Apr 13 17:32:58 761835 [434B8940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:2 num:257 (Bad P_Key) Producer:1 (Channel Adapter) from LID:1 TID:0x0000000000000742 Apr 13 17:32:58 761858 [434B8940] 0x02 -> osm_report_notice: Reporting Generic Notice type:2 num:257 (Bad P_Key) from LID:1 GID:fe80::2:c903:2:6053 Apr 13 17:32:59 278816 [452BB940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:2 num:259 (Bad P_Key (switch external port)) Producer:2 (Switch) from LID:11 TID:0x0000000000000190 Apr 13 17:32:59 278835 [452BB940] 0x02 -> osm_report_notice: Reporting Generic Notice type:2 num:259 (Bad P_Key (switch external port)) from LID:11 GID:fe80::2:c902:40:46f8 Apr 13 17:33:00 276841 [416B5940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:2 num:259 (Bad P_Key (switch external port)) Producer:2 (Switch) from LID:11 TID:0x0000000000000191 Apr 13 17:33:00 276862 [416B5940] 0x02 -> osm_report_notice: Reporting Generic Notice type:2 num:259 (Bad P_Key (switch external port)) from LID:11 GID:fe80::2:c902:40:46f8 Apr 13 17:33:03 459759 [42AB7940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:2 num:257 (Bad P_Key) Producer:1 (Channel Adapter) from LID:1 TID:0x0000000000000743 Apr 13 17:33:03 459785 [42AB7940] 0x02 -> osm_report_notice: Reporting Generic Notice type:2 num:257 (Bad P_Key) from LID:1 GID:fe80::2:c903:2:6053 Apr 13 17:33:04 268908 [434B8940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:2 num:259 (Bad P_Key (switch external port)) Producer:2 (Switch) from LID:11 TID:0x0000000000000192 Apr 13 17:33:04 268927 [434B8940] 0x02 -> osm_report_notice: Reporting Generic Notice 
type:2 num:259 (Bad P_Key (switch external port)) from LID:11 GID:fe80::2:c902:40:46f8 Apr 13 17:33:05 266929 [452BB940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:2 num:259 (Bad P_Key (switch external port)) Producer:2 (Switch) from LID:11 TID:0x0000000000000193 Apr 13 17:33:05 266950 [452BB940] 0x02 -> osm_report_notice: Reporting Generic Notice type:2 num:259 (Bad P_Key (switch external port)) from LID:11 GID:fe80::2:c902:40:46f8 Apr 13 17:33:10 456664 [420B6940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:2 num:257 (Bad P_Key) Producer:1 (Channel Adapter) from LID:1 TID:0x0000000000000744 Apr 13 17:33:10 456690 [420B6940] 0x02 -> osm_report_notice: Reporting Generic Notice type:2 num:257 (Bad P_Key) from LID:1 GID:fe80::2:c903:2:6053 Apr 13 17:33:11 255037 [43EB9940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:2 num:259 (Bad P_Key (switch external port)) Producer:2 (Switch) from LID:11 TID:0x0000000000000194 Apr 13 17:33:11 255083 [43EB9940] 0x02 -> osm_report_notice: Reporting Generic Notice type:2 num:259 (Bad P_Key (switch external port)) from LID:11 GID:fe80::2:c902:40:46f8 Apr 13 17:33:12 253054 [45CBC940] 0x01 -> __osm_trap_rcv_process_request: Received Generic Notice type:2 num:259 (Bad P_Key (switch external port)) Producer:2 (Switch) from LID:11 TID:0x0000000000000195 Chris From devel-ofed at morey-chaisemartin.com Tue Apr 14 00:00:19 2009 From: devel-ofed at morey-chaisemartin.com (Nicolas Morey-Chaisemartin) Date: Tue, 14 Apr 2009 09:00:19 +0200 Subject: [ofa-general] ***SPAM*** Re: [PATCH] opensm/osm_ucast_ftree.c: lids are always handled in host order In-Reply-To: <20090413201100.GE5521@sk> References: <49E21234.2060906@dev.mellanox.co.il> <20090413201100.GE5521@sk> Message-ID: <49E43483.5040909@morey-chaisemartin.com> Are you sure it was applied? I just pulled from your repo and it doesn't seem to be there. 
Nicolas On 13/04/2009 22:11, Sasha Khapyorsky wrote: > On 19:09 Sun 12 Apr , Yevgeny Kliteynik wrote: >> Hi Sasha, >> >> There's a mess in host vs. network order in lids handling in ftree. >> In the vast majority of cases the lid is required to be in host order, >> so there are many cl_ntoh16() conversions. >> Fixing it to be always in host order. >> >> Signed-off-by: Yevgeny Kliteynik > > Applied. Thanks. > > Sasha > > From sashak at voltaire.com Tue Apr 14 00:10:08 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Apr 2009 10:10:08 +0300 Subject: Re: [ofa-general] Re: [PATCH] opensm/osm_ucast_ftree.c: lids are always handled in host order In-Reply-To: <49E43483.5040909@morey-chaisemartin.com> References: <49E21234.2060906@dev.mellanox.co.il> <20090413201100.GE5521@sk> <49E43483.5040909@morey-chaisemartin.com> Message-ID: <20090414071008.GG5521@sk> On 09:00 Tue 14 Apr , Nicolas Morey-Chaisemartin wrote: > Are you sure it was applied? I hadn't pushed it yet. It is done now, Sasha From sashak at voltaire.com Tue Apr 14 01:08:15 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Apr 2009 11:08:15 +0300 Subject: [ofa-general] Re: [PATCH] Enhance min hops counters in Ftree In-Reply-To: <49C7A927.1040304@ext.bull.net> References: <49C7A927.1040304@ext.bull.net> Message-ID: <20090414080815.GA14033@sk> Hi Nicolas, On 16:22 Mon 23 Mar , Nicolas Morey Chaisemartin wrote: > This patch enhances the use of the min hop table done in the Fat-Tree algorithm. > Before this patch, the algorithm was using the osm_sw hops table to store the minhop values toward any lid (Switch or not). 
> As this table is allocated as we need it, it required a lot of malloc calls and quite some time to set the hops values on remote ports. > > This patch corrects this behaviour: > -The osm_sw hops table is only used for switch lid > -ftree_sw_t struct now has its own hop table (only 1 dimensional as we don't need to know which port is used) to store its minhop value > > Signed-off-by: Nicolas Morey-Chaisemartin > --- > Here are some performance tests I ran with and without this patch: > For ~3500 nodes: > Without = 0.981s / 169.5MB used memory > With = 0.549s / 123MB used memory > > For over 5000 nodes: > Without = 2.25s / 308MB used memory > With = 1.29s / 186MB used memory > > Computation time is measured in the do_routing function (so it covers only the computation, not LFT setting time or anything else) Nice numbers! However I have a question below. > > opensm/opensm/osm_ucast_ftree.c | 68 +++++++++++++++++++++++++++++++-------- > 1 files changed, 54 insertions(+), 14 deletions(-) > > diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c > index 58d1c14..a5da571 100644 > --- a/opensm/opensm/osm_ucast_ftree.c > +++ b/opensm/opensm/osm_ucast_ftree.c > @@ -172,6 +172,7 @@ typedef struct ftree_sw_t_ { > uint8_t up_port_groups_num; > boolean_t is_leaf; > unsigned down_port_groups_idx; > + uint8_t *hops; > } ftree_sw_t; > > /*************************************************** > @@ -553,6 +554,9 @@ static ftree_sw_t *sw_create(IN ftree_fabric_t * p_ftree, > > /* initialize lft buffer */ > memset(p_osm_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); > + p_sw->hops = > + malloc(cl_ntoh16(p_osm_sw->max_lid_ho) * sizeof(*(p_sw->hops))); > + memset(p_sw->hops, OSM_NO_PATH, cl_ntoh16(p_osm_sw->max_lid_ho)); AFAIR p_osm_sw->max_lid_ho is already in host byte order and cl_ntoh16() should swap bytes, how then did it work? Also malloc() may fail and return value check is needed. 
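[Editorial sketch] The two review points above (no byte swap on a value that is already in host order, and a checked malloc()) can be illustrated outside of OpenSM as follows. This is a standalone sketch under stated assumptions, not the patch itself: hops_table_alloc and swap16 are hypothetical helpers, NO_PATH is a stand-in for OpenSM's OSM_NO_PATH with an assumed value of 0xFF, and sizing the table as max_lid_ho + 1 assumes that max_lid_ho is itself a valid index.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NO_PATH 0xFF  /* stand-in for OSM_NO_PATH; value assumed here */

/* Hypothetical helper: allocate a one-dimensional hop table indexed by LID.
 * max_lid_ho is taken as-is because it is already in host byte order.
 * Returns NULL if the allocation fails, so the caller must check. */
static uint8_t *hops_table_alloc(uint16_t max_lid_ho)
{
    size_t n = (size_t)max_lid_ho + 1;  /* assumption: LIDs 0..max_lid_ho */
    uint8_t *hops = malloc(n * sizeof(*hops));

    if (hops == NULL)
        return NULL;
    memset(hops, NO_PATH, n);
    return hops;
}

/* What a 16-bit network/host conversion does on a little-endian host. */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}
```

A spurious swap can go unnoticed when it happens to enlarge the value: swap16(0x0010) is 0x1000, so the table is merely over-allocated, whereas swap16(0x0100) is 0x0001 and the table would be far too small.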
Sasha From devel-ofed at morey-chaisemartin.com Tue Apr 14 01:23:03 2009 From: devel-ofed at morey-chaisemartin.com (Nicolas Morey-Chaisemartin) Date: Tue, 14 Apr 2009 10:23:03 +0200 Subject: [ofa-general] Re: [PATCH] Enhance min hops counters in Ftree In-Reply-To: <20090414080815.GA14033@sk> References: <49C7A927.1040304@ext.bull.net> <20090414080815.GA14033@sk> Message-ID: <49E447E7.2090107@morey-chaisemartin.com> On 14/04/2009 10:08, Sasha Khapyorsky wrote: >> /*************************************************** >> @@ -553,6 +554,9 @@ static ftree_sw_t *sw_create(IN ftree_fabric_t * p_ftree, >> >> /* initialize lft buffer */ >> memset(p_osm_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); >> + p_sw->hops = >> + malloc(cl_ntoh16(p_osm_sw->max_lid_ho) * sizeof(*(p_sw->hops))); >> + memset(p_sw->hops, OSM_NO_PATH, cl_ntoh16(p_osm_sw->max_lid_ho)); > > AFAIR p_osm_sw->max_lid_ho is already in host byte order and cl_ntoh16() > should swap bytes, how then did it work? This is actually a bug. In my tests it seems the network-order value was always greater than the host-order value, so it worked. I'm fixing this. > > Also malloc() may fail and return value check is needed. I'll add this too. Anyway, I have to rewrite a few things because of the last patch from Yevgeny, which changes the network/host order usage. Nicolas From nicolas.morey-chaisemartin at ext.bull.net Tue Apr 14 01:48:17 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin) Date: Tue, 14 Apr 2009 10:48:17 +0200 Subject: [ofa-general] [PATCH v2] opensm/osm_ucat_ftree.c Enhance min hops counters usage Message-ID: <49E44DD1.1050508@ext.bull.net> This patch enhances the use of the min hop table done in the Fat-Tree algorithm. Before this patch, the algorithm was using the osm_sw hops table to store the minhop values toward any lid (Switch or not). As this table is allocated as we need it, it required a lot of malloc calls and quite some time to set the hops values on remote ports. 
This patch corrects this behaviour: -The osm_sw hops table is only used for switch lid -ftree_sw_t struct now has its own hop table (only 1 dimensionnal as we don't need to know which port is used) to store its minhop value Signed-off-by: Nicolas Morey-Chaisemartin --- Fixed to work after commit a10b57a2de9ace61455176ad5e43b7ca3d148cfb opensm/osm_ucast_ftree.c: lids are always handled in host order. Memory allocation fixed (using right byte order + checking if succesfull) opensm/opensm/osm_ucast_ftree.c | 66 +++++++++++++++++++++++++++++++-------- 1 files changed, 53 insertions(+), 13 deletions(-) diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c index dfe7009..83c901e 100644 --- a/opensm/opensm/osm_ucast_ftree.c +++ b/opensm/opensm/osm_ucast_ftree.c @@ -172,6 +172,7 @@ typedef struct ftree_sw_t_ { uint8_t up_port_groups_num; boolean_t is_leaf; unsigned down_port_groups_idx; + uint8_t *hops; } ftree_sw_t; /*************************************************** @@ -554,6 +555,11 @@ static ftree_sw_t *sw_create(IN ftree_fabric_t * p_ftree, /* initialize lft buffer */ memset(p_osm_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); + p_sw->hops = + malloc(p_osm_sw->max_lid_ho * sizeof(*(p_sw->hops))); + if(p_sw->hops == NULL) + return NULL; + memset(p_sw->hops, OSM_NO_PATH, p_osm_sw->max_lid_ho); return p_sw; } /* sw_create() */ @@ -566,6 +572,7 @@ static void sw_destroy(IN ftree_fabric_t * p_ftree, IN ftree_sw_t * p_sw) if (!p_sw) return; + free(p_sw->hops); for (i = 0; i < p_sw->down_port_groups_num; i++) port_group_destroy(p_sw->down_port_groups[i]); @@ -693,32 +700,54 @@ static void sw_add_port(IN ftree_sw_t * p_sw, IN uint8_t port_num, /***************************************************/ static inline cl_status_t sw_set_hops(IN ftree_sw_t * p_sw, IN uint16_t lid, - IN uint8_t port_num, IN uint8_t hops) + IN uint8_t port_num, IN uint8_t hops, + IN boolean_t is_target_sw) { /* set local min hop table(LID) */ - return 
osm_switch_set_hops(p_sw->p_osm_sw, lid, port_num, hops); + p_sw->hops[lid] = hops; + if (is_target_sw) + return osm_switch_set_hops(p_sw->p_osm_sw, lid, port_num, hops); + return 0; } /***************************************************/ static int set_hops_on_remote_sw(IN ftree_port_group_t * p_group, - IN uint16_t target_lid, IN uint8_t hops) + IN uint16_t target_lid, IN uint8_t hops, + IN boolean_t is_target_sw) { ftree_port_t *p_port; uint8_t i, ports_num; ftree_sw_t *p_remote_sw = p_group->remote_hca_or_sw.p_sw; + /* if lid is a switch, we set the min hop table in the osm_switch struct */ CL_ASSERT(p_group->remote_node_type == IB_NODE_TYPE_SWITCH); + p_remote_sw->hops[target_lid] = hops; + + /* If taget lid is a switch we set the min hop table values + * for each port on the associated osm_sw struct */ + if (!is_target_sw) + return 0; + ports_num = (uint8_t) cl_ptr_vector_get_size(&p_group->ports); for (i = 0; i < ports_num; i++) { cl_ptr_vector_at(&p_group->ports, i, (void *)&p_port); if (sw_set_hops(p_remote_sw, target_lid, - p_port->remote_port_num, hops)) + p_port->remote_port_num, hops, is_target_sw)) return -1; } return 0; } +/***************************************************/ + +static inline uint8_t +sw_get_least_hops(IN ftree_sw_t * p_sw, IN uint16_t target_lid) +{ + CL_ASSERT(p_sw->hops != NULL); + return p_sw->hops[target_lid]; +} + /*************************************************** ** ** ftree_hca_t functions @@ -1878,6 +1907,7 @@ fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, IN uint8_t target_rank, IN boolean_t is_real_lid, IN boolean_t is_main_path, + IN boolean_t is_target_a_sw, IN uint8_t highest_rank_in_route, IN uint16_t reverse_hops) { @@ -1940,8 +1970,7 @@ fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, set LFT(target_lid) on the remote switch to the remote port */ p_remote_sw = p_group->remote_hca_or_sw.p_sw; - if (osm_switch_get_least_hops(p_remote_sw->p_osm_sw, - target_lid) != OSM_NO_PATH) { + if 
(sw_get_least_hops(p_remote_sw, target_lid) != OSM_NO_PATH) { /* Loop in the fabric - we already routed the remote switch on our way UP, and now we see it again on our way DOWN */ OSM_LOG(&p_ftree->p_osm->log, OSM_LOG_DEBUG, @@ -2007,7 +2036,8 @@ fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, highest_rank_in_route) + (p_remote_sw->rank - highest_rank_in_route) + - reverse_hops * 2)); + reverse_hops * 2), + is_target_a_sw); } /* The number of upgoing routes is tracked in the @@ -2026,6 +2056,7 @@ fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree, target_rank, /* rank of the LID that we're routing to */ is_real_lid, /* whether the target LID is real or dummy */ is_main_path, /* whether this is path to HCA that should by tracked by counters */ + is_target_a_sw, /* Wheter target lid is a switch or not */ highest_rank_in_route, reverse_hops); /* highest visited point in the tree before going down */ } /* done scanning all the down-going port groups */ @@ -2060,6 +2091,7 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, IN uint8_t target_rank, IN boolean_t is_real_lid, IN boolean_t is_main_path, + IN boolean_t is_target_a_sw, IN uint16_t reverse_hop_credit, IN uint16_t reverse_hops) { @@ -2082,6 +2114,7 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, target_rank, /* rank of the LID that we're routing to */ is_real_lid, /* whether this target LID is real or dummy */ is_main_path, /* whether this path to HCA should by tracked by counters */ + is_target_a_sw, /* Wheter target lid is a switch or not */ p_sw->rank, /* the highest visited point in the tree before going down */ reverse_hops); /* Number of reverse_hops done up to this point */ @@ -2111,6 +2144,7 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, target_rank, /* rank of the LID that we're routing to */ is_real_lid, /* whether this target LID is real or dummy */ is_main_path, /* whether this is 
path to HCA that should by tracked by counters */ + is_target_a_sw, /* Wheter target lid is a switch or not */ reverse_hop_credit - 1, /* Remaining reverse_hops allowed */ reverse_hops + 1); /* Number of reverse_hops done up to this point */ } @@ -2221,7 +2255,7 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, set_hops_on_remote_sw(p_min_group, target_lid, target_rank - p_remote_sw->rank + - 2 * reverse_hops); + 2 * reverse_hops, is_target_a_sw); } /* Recursion step: @@ -2232,6 +2266,7 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, target_rank, /* rank of the LID that we're routing to */ is_real_lid, /* whether this target LID is real or dummy */ is_main_path, /* whether this is path to HCA that should by tracked by counters */ + is_target_a_sw, /* Wheter target lid is a switch or not */ reverse_hop_credit, /* Remaining reverse_hops allowed */ reverse_hops); /* Number of reverse_hops done up to this point */ } @@ -2304,7 +2339,7 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, set_hops_on_remote_sw(p_group, target_lid, target_rank - p_remote_sw->rank + - 2 * reverse_hops); + 2 * reverse_hops, is_target_a_sw); /* Recursion step: Assign downgoing ports by stepping up, starting on REMOTE switch. 
*/ @@ -2314,6 +2349,7 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, target_rank, /* rank of the LID that we're routing to */ TRUE, /* whether the target LID is real or dummy */ FALSE, /* whether this is path to HCA that should by tracked by counters */ + is_target_a_sw, /* Wheter target lid is a switch or not */ reverse_hop_credit, /* Remaining reverse_hops allowed */ reverse_hops); /* Number of reverse_hops done up to this point */ } @@ -2342,6 +2378,7 @@ static void fabric_route_downgoing_by_going_up(IN ftree_fabric_t * p_ftree, target_rank, /* rank of the LID that we're routing to */ TRUE, /* whether the target LID is real or dummy */ TRUE, /* whether this is path to HCA that should by tracked by counters */ + is_target_a_sw, /* Wheter target lid is a switch or not */ reverse_hop_credit - 1, /* Remaining reverse_hops allowed */ reverse_hops + 1); /* Number of reverse_hops done up to this point */ } @@ -2420,8 +2457,7 @@ static void fabric_route_to_cns(IN ftree_fabric_t * p_ftree) hca_lid, p_port->port_num); /* set local min hop table(LID) to route to the CA */ - sw_set_hops(p_sw, hca_lid, - p_port->port_num, 1); + sw_set_hops(p_sw, hca_lid, p_port->port_num, 1, FALSE); /* Assign downgoing ports by stepping up. 
Since we're routing here only CNs, we're routing it as REAL @@ -2432,6 +2468,7 @@ static void fabric_route_to_cns(IN ftree_fabric_t * p_ftree) p_sw->rank + 1, /* rank of the LID that we're routing to */ TRUE, /* whether this HCA LID is real or dummy */ TRUE, /* whether this path to HCA should by tracked by counters */ + FALSE, /* wheter target lid is a switch or not */ 0, /* Number of reverse hops allowed */ 0); /* Number of reverse hops done yet */ @@ -2459,6 +2496,7 @@ static void fabric_route_to_cns(IN ftree_fabric_t * p_ftree) 0, /* rank of the LID that we're routing to - ignored for dummy HCA */ FALSE, /* whether this HCA LID is real or dummy */ TRUE, /* whether this path to HCA should by tracked by counters */ + FALSE, /* Wheter the target LID is a switch or not */ 0, /* Number of reverse hops allowed */ 0); /* Number of reverse hops done yet */ } @@ -2533,7 +2571,7 @@ static void fabric_route_to_non_cns(IN ftree_fabric_t * p_ftree) /* set local min hop table(LID) to route to the CA */ sw_set_hops(p_sw, hca_lid, port_num_on_switch, /* port num */ - 1); /* hops */ + 1, FALSE); /* hops */ /* Assign downgoing ports by stepping up. We're routing REAL targets. They are not CNs and not included @@ -2545,6 +2583,7 @@ static void fabric_route_to_non_cns(IN ftree_fabric_t * p_ftree) p_sw->rank + 1, /* rank of the LID that we're routing to */ TRUE, /* whether this HCA LID is real or dummy */ TRUE, /* whether this path to HCA should by tracked by counters */ + FALSE, /* Wheter the target LID is a switch or not */ p_hca_port_group->is_io ? 
p_ftree->p_osm->subn.opt.max_reverse_hops : 0, /* Number or reverse hops allowed */ 0); /* Number or reverse hops done yet */ } @@ -2589,7 +2628,7 @@ static void fabric_route_to_switches(IN ftree_fabric_t * p_ftree) /* set min hop table of the switch to itself */ sw_set_hops(p_sw, p_sw->base_lid, 0, /* port_num */ - 0); /* hops */ + 0, TRUE); /* hops */ fabric_route_downgoing_by_going_up(p_ftree, p_sw, /* local switch - used as a route-downgoing alg. start point */ NULL, /* prev. position switch */ @@ -2597,6 +2636,7 @@ static void fabric_route_to_switches(IN ftree_fabric_t * p_ftree) p_sw->rank, /* rank of the LID that we're routing to */ TRUE, /* whether the target LID is a real or dummy */ FALSE, /* whether this path to HCA should by tracked by counters */ + TRUE, /* Wheter the target LID is a switch or not */ 0, /* Number of reverse hops allowed */ 0); /* Number of reverse hops done yet */ } -- 1.6.2-rc2.GIT From sashak at voltaire.com Tue Apr 14 01:51:27 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Apr 2009 11:51:27 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] opensm/include/ib_types.h: Add ib_switch_info_get_state_opt_sl2vlmapping routine In-Reply-To: <20090324194121.GA1010@comcast.net> References: <20090324194121.GA1010@comcast.net> Message-ID: <20090414085127.GB14033@sk> On 14:41 Tue 24 Mar , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Tue Apr 14 01:57:37 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Apr 2009 11:57:37 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH v2] opensm/osm_ucat_ftree.c Enhance min hops counters usage In-Reply-To: <49E44DD1.1050508@ext.bull.net> References: <49E44DD1.1050508@ext.bull.net> Message-ID: <20090414085737.GC14033@sk> On 10:48 Tue 14 Apr , Nicolas Morey Chaisemartin wrote: > This patch enhances the use of the min hop table done in the Fat-Tree algorithm. 
> Before this patch, the algorithm was using the osm_sw hops table to store the minhop values toward any lid (Switch or not). > As this table is allocated as we need it, it required a lot of malloc calls and quite some time to set the hops values on remote ports. > > This patch corrects this behaviour: > -The osm_sw hops table is only used for switch lid > -ftree_sw_t struct now has its own hop table (only 1 dimensionnal as we don't need to know which port is used) to store its minhop value > > > Signed-off-by: Nicolas Morey-Chaisemartin Applied. Thanks. Sasha From sashak at voltaire.com Tue Apr 14 02:03:40 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Apr 2009 12:03:40 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH][TRIVIAL] opensm/osm_qos.c: Cosmetic formatting changes In-Reply-To: <20090324221655.GA3495@comcast.net> References: <20090324221655.GA3495@comcast.net> Message-ID: <20090414090340.GD14033@sk> On 17:16 Tue 24 Mar , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. I also formatted this using osm_indent. Sasha From sashak at voltaire.com Tue Apr 14 02:09:42 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Apr 2009 12:09:42 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCHv2][TRIVIAL] opensm/osm_qos.c: Cosmetic formatting changes In-Reply-To: <20090325152103.GA8376@comcast.net> References: <20090325152103.GA8376@comcast.net> Message-ID: <20090414090942.GE14033@sk> On 10:21 Wed 25 Mar , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Rebased and applied. Thanks. 
Sasha From sashak at voltaire.com Tue Apr 14 02:11:00 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Apr 2009 12:11:00 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH][TRIVIAL] opensm/osm_slvl_map_rcv.c: Cosmetic formatting changes In-Reply-To: <20090325152153.GB8376@comcast.net> References: <20090325152153.GB8376@comcast.net> Message-ID: <20090414091100.GF14033@sk> On 10:21 Wed 25 Mar , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Tue Apr 14 02:12:36 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Apr 2009 12:12:36 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH][TRIVIAL] opensm/osm_link_mgr.c: Remove extraneous parentheses In-Reply-To: <20090325152246.GA8386@comcast.net> References: <20090325152246.GA8386@comcast.net> Message-ID: <20090414091236.GG14033@sk> On 10:22 Wed 25 Mar , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Tue Apr 14 02:23:06 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Apr 2009 12:23:06 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] libibmad/fields.c: Display CounterSelect2 in hex rather than decimal In-Reply-To: <20090326163908.GA31265@comcast.net> References: <20090326163908.GA31265@comcast.net> Message-ID: <20090414092306.GH14033@sk> On 11:39 Thu 26 Mar , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Tue Apr 14 02:24:20 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Apr 2009 12:24:20 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH][TRIVIAL] opensm/osm_ucast_mgr.c: Cosmetic formatting change In-Reply-To: <20090326163956.GA31271@comcast.net> References: <20090326163956.GA31271@comcast.net> Message-ID: <20090414092420.GI14033@sk> On 11:39 Thu 26 Mar , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From sashak at voltaire.com Tue Apr 14 02:47:16 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Apr 2009 12:47:16 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] opensm/osm_helper.c: Add more info for traps 144 and 256-259 in osm_dump_notice In-Reply-To: <20090318121807.GA5284@comcast.net> References: <20090318121807.GA5284@comcast.net> Message-ID: <20090414094716.GJ14033@sk> Hi Hal, On 07:18 Wed 18 Mar , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock > > --- > diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c > index b40ba0c..593bbf7 100644 > --- a/opensm/opensm/osm_helper.c > +++ b/opensm/opensm/osm_helper.c > @@ -2,6 +2,7 @@ > * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. > * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. > * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. > + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. You may choose to be licensed under the terms of the GNU > @@ -1720,7 +1721,10 @@ osm_dump_notice(IN osm_log_t * const p_log, > { > if (osm_log_is_active(p_log, log_level)) { > if (ib_notice_is_generic(p_ntci)) { > + int i; > char buff[1024]; > + char line[BUF_SIZE]; There is no need for two buffers. See below. > + > buff[0] = '\0'; > > /* immediate data based on the trap */ > @@ -1758,11 +1762,16 @@ osm_dump_notice(IN osm_log_t * const p_log, > case 144: > sprintf(buff, > "\t\t\t\tlid......................%u\n" > - "\t\t\t\tnew_cap_mask.............0x%08x\n", > + "\t\t\t\tlocal_changes............%d\n" Likely %u would be more suitable here. > + "\t\t\t\tnew_cap_mask.............0x%08x\n" > + "\t\t\t\tchange_flags.............0x%x\n", > cl_ntoh16(p_ntci->data_details.ntc_144. > lid), > + p_ntci->data_details.ntc_144.local_changes, > cl_ntoh32(p_ntci->data_details.ntc_144. 
> - new_cap_mask)); > + new_cap_mask), > + cl_ntoh16(p_ntci->data_details.ntc_144. > + change_flgs)); > break; > case 145: > sprintf(buff, > @@ -1774,6 +1783,95 @@ osm_dump_notice(IN osm_log_t * const p_log, > cl_ntoh64(p_ntci->data_details.ntc_145. > new_sys_guid)); > break; > + case 256: > + sprintf(buff, > + "\t\t\t\tlid......................%u\n" > + "\t\t\t\tdrslid...................%u\n" > + "\t\t\t\tmethod...................0x%x\n" > + "\t\t\t\tattr_id..................0x%x\n" > + "\t\t\t\tattr_mod.................0x%x\n" > + "\t\t\t\tm_key....................0x%016" PRIx64 "\n" > + "\t\t\t\tdr_notice................%d\n" > + "\t\t\t\tdr_path_truncated........%d\n" > + "\t\t\t\tdr_hop_count.............%u\n", > + cl_ntoh16(p_ntci->data_details.ntc_256.lid), > + cl_ntoh16(p_ntci->data_details.ntc_256.dr_slid), > + p_ntci->data_details.ntc_256.method, > + cl_ntoh16(p_ntci->data_details.ntc_256.attr_id), > + cl_ntoh32(p_ntci->data_details.ntc_256.attr_mod), > + cl_ntoh64(p_ntci->data_details.ntc_256.mkey), > + p_ntci->data_details.ntc_256.dr_trunc_hop >> 7, > + p_ntci->data_details.ntc_256.dr_trunc_hop >> 6, > + p_ntci->data_details.ntc_256.dr_trunc_hop & 0x3f); > + sprintf(line, "Directed Path Dump of %u hop path:" > + "\n\t\t\t\tPath = ", > + p_ntci->data_details.ntc_256.dr_trunc_hop & 0x3f); > + strcat(buff, line); Instead of buffer copying it is simpler to use sprintf() return value: n = sprintf(buf, "blah-blah...", ...); sprintf(buf + n, "another blah-blah...", ...); And using snprintf() will be even safer. > + for (i = 0; > + i <= (p_ntci->data_details.ntc_256.dr_trunc_hop & 0x3f); > + i++) { > + if (i == 0) > + sprintf(line, "%d", > + p_ntci->data_details.ntc_256.dr_rtn_path[i]); > + else > + sprintf(line, ",%d", > + p_ntci->data_details.ntc_256.dr_rtn_path[i]); > + strcat(buff, line); > + } Ditto. 
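To illustrate the suggestion above, here is a minimal sketch of offset tracking with the snprintf() return value. It is not code from the patch or from OpenSM: format_path() and its parameters are hypothetical names chosen for the example, and it only assumes the path is an array of hops + 1 port numbers, as in the loop quoted above.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of OpenSM): format path[0..hops] as
 * "a,b,c" into buf, advancing the write offset by the value that
 * snprintf() returns. Returns the number of characters written
 * (excluding the terminating NUL), or -1 if the output does not fit.
 */
static int format_path(char *buf, size_t size, const unsigned char *path,
		       unsigned hops)
{
	size_t n = 0;
	unsigned i;

	if (size == 0)
		return -1;
	buf[0] = '\0';

	for (i = 0; i <= hops; i++) {
		int ret = snprintf(buf + n, size - n, i ? ",%u" : "%u",
				   (unsigned)path[i]);

		/* negative: encoding error; >= remaining space: truncated */
		if (ret < 0 || (size_t)ret >= size - n)
			return -1;
		n += (size_t)ret;
	}
	return (int)n;
}
```

The point is that every call both writes at buf + n and advances n, so no second buffer and no strcat() are needed, and the size argument keeps a long path from overrunning the buffer.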
> + break; > + case 257: > + case 258: > + sprintf(buff, > + "\t\t\t\tlid1.....................%u\n" > + "\t\t\t\tlid2.....................%u\n" > + "\t\t\t\tkey......................0x%x\n" > + "\t\t\t\tsl.......................%d\n" > + "\t\t\t\tqp1......................0x%x\n" > + "\t\t\t\tqp2......................0x%x\n" > + "\t\t\t\tgid1.....................0x%016" PRIx64 " : " > + "0x%016" PRIx64 "\n" > + "\t\t\t\tgid2.....................0x%016" PRIx64 " : " > + "0x%016" PRIx64 "\n", Isn't IPv6 address format preferable for GIDs printing? Sasha From sashak at voltaire.com Tue Apr 14 03:23:21 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Apr 2009 13:23:21 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] ibsim: support xmitwait counters In-Reply-To: <49D0B659.2040107@Voltaire.COM> References: <49D0B659.2040107@Voltaire.COM> Message-ID: <20090414102321.GA5519@sk> On 15:08 Mon 30 Mar , Doron Shoham wrote: > support xmitwait counters > > Signed-off-by: Doron Shoham Applied. Thanks. What are the plans to use it (now unlike other counters XmitWait will be always zero)? 
Sasha From vlad at lists.openfabrics.org Tue Apr 14 03:32:44 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 14 Apr 2009 03:32:44 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090414-0200 daily build status Message-ID: <20090414103244.996A8E612D4@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.21.1 Passed on 
ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From dzieko at wcss.pl Tue Apr 14 03:38:42 2009 From: dzieko at wcss.pl (Pawel Dziekonski) Date: Tue, 14 Apr 2009 12:38:42 +0200 Subject: [ofa-general] ib_mthca 0000:0d:00.0: Async event 16 for bogus QP 00da0407 In-Reply-To: <200904112205.12254.bs_lists@aakef.fastmail.fm> References: <200904022007.20630.bs_lists@aakef.fastmail.fm> <20090406105424.GC6165@cefeid.wcss.wroc.pl> <200904112205.12254.bs_lists@aakef.fastmail.fm> Message-ID: <20090414103842.GC30191@cefeid.wcss.wroc.pl> On Sat, 11 Apr 2009 at 10:05:11PM +0200, Bernd Schubert wrote: > just out of interest, which applications are using IPoIB? My main example is GAMESS, which you know probably. GAMESS is using sockets, but I did not have time to check SDP. There is also a possibility to use MPI but it fails on jobs using very large memory (~1.5TB). Pawel -- Pawel Dziekonski Wroclaw Centre for Networking & Supercomputing, HPC Department Politechnika Wr., pl. Grunwaldzki 9, bud. D2/101, 50-377 Wroclaw, POLAND phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl From hal.rosenstock at gmail.com Tue Apr 14 04:17:23 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 14 Apr 2009 07:17:23 -0400 Subject: [ofa-general] Any easy way to specify to the SM to route/zone? In-Reply-To: References: Message-ID: On Mon, Apr 13, 2009 at 7:39 PM, Chris Worley wrote: >>>>> Partition "part2" with P_Key=2 should connect this client's port 0 to >>>>> the sever on port 1 of mlx4_1 >>>> >>>> Do you really mean port 0 ? >>> >>> Nope... in this case I have 0x0002c903000292b0 in part2 in my >>> partitions file, which is port 1, the second port of the adapter.  
I'm >>> hoping to use both ports of all adapters on the server. >> >> So you're talking about physical marking on the card rather than >> actual (logical) port number. > > I'm not sure about board markings... both ports are attached to the > switch, for all IB adapters, so all should work.  I'm using the > numbers provided by ibstat. OK but you did mention port 0 on the HCA. >>> So, on one client... the one corresponding to "part2" in the >>> partitions file, I put the P_Key into the "create child": >>> >>> echo 0x2 > /sys/class/net/ib0/create_child >>> >>> ... and did likewise on the host, for ib3 (the second port on the >>> second adapter): >>> >>> echo 0x2 > /sys/class/net/ib3/create_child >> >> I'm not 100% sure but I think you may need the full member PKey on at >> least one of them (0x800x). > > I've changed the P_Keys to 0x800x, and set the "create_child" files > appropriately. > >> >>> Still, no ping (the interfaces are setup correctly). >> >> Are there still join failure messages on the client and/or server ? >> What do they say now ? > > Lot's of "bad P_Key" notices: Yes, that's why I said to change it all to full membership (on separate partitions) with the default partition having everything but SM as limited. -- Hal From sashak at voltaire.com Tue Apr 14 05:00:07 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Apr 2009 15:00:07 +0300 Subject: [ofa-general] Re: [PATCH] Added send trap for trap 129 (local link integrity) In-Reply-To: <49BE19E6.6060101@gmail.com> References: <49BE19E6.6060101@gmail.com> Message-ID: <20090414120007.GB5519@sk> On 11:20 Mon 16 Mar , Eli Dorfman (Voltaire) wrote: > Added send trap for trap 129 (local link integrity). > > Signed-off-by: Julia Volynsky Applied. Thanks. Next time when you sending patch prepared by somebody else please mark it using 'From: ' line. Also ideally you can add your SOB line too. 
For more details look there: /usr/src/linux/Documentation/SubmittingPatches Sasha From sashak at voltaire.com Tue Apr 14 05:15:37 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Apr 2009 15:15:37 +0300 Subject: [ofa-general] ***SPAM*** [PATCH] infiniband-diags/ibsendtrap: code consolidation In-Reply-To: <20090414120007.GB5519@sk> References: <49BE19E6.6060101@gmail.com> <20090414120007.GB5519@sk> Message-ID: <20090414121537.GC5519@sk> Code consolidation to prevent duplications. Signed-off-by: Sasha Khapyorsky --- infiniband-diags/src/ibsendtrap.c | 95 ++++++++++++++----------------------- 1 files changed, 36 insertions(+), 59 deletions(-) diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c index b50cc61..51f2327 100644 --- a/infiniband-diags/src/ibsendtrap.c +++ b/infiniband-diags/src/ibsendtrap.c @@ -1,5 +1,6 @@ /* * Copyright (c) 2008 Lawrence Livermore National Security + * Copyright (c) 2009 Voltaire Inc. All rights reserved. * * Produced at Lawrence Livermore National Laboratory. * Written by Ira Weiny . 
@@ -51,43 +52,32 @@ struct ibmad_port *srcport; /* for local link integrity */ int error_port = 1; -static int send_144_node_desc_update(void) +static void build_trap144(ib_mad_notice_attr_t * n, uint16_t lid) { - ib_portid_t sm_port; - ib_portid_t selfportid; - int selfport; - ib_rpc_t trap_rpc; - ib_mad_notice_attr_t notice; - - if (ib_resolve_self_via(&selfportid, &selfport, NULL, srcport)) - IBERROR("can't resolve self"); - - if (ib_resolve_smlid_via(&sm_port, 0, srcport)) - IBERROR("can't resolve SM destination port"); - - memset(&trap_rpc, 0, sizeof(trap_rpc)); - trap_rpc.mgtclass = IB_SMI_CLASS; - trap_rpc.method = IB_MAD_METHOD_TRAP; - trap_rpc.trid = mad_trid(); - trap_rpc.attr.id = NOTICE; - trap_rpc.datasz = IB_SMP_DATA_SIZE; - trap_rpc.dataoffs = IB_SMP_DATA_OFFS; - - memset(&notice, 0, sizeof(notice)); - notice.generic_type = 0x80 | IB_NOTICE_TYPE_INFO; - notice.g_or_v.generic.prod_type_lsb = cl_hton16(IB_NODE_TYPE_CA); - notice.g_or_v.generic.trap_num = cl_hton16(144); - notice.issuer_lid = cl_hton16((uint16_t) selfportid.lid); - notice.data_details.ntc_144.lid = cl_hton16((uint16_t) selfportid.lid); - notice.data_details.ntc_144.local_changes = + n->generic_type = 0x80 | IB_NOTICE_TYPE_INFO; + n->g_or_v.generic.prod_type_lsb = cl_hton16(IB_NODE_TYPE_CA); + n->g_or_v.generic.trap_num = cl_hton16(144); + n->issuer_lid = cl_hton16(lid); + n->data_details.ntc_144.lid = cl_hton16(lid); + n->data_details.ntc_144.local_changes = TRAP_144_MASK_OTHER_LOCAL_CHANGES; - notice.data_details.ntc_144.change_flgs = + n->data_details.ntc_144.change_flgs = TRAP_144_MASK_NODE_DESCRIPTION_CHANGE; +} - return (mad_send_via(&trap_rpc, &sm_port, NULL, &notice, srcport)); +static void build_trap129(ib_mad_notice_attr_t * n, uint16_t lid) +{ + n->generic_type = 0x80 | IB_NOTICE_TYPE_INFO; + n->g_or_v.generic.prod_type_lsb = cl_hton16(IB_NODE_TYPE_CA); + n->g_or_v.generic.trap_num = cl_hton16(129); + n->issuer_lid = cl_hton16(lid); + n->data_details.ntc_129_131.lid = 
cl_hton16(lid); + n->data_details.ntc_129_131.pad = 0; + n->data_details.ntc_129_131.port_num = error_port; } -static int send_129_local_link_integrity(void) +static int send_trap(const char *name, + void (*build) (ib_mad_notice_attr_t *, uint16_t)) { ib_portid_t sm_port; ib_portid_t selfportid; @@ -110,39 +100,31 @@ static int send_129_local_link_integrity(void) trap_rpc.dataoffs = IB_SMP_DATA_OFFS; memset(&notice, 0, sizeof(notice)); - notice.generic_type = 0x80 | IB_NOTICE_TYPE_INFO; - notice.g_or_v.generic.prod_type_lsb = cl_hton16(IB_NODE_TYPE_CA); - notice.g_or_v.generic.trap_num = cl_hton16(129); - notice.issuer_lid = cl_hton16((uint16_t) selfportid.lid); - notice.data_details.ntc_129_131.lid = cl_hton16((uint16_t) selfportid.lid); - notice.data_details.ntc_129_131.pad = 0; - notice.data_details.ntc_129_131.port_num = error_port; - - return (mad_send_via(&trap_rpc, &sm_port, NULL, &notice, srcport)); + build(&notice, selfportid.lid); + + return mad_send_via(&trap_rpc, &sm_port, NULL, &notice, srcport); } typedef struct _trap_def { char *trap_name; - int (*send_func) (void); + void (*build_func) (ib_mad_notice_attr_t *, uint16_t); } trap_def_t; trap_def_t traps[3] = { - {"node_desc_change", send_144_node_desc_update}, - {"local_link_integrity", send_129_local_link_integrity}, + {"node_desc_change", build_trap144}, + {"local_link_integrity", build_trap129}, {NULL, NULL} }; -int send_trap(char *trap_name) +int process_send_trap(char *trap_name) { int i; - for (i = 0; traps[i].trap_name; i++) { - if (strcmp(traps[i].trap_name, trap_name) == 0) { - return (traps[i].send_func()); - } - } + for (i = 0; traps[i].trap_name; i++) + if (strcmp(traps[i].trap_name, trap_name) == 0) + return send_trap(trap_name, traps[i].build_func); ibdiag_show_usage(); - return(1); + return 1; } int main(int argc, char **argv) @@ -169,15 +151,10 @@ int main(int argc, char **argv) argc -= optind; argv += optind; - if (!argv[0]) { - trap_name = traps[0].trap_name; - } else { - trap_name = argv[0]; - 
} + trap_name = argv[0] ? argv[0] : traps[0].trap_name; - if (argc > 1) { + if (argc > 1) error_port = atoi(argv[1]); - } madrpc_show_errors(1); @@ -185,7 +162,7 @@ int main(int argc, char **argv) if (!srcport) IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); - rc = send_trap(trap_name); + rc = process_send_trap(trap_name); mad_rpc_close_port(srcport); - return (rc); + return rc; } -- 1.6.1.2.319.gbd9e From hal.rosenstock at gmail.com Tue Apr 14 06:33:14 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 14 Apr 2009 09:33:14 -0400 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Re: [PATCH] opensm/osm_helper.c: Add more info for traps 144 and 256-259 in osm_dump_notice In-Reply-To: <20090414094716.GJ14033@sk> References: <20090318121807.GA5284@comcast.net> <20090414094716.GJ14033@sk> Message-ID: Hi Sasha, On Tue, Apr 14, 2009 at 5:47 AM, Sasha Khapyorsky wrote: >> +                     case 257: >> +                     case 258: >> +                             sprintf(buff, >> +                                     "\t\t\t\tlid1.....................%u\n" >> +                                     "\t\t\t\tlid2.....................%u\n" >> +                                     "\t\t\t\tkey......................0x%x\n" >> +                                     "\t\t\t\tsl.......................%d\n" >> +                                     "\t\t\t\tqp1......................0x%x\n" >> +                                     "\t\t\t\tqp2......................0x%x\n" >> +                                     "\t\t\t\tgid1.....................0x%016" PRIx64 " : " >> +                                     "0x%016" PRIx64 "\n" >> +                                     "\t\t\t\tgid2.....................0x%016" PRIx64 " : " >> +                                     "0x%016" PRIx64 "\n", > > Isn't IPv6 address format preferable for GIDs printing? Yes but it's consistent with other GID printing in this file so I didn't want this to stand out as different. 
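For reference, a minimal sketch of what IPv6-style GID printing could look like, assuming the GID is available as its 16 raw bytes in network order. gid_to_str() is a hypothetical helper made up for this example, not an existing OpenSM or libibmad function; inet_ntop() is the standard POSIX call.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Hypothetical helper (not part of OpenSM): render a 128-bit GID, given
 * as 16 raw bytes in network order, in IPv6 text notation.
 * out should hold at least INET6_ADDRSTRLEN bytes.
 * Returns out on success, NULL on failure (errno is set by inet_ntop).
 */
static const char *gid_to_str(const uint8_t gid[16], char *out, size_t len)
{
	return inet_ntop(AF_INET6, gid, out, (socklen_t)len);
}
```

With the default subnet prefix fe80:: and the port GUID 0x0002c903000292b0, this would print fe80::2:c903:2:92b0 instead of two separate 64-bit hex values. Whether that is worth the inconsistency with the rest of osm_helper.c is the open question here.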
-- Hal > Sasha From hnrose at comcast.net Tue Apr 14 06:42:49 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Tue, 14 Apr 2009 09:42:49 -0400 Subject: [ofa-general] ***SPAM*** [PATCHv2] opensm/osm_helper.c: Add more info for traps 144 and 256-259 in osm_dump_notice Message-ID: <20090414134249.GA27308@comcast.net> Signed-off-by: Hal Rosenstock --- Changes from v1: In trap 144, display local changes as %u rather than %d In trap 256, don't use additional buffer and eliminate buffer copying diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c index 3fa4ed7..4c497cd 100644 --- a/opensm/opensm/osm_helper.c +++ b/opensm/opensm/osm_helper.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -1687,6 +1688,7 @@ void osm_dump_notice(IN osm_log_t * p_log, { if (osm_log_is_active(p_log, log_level)) { if (ib_notice_is_generic(p_ntci)) { + int i, n; char buff[1024]; buff[0] = '\0'; @@ -1725,11 +1727,16 @@ void osm_dump_notice(IN osm_log_t * p_log, case 144: sprintf(buff, "\t\t\t\tlid......................%u\n" - "\t\t\t\tnew_cap_mask.............0x%08x\n", + "\t\t\t\tlocal_changes............%u\n" + "\t\t\t\tnew_cap_mask.............0x%08x\n" + "\t\t\t\tchange_flags.............0x%x\n", cl_ntoh16(p_ntci->data_details.ntc_144. lid), + p_ntci->data_details.ntc_144.local_changes, cl_ntoh32(p_ntci->data_details.ntc_144. - new_cap_mask)); + new_cap_mask), + cl_ntoh16(p_ntci->data_details.ntc_144. + change_flgs)); break; case 145: sprintf(buff, @@ -1741,6 +1748,93 @@ void osm_dump_notice(IN osm_log_t * p_log, cl_ntoh64(p_ntci->data_details.ntc_145. 
new_sys_guid)); break; + case 256: + n = sprintf(buff, + "\t\t\t\tlid......................%u\n" + "\t\t\t\tdrslid...................%u\n" + "\t\t\t\tmethod...................0x%x\n" + "\t\t\t\tattr_id..................0x%x\n" + "\t\t\t\tattr_mod.................0x%x\n" + "\t\t\t\tm_key....................0x%016" PRIx64 "\n" + "\t\t\t\tdr_notice................%d\n" + "\t\t\t\tdr_path_truncated........%d\n" + "\t\t\t\tdr_hop_count.............%u\n", + cl_ntoh16(p_ntci->data_details.ntc_256.lid), + cl_ntoh16(p_ntci->data_details.ntc_256.dr_slid), + p_ntci->data_details.ntc_256.method, + cl_ntoh16(p_ntci->data_details.ntc_256.attr_id), + cl_ntoh32(p_ntci->data_details.ntc_256.attr_mod), + cl_ntoh64(p_ntci->data_details.ntc_256.mkey), + p_ntci->data_details.ntc_256.dr_trunc_hop >> 7, + p_ntci->data_details.ntc_256.dr_trunc_hop >> 6, + p_ntci->data_details.ntc_256.dr_trunc_hop & 0x3f); + n += sprintf(buff + n, "Directed Path Dump of %u hop path:" + "\n\t\t\t\tPath = ", + p_ntci->data_details.ntc_256.dr_trunc_hop & 0x3f); + for (i = 0; + i <= (p_ntci->data_details.ntc_256.dr_trunc_hop & 0x3f); + i++) { + if (i == 0) + n += sprintf(buff + n, "%d", + p_ntci->data_details.ntc_256.dr_rtn_path[i]); + else + n += sprintf(buff + n, ",%d", + p_ntci->data_details.ntc_256.dr_rtn_path[i]); + } + break; + case 257: + case 258: + sprintf(buff, + "\t\t\t\tlid1.....................%u\n" + "\t\t\t\tlid2.....................%u\n" + "\t\t\t\tkey......................0x%x\n" + "\t\t\t\tsl.......................%d\n" + "\t\t\t\tqp1......................0x%x\n" + "\t\t\t\tqp2......................0x%x\n" + "\t\t\t\tgid1.....................0x%016" PRIx64 " : " + "0x%016" PRIx64 "\n" + "\t\t\t\tgid2.....................0x%016" PRIx64 " : " + "0x%016" PRIx64 "\n", + cl_ntoh16(p_ntci->data_details.ntc_257_258.lid1), + cl_ntoh16(p_ntci->data_details.ntc_257_258.lid2), + cl_ntoh32(p_ntci->data_details.ntc_257_258.key), + cl_ntoh32(p_ntci->data_details.ntc_257_258.qp1) >> 24, + 
cl_ntoh32(p_ntci->data_details.ntc_257_258.qp1) & 0xffffff, + cl_ntoh32(p_ntci->data_details.ntc_257_258.qp2), + cl_ntoh64(p_ntci->data_details.ntc_257_258.gid1.unicast.prefix), + cl_ntoh64(p_ntci->data_details.ntc_257_258.gid1.unicast.interface_id), + cl_ntoh64(p_ntci->data_details.ntc_257_258.gid2.unicast.prefix), + cl_ntoh64(p_ntci->data_details.ntc_257_258.gid2.unicast.interface_id)); + break; + case 259: + sprintf(buff, + "\t\t\t\tdata_valid...............0x%x\n" + "\t\t\t\tlid1.....................%u\n" + "\t\t\t\tlid2.....................%u\n" + "\t\t\t\tpkey.....................0x%x\n" + "\t\t\t\tsl.......................%d\n" + "\t\t\t\tqp1......................0x%x\n" + "\t\t\t\tqp2......................0x%x\n" + "\t\t\t\tgid1.....................0x%016" PRIx64 " : " + "0x%016" PRIx64 "\n" + "\t\t\t\tgid2.....................0x%016" PRIx64 " : " + "0x%016" PRIx64 "\n" + "\t\t\t\tsw_lid...................%u\n" + "\t\t\t\tport_no..................%u\n", + cl_ntoh16(p_ntci->data_details.ntc_259.data_valid), + cl_ntoh16(p_ntci->data_details.ntc_259.lid1), + cl_ntoh16(p_ntci->data_details.ntc_259.lid2), + cl_ntoh16(p_ntci->data_details.ntc_259.pkey), + cl_ntoh32(p_ntci->data_details.ntc_259.sl_qp1) >> 24, + cl_ntoh32(p_ntci->data_details.ntc_259.sl_qp1) & 0xffffff, + cl_ntoh32(p_ntci->data_details.ntc_259.qp2), + cl_ntoh64(p_ntci->data_details.ntc_259.gid1.unicast.prefix), + cl_ntoh64(p_ntci->data_details.ntc_259.gid1.unicast.interface_id), + cl_ntoh64(p_ntci->data_details.ntc_259.gid2.unicast.prefix), + cl_ntoh64(p_ntci->data_details.ntc_259.gid2.unicast.interface_id), + cl_ntoh16(p_ntci->data_details.ntc_259.sw_lid), + p_ntci->data_details.ntc_259.port_no); + break; } osm_log(p_log, log_level, From hnrose at comcast.net Tue Apr 14 06:54:19 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Tue, 14 Apr 2009 09:54:19 -0400 Subject: [ofa-general] ***SPAM*** [PATCH] libibmad: Add decode support for SwitchInfo OptimizedSLtoVLMappingProgramming Message-ID: 
<20090414135419.GA27549@comcast.net> Signed-off-by: Hal Rosenstock --- Resending due to fat finguring list address diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index b8290a7..2a1fa30 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -357,6 +357,7 @@ enum MAD_FIELDS { IB_SW_DEF_MCAST_NOT_PRIM_F, IB_SW_LIFE_TIME_F, IB_SW_STATE_CHANGE_F, + IB_SW_OPT_SLTOVL_MAPPING_F, IB_SW_LIDS_PER_PORT_F, IB_SW_PARTITION_ENFORCE_CAP_F, IB_SW_PARTITION_ENF_INB_F, diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c index df43ceb..5b0639c 100644 --- a/libibmad/src/fields.c +++ b/libibmad/src/fields.c @@ -198,6 +198,7 @@ static const ib_field_t ib_mad_f[] = { {BITSOFFS(80, 8), "DefMcastNotPrimPort", mad_dump_uint}, {BITSOFFS(88, 5), "LifeTime", mad_dump_uint}, {BITSOFFS(93, 1), "StateChange", mad_dump_uint}, + {BITSOFFS(94, 2), "OptSLtoVLMapping", mad_dump_uint}, {BITSOFFS(96, 16), "LidsPerPort", mad_dump_uint}, {BITSOFFS(112, 16), "PartEnforceCap", mad_dump_uint}, {BITSOFFS(128, 1), "InboundPartEnf", mad_dump_uint}, From sashak at voltaire.com Tue Apr 14 06:56:00 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Apr 2009 16:56:00 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCHv3] libibmad/rpc.c: Handle redirection status In-Reply-To: <20090311113059.GB12004@comcast.net> References: <20090311113059.GB12004@comcast.net> Message-ID: <20090414135600.GD5519@sk> On 06:30 Wed 11 Mar , Hal Rosenstock wrote: > > Also, in mad_rpc, status should be based on management class > > Signed-off-by: Hal Rosenstock > --- > Changes since v2: > Made ERRS into macro > > Changes since v1: > Always get 16 bits of MAD status and mask when DR SMP > Remove some extraneous braces > > diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c > index 8c68cf9..e9a1c06 100644 > --- a/libibmad/src/rpc.c > +++ b/libibmad/src/rpc.c > @@ -1,5 +1,6 @@ > /* > * Copyright (c) 2004-2006 Voltaire Inc. All rights reserved. 
> + * Copyright (c) 2009 HNR Consulting. All rights reserved. > * > * This software is available to you under a choice of one of two > * licenses. You may choose to be licensed under the terms of the GNU > @@ -59,7 +60,10 @@ static int save_mad_len = 256; > > #undef DEBUG > #define DEBUG if (ibdebug) IBWARN > -#define ERRS if (iberrs || ibdebug) IBWARN > +#define ERRS(fmt, ...) do { \ > + if (iberrs || ibdebug) \ > + IBWARN(fmt, ## __VA_ARGS__); \ > +} while (0) > > #define MAD_TID(mad) (*((uint64_t *)((char *)(mad) + 8))) > > @@ -128,9 +132,8 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, > (uint32_t) mad_get_field64(umad_get_mad(sndbuf), 0, IB_MAD_TRID_F); > > for (retries = 0; retries < madrpc_retries; retries++) { > - if (retries) { > + if (retries) > ERRS("retry %d (timeout %d ms)", retries, timeout); > - } > > length = len; > if (umad_send(port_id, agentid, sndbuf, length, timeout, 0) < 0) { I'm taking this macro wrapping code as separate patch - this seems unrelated to the rest. > @@ -170,7 +173,7 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport > void *payload, void *rcvdata) > { > int status, len; > - uint8_t sndbuf[1024], rcvbuf[1024], *mad; > + uint8_t sndbuf[1024], rcvbuf[1024], *mad, mgmtclass; > > len = 0; > memset(sndbuf, 0, umad_size() + IB_MAD_SIZE); > @@ -187,7 +190,18 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport > > mad = umad_get_mad(rcvbuf); > > - if ((status = mad_get_field(mad, 0, IB_DRSMP_STATUS_F)) != 0) { Do you know any example where this will not work properly? I don't. This code assumed as fast execution path, so adding not practically needed flows doesn't seem like a good idea for me. 
> + status = mad_get_field(mad, 0, IB_MAD_STATUS_F); > + mgmtclass = mad_get_field(mad, 0, IB_MAD_MGMTCLASS_F); > + if (mgmtclass == IB_SMI_DIRECT_CLASS) > + status &= 0x7fff; > + else if (mgmtclass != IB_SMI_CLASS) { > + if (status & 2) { > + ERRS("MAD redirection not supported; dport (%s)", > + portid2str(dport)); > + return 0; > + } > + } > + if (status) { > ERRS("MAD completed with error status 0x%x; dport (%s)", > status, portid2str(dport)); > return 0; > @@ -227,8 +241,12 @@ void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * > mad = umad_get_mad(rcvbuf); > > if ((status = mad_get_field(mad, 0, IB_MAD_STATUS_F)) != 0) { > - ERRS("MAD completed with error status 0x%x; dport (%s)", > - status, portid2str(dport)); > + if (status & 2) > + ERRS("MAD redirection not supported; dport (%s)", > + portid2str(dport)); > + else > + ERRS("MAD completed with error status 0x%x; dport (%s)", > + status, portid2str(dport)); The error status is printed originally, isn't it not sufficient? If not what about adding generic function which prints know error codes as string instead of handling some values separately? Sasha From sashak at voltaire.com Tue Apr 14 07:09:31 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Apr 2009 17:09:31 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCHv2] opensm/osm_helper.c: Add more info for traps 144 and 256-259 in osm_dump_notice In-Reply-To: <20090414134249.GA27308@comcast.net> References: <20090414134249.GA27308@comcast.net> Message-ID: <20090414140931.GE5519@sk> On 09:42 Tue 14 Apr , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock > > --- > Changes from v1: > In trap 144, display local changes as %u rather than %d > In trap 256, don't use additional buffer and eliminate buffer copying > > diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c [snip...] 
> + n += sprintf(buff + n, "Directed Path Dump of %u hop path:" > + "\n\t\t\t\tPath = ", > + p_ntci->data_details.ntc_256.dr_trunc_hop & 0x3f); > + for (i = 0; > + i <= (p_ntci->data_details.ntc_256.dr_trunc_hop & 0x3f); > + i++) { > + if (i == 0) > + sprintf(buff + n, "%d", > + p_ntci->data_details.ntc_256.dr_rtn_path[i]); > + else > + sprintf(buff + n, ",%d", > + p_ntci->data_details.ntc_256.dr_rtn_path[i]); Should be n += sprintf() in both cases. Also I would recommend to use snprintf() instead of sprintf() to prevent possible buf overflow. Sasha From hal.rosenstock at gmail.com Tue Apr 14 07:12:47 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 14 Apr 2009 10:12:47 -0400 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Re: [PATCHv3] libibmad/rpc.c: Handle redirection status In-Reply-To: <20090414135600.GD5519@sk> References: <20090311113059.GB12004@comcast.net> <20090414135600.GD5519@sk> Message-ID: On Tue, Apr 14, 2009 at 9:56 AM, Sasha Khapyorsky wrote: > On 06:30 Wed 11 Mar     , Hal Rosenstock wrote: >> >> Also, in mad_rpc, status should be based on management class >> >> Signed-off-by: Hal Rosenstock >> --- >> Changes since v2: >> Made ERRS into macro >> >> Changes since v1: >> Always get 16 bits of MAD status and mask when DR SMP >> Remove some extraneous braces >> >> diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c >> index 8c68cf9..e9a1c06 100644 >> --- a/libibmad/src/rpc.c >> +++ b/libibmad/src/rpc.c >> @@ -1,5 +1,6 @@ >>  /* >>   * Copyright (c) 2004-2006 Voltaire Inc.  All rights reserved. >> + * Copyright (c) 2009 HNR Consulting.  All rights reserved. >>   * >>   * This software is available to you under a choice of one of two >>   * licenses.  You may choose to be licensed under the terms of the GNU >> @@ -59,7 +60,10 @@ static int save_mad_len = 256; >> >>  #undef DEBUG >>  #define DEBUG        if (ibdebug)    IBWARN >> -#define ERRS if (iberrs || ibdebug)  IBWARN >> +#define ERRS(fmt, ...) 
do {  \ >> +     if (iberrs || ibdebug)  \ >> +             IBWARN(fmt, ## __VA_ARGS__); \ >> +} while (0) >> >>  #define MAD_TID(mad) (*((uint64_t *)((char *)(mad) + 8))) >> >> @@ -128,9 +132,8 @@ _do_madrpc(int port_id, void *sndbuf, void *rcvbuf, int agentid, int len, >>           (uint32_t) mad_get_field64(umad_get_mad(sndbuf), 0, IB_MAD_TRID_F); >> >>       for (retries = 0; retries < madrpc_retries; retries++) { >> -             if (retries) { >> +             if (retries) >>                       ERRS("retry %d (timeout %d ms)", retries, timeout); >> -             } >> >>               length = len; >>               if (umad_send(port_id, agentid, sndbuf, length, timeout, 0) < 0) { > > I'm taking this macro wrapping code as separate patch - this seems > unrelated to the rest. > >> @@ -170,7 +173,7 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport >>             void *payload, void *rcvdata) >>  { >>       int status, len; >> -     uint8_t sndbuf[1024], rcvbuf[1024], *mad; >> +     uint8_t sndbuf[1024], rcvbuf[1024], *mad, mgmtclass; >> >>       len = 0; >>       memset(sndbuf, 0, umad_size() + IB_MAD_SIZE); >> @@ -187,7 +190,18 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport >> >>       mad = umad_get_mad(rcvbuf); >> >> -     if ((status = mad_get_field(mad, 0, IB_DRSMP_STATUS_F)) != 0) { > > Do you know any example where this will not work properly? I don't. Not currently but it's a spec compliance issue. > This code assumed as fast execution path, so adding not practically needed > flows doesn't seem like a good idea for me. 
> >> +     status = mad_get_field(mad, 0, IB_MAD_STATUS_F); >> +     mgmtclass = mad_get_field(mad, 0, IB_MAD_MGMTCLASS_F); >> +     if (mgmtclass == IB_SMI_DIRECT_CLASS) >> +             status &= 0x7fff; >> +     else if (mgmtclass != IB_SMI_CLASS) { >> +             if (status & 2) { >> +                     ERRS("MAD redirection not supported; dport (%s)", >> +                          portid2str(dport)); >> +                     return 0; >> +             } >> +     } >> +     if (status) { >>               ERRS("MAD completed with error status 0x%x; dport (%s)", >>                    status, portid2str(dport)); >>               return 0; >> @@ -227,8 +241,12 @@ void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * >>       mad = umad_get_mad(rcvbuf); >> >>       if ((status = mad_get_field(mad, 0, IB_MAD_STATUS_F)) != 0) { >> -             ERRS("MAD completed with error status 0x%x; dport (%s)", >> -                  status, portid2str(dport)); >> +             if (status & 2) >> +                     ERRS("MAD redirection not supported; dport (%s)", >> +                          portid2str(dport)); >> +             else >> +                     ERRS("MAD completed with error status 0x%x; dport (%s)", >> +                          status, portid2str(dport)); > > The error status is printed originally, isn't it not sufficient? IMO it's not sufficient as redirection is not an error. > If not > what about adding generic function which prints know error codes as > string instead of handling some values separately? That's doable but a separate proposition IMO. 
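For illustration, a generic status-to-string helper of the kind suggested above might look like the sketch below. It is not libibmad code: mad_status_str() and mad_invalid_field_str() are hypothetical names, and the bit layout used here (bit 0 busy, bit 1 redirection required, bits 2-4 invalid-field code) is my reading of the common MAD status field in the InfiniBand specification and should be checked against it before use.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not part of libibmad: name the invalid-field
 * code carried in bits 2-4 of the common MAD status. */
static const char *mad_invalid_field_str(unsigned code)
{
	switch (code) {
	case 0: return "no invalid fields";
	case 1: return "bad version";
	case 2: return "method not supported";
	case 3: return "method/attribute combination not supported";
	case 7: return "invalid attribute or modifier value";
	default: return "reserved";
	}
}

/* Hypothetical helper: render the low status bits as text in buf.
 * Returns what snprintf() returns (the length that was needed). */
static int mad_status_str(uint16_t status, char *buf, size_t len)
{
	return snprintf(buf, len, "0x%x (%s%s%s)", status,
			(status & 0x1) ? "busy, " : "",
			(status & 0x2) ? "redirection required, " : "",
			mad_invalid_field_str((status >> 2) & 0x7));
}
```

A caller could then pass the resulting string to a single ERRS() call rather than special-casing individual status values.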
-- Hal > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From hal.rosenstock at gmail.com Tue Apr 14 07:16:24 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 14 Apr 2009 10:16:24 -0400 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Re: [PATCHv2] opensm/osm_helper.c: Add more info for traps 144 and 256-259 in osm_dump_notice In-Reply-To: <20090414140931.GE5519@sk> References: <20090414134249.GA27308@comcast.net> <20090414140931.GE5519@sk> Message-ID: On Tue, Apr 14, 2009 at 10:09 AM, Sasha Khapyorsky wrote: > On 09:42 Tue 14 Apr     , Hal Rosenstock wrote: >> >> Signed-off-by: Hal Rosenstock >> >> --- >> Changes from v1: >> In trap 144, display local changes as %u rather than %d >> In trap 256, don't use additional buffer and eliminate buffer copying >> >> diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c > > [snip...] 
> >> +                             n += sprintf(buff + n, "Directed Path Dump of %u hop path:" >> +                                     "\n\t\t\t\tPath = ", >> +                                     p_ntci->data_details.ntc_256.dr_trunc_hop & 0x3f); >> +                             for (i = 0; >> +                                  i <= (p_ntci->data_details.ntc_256.dr_trunc_hop & 0x3f); >> +                                  i++) { >> +                                     if (i == 0) >> +                                             sprintf(buff + n, "%d", >> +                                                     p_ntci->data_details.ntc_256.dr_rtn_path[i]); >> +                                     else >> +                                             sprintf(buff + n, ",%d", >> +                                                     p_ntci->data_details.ntc_256.dr_rtn_path[i]); > > Should be n += sprintf() in both cases. Right. > Also I would recommend to use > snprintf() instead of sprintf() to prevent possible buf overflow. It doesn't overflow but I can change it. Also, there are many other places even in this file where this is not done. 
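One detail worth keeping in mind when moving from sprintf() to snprintf(): snprintf() returns the length the output would have had, not the length actually stored, so a plain "n += snprintf(buff + n, sizeof(buff) - n, ...)" can push n past the end of the buffer once truncation happens, and the next size argument then wraps around. A minimal sketch of a bounded append that clamps the offset follows. buf_appendf() and format_path() are hypothetical names for illustration, not OpenSM functions.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper (not in OpenSM): append formatted text at offset n
 * in buf[size] and return the new offset. On truncation the offset is
 * clamped to size - 1, i.e. the position of the terminating NUL. */
static size_t buf_appendf(char *buf, size_t size, size_t n, const char *fmt, ...)
{
	va_list ap;
	int ret;

	if (size == 0 || n >= size - 1)
		return n;	/* buffer already full: nothing to append */

	va_start(ap, fmt);
	ret = vsnprintf(buf + n, size - n, fmt, ap);
	va_end(ap);

	if (ret < 0) {
		buf[n] = '\0';	/* formatting error: keep the old contents */
		return n;
	}
	if ((size_t)ret >= size - n)
		return size - 1;	/* output was truncated */
	return n + (size_t)ret;
}

/* Format path[0..hops] as "a,b,c\n", in the style of the trap 256 dump. */
static size_t format_path(char *buf, size_t size,
			  const unsigned char *path, unsigned hops)
{
	size_t n = 0;
	unsigned i;

	for (i = 0; i <= hops; i++)
		n = buf_appendf(buf, size, n, i ? ",%d" : "%d", path[i]);
	return buf_appendf(buf, size, n, "\n");
}
```

With this shape the caller never has to reason about the worst-case length of the dump, even if the hop count in a received notice is bogus.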
-- Hal > Sasha From swise at opengridcomputing.com Tue Apr 14 07:20:12 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 14 Apr 2009 09:20:12 -0500 Subject: [ofa-general] ofed autoconf.h In-Reply-To: <20090408175934.GB9167@obsidianresearch.com> References: <49DBC4A8.8020708@opengridcomputing.com> <1239200023.1541.242.camel@pc.interlinx.bc.ca> <49DCB44A.7020200@opengridcomputing.com> <1239205922.1541.296.camel@pc.interlinx.bc.ca> <20090408175934.GB9167@obsidianresearch.com> Message-ID: <49E49B9C.9030603@opengridcomputing.com> Attached is my final solution that resolves 1538 and 1578. With this patch, the generated autoconf.h will have slightly different content based on whether __OFED__BUILD__ is defined. I've added this define to the ofed makefile so it is defined when building the ofed kernel internally. External modules including the ofed autoconf.h must _not_ define __OFED_BUILD__. Semantically, here are the differences: With __OFED_BUILD__ defined: - #undef all ofed CONFIG defines - #define only the ofed CONFIG defines enabled by the ofa kernel tree configuration Without __OFED_BUILD__ defined: - #include_next first to get the backing kernel CONFIG defines - #define only the CONFIG defines enabled by the ofa kernel tree configuration Comments? If nobody barks, I'll push this up to my git tree for Vlad to pull tomorrow. Steve. -------------- next part -------------- A non-text attachment was scrubbed...
Name: autoconf_fix.patch Type: text/x-diff Size: 16357 bytes Desc: not available URL: From sashak at voltaire.com Tue Apr 14 07:20:02 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Apr 2009 17:20:02 +0300 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Re: [PATCHv3] libibmad/rpc.c: Handle redirection status In-Reply-To: References: <20090311113059.GB12004@comcast.net> <20090414135600.GD5519@sk> Message-ID: <20090414142002.GF5519@sk> On 10:12 Tue 14 Apr , Hal Rosenstock wrote: > > > > Do you know any example where this will not work properly? I don't. > > Not currently but it's a spec compliance issue. I don't think we want to slow down fast path without a real needs. > > The error status is printed originally, isn't it not sufficient? > > IMO it's not sufficient as redirection is not an error. It is handled as error in this patch, the only message is different. > > If not > > what about adding generic function which prints know error codes as > > string instead of handling some values separately? > > That's doable but a separate proposition IMO. This is better than doing duplicated status parsing flows. Sasha From hal.rosenstock at gmail.com Tue Apr 14 07:27:04 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 14 Apr 2009 10:27:04 -0400 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Re: [PATCHv3] libibmad/rpc.c: Handle redirection status In-Reply-To: <20090414142002.GF5519@sk> References: <20090311113059.GB12004@comcast.net> <20090414135600.GD5519@sk> <20090414142002.GF5519@sk> Message-ID: On Tue, Apr 14, 2009 at 10:20 AM, Sasha Khapyorsky wrote: > On 10:12 Tue 14 Apr     , Hal Rosenstock wrote: >> > >> > Do you know any example where this will not work properly? I don't. >> >> Not currently but it's a spec compliance issue. > > I don't think we want to slow down fast path without a real needs. Isn't compliance a real need ? >> > The error status is printed originally, isn't it not sufficient? 
>> >> IMO it's not sufficient as redirection is not an error. > > It is handled as error in this patch, the only message is different. > >> > If not >> > what about adding generic function which prints know error codes as >> > string instead of handling some values separately? >> >> That's doable but a separate proposition IMO. > > This is better than doing duplicated status parsing flows. Sure but still separate IMO. -- Hal > Sasha From sashak at voltaire.com Tue Apr 14 07:30:23 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Apr 2009 17:30:23 +0300 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Re: [PATCHv2] opensm/osm_helper.c: Add more info for traps 144 and 256-259 in osm_dump_notice In-Reply-To: References: <20090414134249.GA27308@comcast.net> <20090414140931.GE5519@sk> Message-ID: <20090414143023.GG5519@sk> On 10:16 Tue 14 Apr , Hal Rosenstock wrote: > On Tue, Apr 14, 2009 at 10:09 AM, Sasha Khapyorsky wrote: > > On 09:42 Tue 14 Apr , Hal Rosenstock wrote: > >> > >> Signed-off-by: Hal Rosenstock > >> > >> --- > >> Changes from v1: > >> In trap 144, display local changes as %u rather than %d > >> In trap 256, don't use additional buffer and eliminate buffer copying > >> > >> diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c > > > > [snip...] > > > >> + n += sprintf(buff + n, "Directed Path Dump of %u hop path:" > >> + "\n\t\t\t\tPath = ", > >> + p_ntci->data_details.ntc_256.dr_trunc_hop & 0x3f); > >> + for (i = 0; > >> + i <= (p_ntci->data_details.ntc_256.dr_trunc_hop & 0x3f); > >> + i++) { > >> + if (i == 0) > >> +
sprintf(buff + n, "%d", > >> + p_ntci->data_details.ntc_256.dr_rtn_path[i]); > >> + else > >> + sprintf(buff + n, ",%d", > >> + p_ntci->data_details.ntc_256.dr_rtn_path[i]); > > > > Should be n += sprintf() in both cases. > > Right. And also "\n" at the end. > > Also I would recommend to use > > snprintf() instead of sprintf() to prevent possible buf overflow. > > It doesn't overflow but I can change it. If you know for sure. I will need to calculate all lengths including cases when broken data is received - better to not think about this and to not rely on hardcoded buffer size. > Also, there are many other > places even in this file where this is not done. This is likely true, but why to add new potentially buggy things. Sasha From sashak at voltaire.com Tue Apr 14 07:32:54 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Apr 2009 17:32:54 +0300 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Re: [PATCHv3] libibmad/rpc.c: Handle redirection status In-Reply-To: References: <20090311113059.GB12004@comcast.net> <20090414135600.GD5519@sk> <20090414142002.GF5519@sk> Message-ID: <20090414143254.GH5519@sk> On 10:27 Tue 14 Apr , Hal Rosenstock wrote: > On Tue, Apr 14, 2009 at 10:20 AM, Sasha Khapyorsky wrote: > > On 10:12 Tue 14 Apr , Hal Rosenstock wrote: > >> > > >> > Do you know any example where this will not work properly? I don't. > >> > >> Not currently but it's a spec compliance issue. > > > > I don't think we want to slow down fast path without a real needs. > > Isn't compliance a real need ? It is compliant now - we agreed there are no cases yet when it is not.
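To make the compliance point concrete, here is a rough sketch of the class-dependent status handling being debated. It assumes the directed-route SMP layout in which the top bit of the 16-bit status word is the direction bit rather than status, and it treats redirection as meaningful only outside the two subnet-management classes. mad_classify_status() and MAD_STATUS_REDIRECT are hypothetical names; the class values are meant to mirror IB_SMI_CLASS and IB_SMI_DIRECT_CLASS in libibmad's mad.h and should be treated as assumptions here.

```c
#include <stdint.h>

/* Assumed to match the libibmad definitions of the same names. */
#define IB_SMI_CLASS        0x01	/* LID-routed subnet management */
#define IB_SMI_DIRECT_CLASS 0x81	/* directed-route subnet management */

#define MAD_STATUS_REDIRECT 0x0002	/* hypothetical name for status bit 1 */

/* Hypothetical helper: 0 = success, 1 = redirection requested, -1 = error. */
static int mad_classify_status(uint8_t mgmtclass, uint16_t status)
{
	if (mgmtclass == IB_SMI_DIRECT_CLASS)
		status &= 0x7fff;	/* drop the direction bit */
	else if (mgmtclass != IB_SMI_CLASS && (status & MAD_STATUS_REDIRECT))
		return 1;
	return status ? -1 : 0;
}
```

Returning a distinct value for redirection lets the caller decide whether to report it as unsupported or handle it, instead of folding it into the generic error path.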
Sasha From hnrose at comcast.net Tue Apr 14 07:45:47 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Tue, 14 Apr 2009 10:45:47 -0400 Subject: [ofa-general] ***SPAM*** [PATCHv3] opensm/osm_helper.c: Add more info for traps 144 and 256-259 in osm_dump_notice Message-ID: <20090414144547.GA32013@comcast.net> Signed-off-by: Hal Rosenstock --- Changes from v2: In trap 256, use snprintf to preclude buffer overflow Also added new line at end Changes from v1: In trap 144, display local changes as %u rather than %d In trap 256, don't use additional buffer and eliminate buffer copying diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c index 3fa4ed7..b0faf26 100644 --- a/opensm/opensm/osm_helper.c +++ b/opensm/opensm/osm_helper.c @@ -2,6 +2,7 @@ * Copyright (c) 2004-2008 Voltaire, Inc. All rights reserved. * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved. * Copyright (c) 1996-2003 Intel Corporation. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -1687,6 +1688,7 @@ void osm_dump_notice(IN osm_log_t * p_log, { if (osm_log_is_active(p_log, log_level)) { if (ib_notice_is_generic(p_ntci)) { + int i, n; char buff[1024]; buff[0] = '\0'; @@ -1725,11 +1727,16 @@ void osm_dump_notice(IN osm_log_t * p_log, case 144: sprintf(buff, "\t\t\t\tlid......................%u\n" - "\t\t\t\tnew_cap_mask.............0x%08x\n", + "\t\t\t\tlocal_changes............%u\n" + "\t\t\t\tnew_cap_mask.............0x%08x\n" + "\t\t\t\tchange_flags.............0x%x\n", cl_ntoh16(p_ntci->data_details.ntc_144. lid), + p_ntci->data_details.ntc_144.local_changes, cl_ntoh32(p_ntci->data_details.ntc_144. - new_cap_mask)); + new_cap_mask), + cl_ntoh16(p_ntci->data_details.ntc_144. 
+ change_flgs)); break; case 145: sprintf(buff, @@ -1741,6 +1748,95 @@ void osm_dump_notice(IN osm_log_t * p_log, cl_ntoh64(p_ntci->data_details.ntc_145. new_sys_guid)); break; + case 256: + n = sprintf(buff, + "\t\t\t\tlid......................%u\n" + "\t\t\t\tdrslid...................%u\n" + "\t\t\t\tmethod...................0x%x\n" + "\t\t\t\tattr_id..................0x%x\n" + "\t\t\t\tattr_mod.................0x%x\n" + "\t\t\t\tm_key....................0x%016" PRIx64 "\n" + "\t\t\t\tdr_notice................%d\n" + "\t\t\t\tdr_path_truncated........%d\n" + "\t\t\t\tdr_hop_count.............%u\n", + cl_ntoh16(p_ntci->data_details.ntc_256.lid), + cl_ntoh16(p_ntci->data_details.ntc_256.dr_slid), + p_ntci->data_details.ntc_256.method, + cl_ntoh16(p_ntci->data_details.ntc_256.attr_id), + cl_ntoh32(p_ntci->data_details.ntc_256.attr_mod), + cl_ntoh64(p_ntci->data_details.ntc_256.mkey), + p_ntci->data_details.ntc_256.dr_trunc_hop >> 7, + p_ntci->data_details.ntc_256.dr_trunc_hop >> 6, + p_ntci->data_details.ntc_256.dr_trunc_hop & 0x3f); + n += snprintf(buff + n, sizeof(buff) - n, + "Directed Path Dump of %u hop path:" + "\n\t\t\t\tPath = ", + p_ntci->data_details.ntc_256.dr_trunc_hop & 0x3f); + for (i = 0; + i <= (p_ntci->data_details.ntc_256.dr_trunc_hop & 0x3f); + i++) { + if (i == 0) + n += snprintf(buff + n, sizeof(buff) - n, "%d", + p_ntci->data_details.ntc_256.dr_rtn_path[i]); + else + n += snprintf(buff + n, sizeof(buff) - n, ",%d", + p_ntci->data_details.ntc_256.dr_rtn_path[i]); + } + snprintf(buff + n, sizeof(buff) - n, "\n"); + break; + case 257: + case 258: + sprintf(buff, + "\t\t\t\tlid1.....................%u\n" + "\t\t\t\tlid2.....................%u\n" + "\t\t\t\tkey......................0x%x\n" + "\t\t\t\tsl.......................%d\n" + "\t\t\t\tqp1......................0x%x\n" + "\t\t\t\tqp2......................0x%x\n" + "\t\t\t\tgid1.....................0x%016" PRIx64 " : " + "0x%016" PRIx64 "\n" + "\t\t\t\tgid2.....................0x%016" PRIx64 " : 
" + "0x%016" PRIx64 "\n", + cl_ntoh16(p_ntci->data_details.ntc_257_258.lid1), + cl_ntoh16(p_ntci->data_details.ntc_257_258.lid2), + cl_ntoh32(p_ntci->data_details.ntc_257_258.key), + cl_ntoh32(p_ntci->data_details.ntc_257_258.qp1) >> 24, + cl_ntoh32(p_ntci->data_details.ntc_257_258.qp1) & 0xffffff, + cl_ntoh32(p_ntci->data_details.ntc_257_258.qp2), + cl_ntoh64(p_ntci->data_details.ntc_257_258.gid1.unicast.prefix), + cl_ntoh64(p_ntci->data_details.ntc_257_258.gid1.unicast.interface_id), + cl_ntoh64(p_ntci->data_details.ntc_257_258.gid2.unicast.prefix), + cl_ntoh64(p_ntci->data_details.ntc_257_258.gid2.unicast.interface_id)); + break; + case 259: + sprintf(buff, + "\t\t\t\tdata_valid...............0x%x\n" + "\t\t\t\tlid1.....................%u\n" + "\t\t\t\tlid2.....................%u\n" + "\t\t\t\tpkey.....................0x%x\n" + "\t\t\t\tsl.......................%d\n" + "\t\t\t\tqp1......................0x%x\n" + "\t\t\t\tqp2......................0x%x\n" + "\t\t\t\tgid1.....................0x%016" PRIx64 " : " + "0x%016" PRIx64 "\n" + "\t\t\t\tgid2.....................0x%016" PRIx64 " : " + "0x%016" PRIx64 "\n" + "\t\t\t\tsw_lid...................%u\n" + "\t\t\t\tport_no..................%u\n", + cl_ntoh16(p_ntci->data_details.ntc_259.data_valid), + cl_ntoh16(p_ntci->data_details.ntc_259.lid1), + cl_ntoh16(p_ntci->data_details.ntc_259.lid2), + cl_ntoh16(p_ntci->data_details.ntc_259.pkey), + cl_ntoh32(p_ntci->data_details.ntc_259.sl_qp1) >> 24, + cl_ntoh32(p_ntci->data_details.ntc_259.sl_qp1) & 0xffffff, + cl_ntoh32(p_ntci->data_details.ntc_259.qp2), + cl_ntoh64(p_ntci->data_details.ntc_259.gid1.unicast.prefix), + cl_ntoh64(p_ntci->data_details.ntc_259.gid1.unicast.interface_id), + cl_ntoh64(p_ntci->data_details.ntc_259.gid2.unicast.prefix), + cl_ntoh64(p_ntci->data_details.ntc_259.gid2.unicast.interface_id), + cl_ntoh16(p_ntci->data_details.ntc_259.sw_lid), + p_ntci->data_details.ntc_259.port_no); + break; } osm_log(p_log, log_level, From weiny2 at llnl.gov Tue 
Apr 14 09:12:23 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 14 Apr 2009 09:12:23 -0700 Subject: ***SPAM*** Re: [ofa-general] mlx4: errors and failures on OOM In-Reply-To: References: <200904112233.51105.bs_lists@aakef.fastmail.fm> Message-ID: <20090414091223.c7911402.weiny2@llnl.gov> On Mon, 13 Apr 2009 07:40:33 -0400 Hal Rosenstock wrote: > On Sat, Apr 11, 2009 at 4:33 PM, Bernd Schubert > wrote: > > Hello, > > > > last week I had issues with Lustre failures, which turned out to be > > failures of many clients, which run into out-of-memory due to bad user space jobs > > (and no protection again that by the queuing system). > > > > Anyway, I don't think IB is supposed to fail, when the oom killer activates. > > > > Errors for 0x001b0d0000008ede "Cisco Switch" > >   5: [XmtDiscards == 270] > >         Link info:     38    5[  ]  ==( 4X 5.0 Gbps)==>  0x00188b9097fe2a81    1[  ] "eul0605 HCA-1" > >   16: [XmtDiscards == 132] > >         Link info:     38   16[  ]  ==( 4X 5.0 Gbps)==>  0x00188b9097fe2a01    1[  ] "eul0616 HCA-1" > > > > I used a script to monitor the fabric for failures every 5 min and just when the oom > > killer activated on the clients the messages above came up. > > XmtDiscards are the total number of outbound packets discarded by the port > because the port is down or congested. Reasons for this include: > • Output port is not in the active state > • Packet length exceeded NeighborMTU > • Switch Lifetime Limit exceeded > • Switch HOQ Lifetime Limit exceeded > This may also include packets discarded while in VLStalled State. For what you are describing this is "normal". "Normal" in the sense that the HCA is no longer accepting inbound packets and the switch discards them. > > > Below are syslogs from one of these clients > > > > Apr  4 08:50:38 eul0605 kernel: Lustre: Request x50173 sent from MGC172.17.31.247 at o2ib to NID 172.17.31.247 at o2ib 51s ago has timed out (limit > > 300s). 
> > Apr  4 08:50:38 eul0605 kernel: Lustre: Skipped 30 previous similar messages > > Apr  4 08:50:38 eul0605 kernel: LustreError: 166-1: MGC172.17.31.247 at o2ib: Connection to service MGS via nid 172.17.31.247 at o2ib was lost; in > > progress operations using this service will fail. > > Apr  4 08:50:38 eul0605 kernel: Lustre: home1-MDT0000-mdc-0000010430fa0800: Connection to service home1-MDT0000 via nid 172.17.31.247 at o2ib was > > lost; in progress operations using this service will wait for recovery to complete. > > Apr  4 08:50:38 eul0605 kernel: Lustre: Skipped 7 previous similar messages > > Apr  4 08:50:38 eul0605 kernel: Lustre: tmp-OST0003-osc-0000010423750000: Connection to service tmp-OST0003 via nid 172.17.31.231 at o2ib was lost; in > > progress operations using this service will wait for recovery to complete. > > Apr  4 08:50:38 eul0605 kernel: Lustre: Skipped 29 previous similar messages > > Apr  4 08:50:38 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  req at 000001041bcbb800 x50205/t0 > > o250->MGS at 172.17.31.247@o2ib:26/25 lens 304/456 e 0 to 1 dl 1238828031 ref 2 fl Rpc:N/0/0 rc 0/0 > > Apr  4 08:50:38 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) Skipped 31 previous similar messages > > Apr  4 08:50:38 eul0605 kernel: Lustre: Request x50205 sent from MGC172.17.31.247 at o2ib to NID 172.17.31.247 at o2ib 51s ago has timed out (limit > > 300s). > > > > ===> So somehow lustre lost the network connection. On the server side the > > logs simply show this node didn't answer to pings anymore. 
> > > > > > Apr  4 08:52:58 eul0605 kernel: Lustre: Skipped 31 previous similar messages > > Apr  4 08:52:59 eul0605 kernel: Lustre: Changing connection for MGC172.17.31.247 at o2ib to MGC172.17.31.247 at o2ib_1/172.17.31.246 at o2ib > > Apr  4 08:52:59 eul0605 kernel: Lustre: Skipped 61 previous similar messages > > Apr  4 08:53:00 eul0605 kernel: oom-killer: gfp_mask=0xd2 > > > > [...] > > > > Apr  4 08:53:05 eul0605 kernel: Out of Memory: Killed process 10612 (gamos). > > Apr  4 08:53:10 eul0605 kernel: 3212 pages swap cached > > Apr  4 08:53:10 eul0605 kernel: Out of Memory: Killed process 10292 (tcsh). > > > > ===> And here we see, gamos consumed all memory again. > > > > Apr  4 08:53:10 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  req at 0000010430f8f800 x50237/t0 > > o250->MGS at MGC172.17.31.247@o2ib_1:26/25 lens 304/456 e 0 to 1 dl 1238828107 ref 2 fl Rpc:N/0/0 rc 0/0 > > Apr  4 08:53:10 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) Skipped 31 previous similar messages > > Apr  4 08:53:10 eul0605 kernel: Lustre: Request x50237 sent from MGC172.17.31.247 at o2ib to NID 172.17.31.246 at o2ib 50s ago has timed out (limit > > 300s). 
> > Apr  4 08:53:10 eul0605 kernel: Lustre: Skipped 31 previous similar messages > > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > > Apr  4 08:53:11 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > > Apr  4 08:53:11 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > > That multicast group looks like the IPv4 broadcast group; -11 is > EAGAIN. I'm not sure what's causing IPoIB to indicate this but I > wonder if this is a second level failure due to the previous (Lustre) > error detected. > > -- Hal > > > ===> So we see the reason why Lustre lost network connection - infiniband is down. > > > > > > In most cases IB recovers from that situation, not always. If it then entirely > > fails, ibnetdiscover or ibclearerrors will report that can't resolve the route > > to these nodes. > > > > > > This with drivers from ofed-1.3.1. Any ideas why OOM causes issues with IB? Are you getting any errors on the console from the kernel on these nodes? Specifically from the HCA (I think it was mlx4) driver? If the nodes recover I assume that means the ib0 errors go away and lustre reconnects? 
Ira > > > > > > Thanks, > > Bernd From sashak at voltaire.com Tue Apr 14 09:15:38 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 14 Apr 2009 19:15:38 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCHv3] opensm/osm_helper.c: Add more info for traps 144 and 256-259 in osm_dump_notice In-Reply-To: <20090414144547.GA32013@comcast.net> References: <20090414144547.GA32013@comcast.net> Message-ID: <20090414161538.GJ5519@sk> On 10:45 Tue 14 Apr , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. See note below. > + for (i = 0; > + i <= (p_ntci->data_details.ntc_256.dr_trunc_hop & 0x3f); > + i++) { > + if (i == 0) > + n += snprintf(buff + n, sizeof(buff) - n, "%d", > + p_ntci->data_details.ntc_256.dr_rtn_path[i]); > + else > + n += snprintf(buff + n, sizeof(buff) - n, ",%d", > + p_ntci->data_details.ntc_256.dr_rtn_path[i]); When snprintf() overflows it returns number of bytes which would be written otherwise, so return value should be checked anyway. So I'm adding this: if (n >= sizeof(buf)) { n = sizeof(buff) - 2; break; } (in order to preserve space for new line). Sasha From hal.rosenstock at gmail.com Tue Apr 14 10:03:40 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 14 Apr 2009 13:03:40 -0400 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Re: [PATCHv3] opensm/osm_helper.c: Add more info for traps 144 and 256-259 in osm_dump_notice In-Reply-To: <20090414161538.GJ5519@sk> References: <20090414144547.GA32013@comcast.net> <20090414161538.GJ5519@sk> Message-ID: On Tue, Apr 14, 2009 at 12:15 PM, Sasha Khapyorsky wrote: > On 10:45 Tue 14 Apr     , Hal Rosenstock wrote: >> >> Signed-off-by: Hal Rosenstock > > Applied. Thanks. See note below. 
> >> +                             for (i = 0; >> +                                  i <= (p_ntci->data_details.ntc_256.dr_trunc_hop & 0x3f); >> +                                  i++) { >> +                                     if (i == 0) >> +                                             n += snprintf(buff + n, sizeof(buff) - n, "%d", >> +                                                     p_ntci->data_details.ntc_256.dr_rtn_path[i]); >> +                                     else >> +                                             n += snprintf(buff + n, sizeof(buff) - n, ",%d", >> +                                                     p_ntci->data_details.ntc_256.dr_rtn_path[i]); > > When snprintf() overflows it returns number of bytes which would be > written otherwise, so return value should be checked anyway. So I'm > adding this: > >        if (n >= sizeof(buf)) { ^^^ buff >                n = sizeof(buff) - 2; >                break; >        } > > (in order to preserve space for new line). Sounds right. Doesn't this same issue exist elsewhere in opensm where snprintf is used and the return value is not checked in comparison to the size supplied ? 
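For reference, the snprintf() behaviour discussed above can be sketched in isolation. This is only an illustration under stated assumptions: format_path() and its parameters are hypothetical names, not opensm code. It shows the pattern Sasha describes: snprintf() returns the length that would have been written, so the running offset has to be checked against the buffer size before it is used again, otherwise "size - n" wraps around on the next call.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of opensm): format path[0..hops] as a
 * comma-separated list into buf, never writing past buf[size - 1].
 * Returns the number of characters kept in buf, excluding the NUL.
 */
static size_t format_path(char *buf, size_t size,
			  const uint8_t *path, unsigned hops)
{
	size_t n = 0;
	unsigned i;

	if (size < 2) {
		if (size)
			buf[0] = '\0';
		return 0;
	}

	for (i = 0; i <= hops; i++) {
		int ret = snprintf(buf + n, size - n,
				   i ? ",%d" : "%d", path[i]);

		if (ret < 0)	/* output error: keep what we have so far */
			break;

		n += (size_t)ret;

		/*
		 * On truncation snprintf() returns the length it wanted to
		 * write, so n may now be >= size. Clamp it and stop, leaving
		 * one byte free for a trailing newline plus the NUL.
		 */
		if (n >= size) {
			n = size - 2;
			break;
		}
	}

	buf[n] = '\0';
	return n;
}
```

A caller would then append its newline at buf + n, which is why the clamp leaves two bytes rather than one.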
-- Hal > > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From hal.rosenstock at gmail.com Tue Apr 14 10:12:36 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 14 Apr 2009 13:12:36 -0400 Subject: [ofa-general] ***SPAM*** Re: [PATCHv3] libibmad/rpc.c: Handle redirection status In-Reply-To: <20090414143254.GH5519@sk> References: <20090311113059.GB12004@comcast.net> <20090414135600.GD5519@sk> <20090414142002.GF5519@sk> <20090414143254.GH5519@sk> Message-ID: On Tue, Apr 14, 2009 at 10:32 AM, Sasha Khapyorsky wrote: > On 10:27 Tue 14 Apr     , Hal Rosenstock wrote: >> On Tue, Apr 14, 2009 at 10:20 AM, Sasha Khapyorsky wrote: >> > On 10:12 Tue 14 Apr     , Hal Rosenstock wrote: >> >> > >> >> > Do you know any example where this will not work properly? I don't. >> >> >> >> Not currently but it's a spec compliance issue. >> > >> > I don't think we want to slow down the fast path without a real need. >> >> Isn't compliance a real need ? > > It is compliant now - we agreed there are no cases yet when it is not. Huh ? What we agreed on was that there is no known use right now, but that doesn't mean a compliance test couldn't detect this. That's what compliance means to me.
-- Hal > Sasha From hal.rosenstock at gmail.com Tue Apr 14 10:27:49 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 14 Apr 2009 13:27:49 -0400 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Re: [PATCHv3] opensm/osm_helper.c: Add more info for traps 144 and 256-259 in osm_dump_notice In-Reply-To: References: <20090414144547.GA32013@comcast.net> <20090414161538.GJ5519@sk> Message-ID: On Tue, Apr 14, 2009 at 1:03 PM, Hal Rosenstock wrote: >> When snprintf() overflows it returns number of bytes which would be >> written otherwise, so return value should be checked anyway. So I'm >> adding this: >> >>        if (n >= sizeof(buf)) { >                               ^^^ >                               buff > >>                n = sizeof(buff) - 2; >>                break; >>        } >> >> (in order to preserve space for new line). > > Sounds right. > > Doesn't this same issue exist elsewhere in opensm where snprintf is > used and the return value is not checked in comparison to the size > supplied ? I take that back; I audited the other places and they look fine to me. There are some changes which will make it less likely to fail if some buffer size is changed though. I will make up a patch for that in due time. 
-- Hal From hnrose at comcast.net Tue Apr 14 11:53:19 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Tue, 14 Apr 2009 14:53:19 -0400 Subject: [ofa-general] ***SPAM*** [PATCH] opensm: Add Dell to known vendor list Message-ID: <20090414185319.GA4413@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h index 2d0ecd7..e973a70 100644 --- a/opensm/include/opensm/osm_base.h +++ b/opensm/include/opensm/osm_base.h @@ -871,6 +871,7 @@ typedef enum _osm_sm_signal { #define OSM_VENDOR_ID_3LEAFNTWKS 0x0016A1 #define OSM_VENDOR_ID_XSIGO 0x001397 #define OSM_VENDOR_ID_HP2 0x0018FE +#define OSM_VENDOR_ID_DELL 0x00188B /**********/ diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c index 10547fa..ac4b372 100644 --- a/opensm/opensm/osm_helper.c +++ b/opensm/opensm/osm_helper.c @@ -2234,6 +2234,7 @@ const char *osm_get_manufacturer_str(IN uint64_t const guid_ho) static const char *sun_str = "Sun"; static const char *leafntwks_str = "3LeafNtwks"; static const char *xsigo_str = "Xsigo"; + static const char *dell_str = "Dell"; static const char *unknown_str = "Unknown"; switch ((uint32_t) (guid_ho >> (5 * 8))) { @@ -2285,6 +2286,8 @@ const char *osm_get_manufacturer_str(IN uint64_t const guid_ho) return (leafntwks_str); case OSM_VENDOR_ID_XSIGO: return (xsigo_str); + case OSM_VENDOR_ID_DELL: + return (dell_str); default: return (unknown_str); } From hal.rosenstock at gmail.com Tue Apr 14 12:28:24 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 14 Apr 2009 15:28:24 -0400 Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH] ibsim: Add SMSL support to PortInfo attribute In-Reply-To: <20090412092502.GG7664@sk> References: <20090324182510.GA18072@comcast.net> <20090324182213.GF20085@sashak.voltaire.com> <20090412092502.GG7664@sk> Message-ID: On Sun, Apr 12, 2009 at 5:25 AM, Sasha Khapyorsky wrote: > On 14:39 Tue 24 Mar     , Hal Rosenstock wrote: >> > >> > What is a purpose 
of this? Do you have any plans to use this field? >> > >> > If no, I don't see what this patch adds - SMSL is handled already as part >> > of PortInfo buffer. >> >> It's needed when SMSL is not 0 (e.g. Line's recent patch for lash). > > Ok. I see. Actually the problem is that in do_portinfo() received > PortInfo is not copied to target port's PortInfo (as I thought) and > update is done for only selected fields. > > Wouldn't it be better to rework it in the way where we will not need to > store useless (for simulator) PortInfo values as separate port structure > fields? So incoming PortInfo buffer will be just copied (of course with > caring about special fields - states, RO, etc..). Perhaps but that seems like a separate cleanup to me. -- Hal > Sasha > From swise at opengridcomputing.com Tue Apr 14 12:53:42 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 14 Apr 2009 14:53:42 -0500 Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Don't zero the qp attrs when moving to IDLE. Message-ID: <20090414195342.16529.35283.stgit@build.ogc.int> QP attributes must stay initialized when moving back to IDLE. Zeroing them will crash the system in _flush_qp() if the QP is subsequently moved to ERROR and back to IDLE. 
Signed-off-by: Steve Wise --- drivers/infiniband/hw/cxgb3/iwch_qp.c | 1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c index 2f546a6..27bbdc8 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_qp.c +++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c @@ -1069,7 +1069,6 @@ int iwch_modify_qp(struct iwch_dev *rhp, struct iwch_qp *qhp, goto out; } qhp->attr.state = IWCH_QP_STATE_IDLE; - memset(&qhp->attr, 0, sizeof(qhp->attr)); break; case IWCH_QP_STATE_TERMINATE: if (!internal) { From faisal.latif at intel.com Tue Apr 14 14:20:18 2009 From: faisal.latif at intel.com (Faisal Latif) Date: Tue, 14 Apr 2009 16:20:18 -0500 Subject: [ofa-general] [PATCH] RDMA:nes: improve cm_id reference count handling Message-ID: <20090414212017.GA9084@flatif-MOBL> The cm_id reference count is now incremented only from connect, accept, reject, and listen. The reference count is decremented only when the cm_node is freed. A couple of error-handling improvements made while reworking the cm_id reference counting are also included.
Signed-off-by: Faisal Latif --- drivers/infiniband/hw/nes/nes_cm.c | 41 +++++++++++++++----------------- drivers/infiniband/hw/nes/nes_hw.c | 1 - drivers/infiniband/hw/nes/nes_verbs.c | 2 - 3 files changed, 19 insertions(+), 25 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c index 61da9d3..57d867e 100644 --- a/drivers/infiniband/hw/nes/nes_cm.c +++ b/drivers/infiniband/hw/nes/nes_cm.c @@ -490,7 +490,6 @@ static void nes_retrans_expired(struct nes_cm_node *cm_node) static void handle_recv_entry(struct nes_cm_node *cm_node, u32 rem_node) { struct nes_timer_entry *recv_entry = cm_node->recv_entry; - struct iw_cm_id *cm_id = cm_node->cm_id; struct nes_qp *nesqp; unsigned long qplockflags; @@ -522,8 +521,6 @@ static void handle_recv_entry(struct nes_cm_node *cm_node, u32 rem_node) /* TIME_WAIT state */ rem_ref_cm_node(cm_node->cm_core, cm_node); } - if (cm_node->cm_id) - cm_id->rem_ref(cm_id); kfree(recv_entry); cm_node->recv_entry = NULL; } @@ -994,8 +991,6 @@ static int mini_cm_dec_refcnt_listen(struct nes_cm_core *cm_core, event.cm_info.cm_id = cm_node->cm_id; cm_event_reset(&event); - rem_ref_cm_node(cm_node->cm_core, - cm_node); } } @@ -1219,6 +1214,7 @@ static int rem_ref_cm_node(struct nes_cm_core *cm_core, { unsigned long flags; struct nes_qp *nesqp; + struct iw_cm_id *cm_id = cm_node->cm_id; if (!cm_node) return -EINVAL; @@ -1260,6 +1256,14 @@ static int rem_ref_cm_node(struct nes_cm_core *cm_core, nes_rem_ref(&nesqp->ibqp); cm_node->nesqp = NULL; } + if (cm_id) { + if (cm_node->listener) { + if (cm_node->cm_id != cm_node->listener->cm_id) + cm_id->rem_ref(cm_id); + } else { + cm_id->rem_ref(cm_id); + } + } kfree(cm_node); return 0; @@ -1410,14 +1414,10 @@ static void handle_rst_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb, drop_packet(skb); break; case NES_CM_STATE_LAST_ACK: - cm_node->cm_id->rem_ref(cm_node->cm_id); + case NES_CM_STATE_FIN_WAIT1: case NES_CM_STATE_TIME_WAIT: - cm_node->state = 
NES_CM_STATE_CLOSED; - rem_ref_cm_node(cm_node->cm_core, cm_node); - drop_packet(skb); + passive_open_err(cm_node, skb, reset); break; - case NES_CM_STATE_FIN_WAIT1: - nes_debug(NES_DBG_CM, "Bad state %s[%u]\n", __func__, __LINE__); default: drop_packet(skb); break; @@ -1721,7 +1721,6 @@ static int handle_ack_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb, case NES_CM_STATE_CLOSING: cleanup_retrans_entry(cm_node); cm_node->state = NES_CM_STATE_CLOSED; - cm_node->cm_id->rem_ref(cm_node->cm_id); rem_ref_cm_node(cm_node->cm_core, cm_node); drop_packet(skb); break; @@ -2101,6 +2100,7 @@ static int mini_cm_reject(struct nes_cm_core *cm_core, passive_state = atomic_add_return(1, &cm_node->passive_state); if (passive_state == NES_SEND_RESET_EVENT) { cm_node->state = NES_CM_STATE_CLOSED; + cm_id->add_ref(cm_id); rem_ref_cm_node(cm_core, cm_node); } else { ret = send_mpa_reject(cm_node); @@ -2126,7 +2126,6 @@ static int mini_cm_reject(struct nes_cm_core *cm_core, cm_id = loopback->cm_id; rem_ref_cm_node(cm_core, loopback); - cm_id->rem_ref(cm_id); } return ret; @@ -2588,7 +2587,6 @@ static int nes_cm_disconn_true(struct nes_qp *nesqp) nes_debug(NES_DBG_CM, "OFA CM event_handler returned, ret=%d\n", ret); } - cm_id->rem_ref(cm_id); spin_lock_irqsave(&nesqp->lock, flags); if (nesqp->flush_issued == 0) { @@ -2708,7 +2706,6 @@ int nes_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) /* associate the node with the QP */ nesqp->cm_node = (void *)cm_node; cm_node->nesqp = nesqp; - nes_add_ref(&nesqp->ibqp); nes_debug(NES_DBG_CM, "QP%u, cm_node=%p, jiffies = %lu listener = %p\n", nesqp->hwqp.qp_id, cm_node, jiffies, cm_node->listener); @@ -2761,6 +2758,9 @@ int nes_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) nes_debug(NES_DBG_CM, "Unable to register memory region" "for lSMM for cm_node = %p \n", cm_node); + pci_free_consistent(nesdev->pcidev, + nesqp->private_data_len+sizeof(struct ietf_mpa_frame), + nesqp->ietf_frame, 
nesqp->ietf_frame_pbase); return -ENOMEM; } @@ -2797,6 +2797,8 @@ int nes_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) /* Cache the cm_id in the qp */ + nes_add_ref(&nesqp->ibqp); + cm_id->add_ref(cm_id); nesqp->cm_id = cm_id; cm_node->cm_id = cm_id; @@ -2875,8 +2877,7 @@ int nes_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) sizeof(struct ietf_mpa_frame)); - /* notify OF layer that accept event was successful */ - cm_id->add_ref(cm_id); + /* notify OF layer that accept event was successfull */ cm_event.event = IW_CM_EVENT_ESTABLISHED; cm_event.status = IW_CM_EVENT_STATUS_ACCEPTED; @@ -3360,7 +3361,6 @@ static void cm_event_connect_error(struct nes_cm_event *event) if (ret) printk(KERN_ERR "%s[%u] OFA CM event_handler returned, " "ret=%d\n", __func__, __LINE__, ret); - cm_id->rem_ref(cm_id); rem_ref_cm_node(event->cm_node->cm_core, event->cm_node); return; @@ -3400,7 +3400,6 @@ static void cm_event_reset(struct nes_cm_event *event) cm_event.private_data_len = 0; ret = cm_id->event_handler(cm_id, &cm_event); - cm_id->add_ref(cm_id); atomic_inc(&cm_closes); cm_event.event = IW_CM_EVENT_CLOSE; cm_event.status = IW_CM_EVENT_STATUS_OK; @@ -3416,7 +3415,7 @@ static void cm_event_reset(struct nes_cm_event *event) /* notify OF layer about this connection error event */ - cm_id->rem_ref(cm_id); + rem_ref_cm_node(event->cm_node->cm_core, event->cm_node); return; } @@ -3518,7 +3517,6 @@ static int nes_cm_post_event(struct nes_cm_event *event) { atomic_inc(&event->cm_node->cm_core->events_posted); add_ref_cm_node(event->cm_node); - event->cm_info.cm_id->add_ref(event->cm_info.cm_id); INIT_WORK(&event->event_work, nes_cm_event_handler); nes_debug(NES_DBG_CM, "cm_node=%p queue_work, event=%p\n", event->cm_node, event); @@ -3590,7 +3588,6 @@ static void nes_cm_event_handler(struct work_struct *work) } atomic_dec(&cm_core->events_posted); - event->cm_info.cm_id->rem_ref(event->cm_info.cm_id); rem_ref_cm_node(cm_core, event->cm_node); 
kfree(event); diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c index d6fc9ae..9c9e4ff 100644 --- a/drivers/infiniband/hw/nes/nes_hw.c +++ b/drivers/infiniband/hw/nes/nes_hw.c @@ -2944,7 +2944,6 @@ static void nes_process_iwarp_aeqe(struct nes_device *nesdev, case NES_AEQE_AEID_LLP_FIN_RECEIVED: nesqp = *((struct nes_qp **)&context); if (atomic_inc_return(&nesqp->close_timer_started) == 1) { - nesqp->cm_id->add_ref(nesqp->cm_id); schedule_nes_timer(nesqp->cm_node, (struct sk_buff *)nesqp, NES_TIMER_TYPE_CLOSE, 1, 0); nes_debug(NES_DBG_AEQ, "QP%u Not decrementing QP refcount (%d)," diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c index dab7e2f..ad9c1f5 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.c +++ b/drivers/infiniband/hw/nes/nes_verbs.c @@ -1542,7 +1542,6 @@ static int nes_destroy_qp(struct ib_qp *ibqp) "QP%u. cm_id = %p, refcount = %u. \n", nesqp->hwqp.qp_id, cm_id, atomic_read(&nesqp->refcount)); - cm_id->rem_ref(cm_id); ret = cm_id->event_handler(cm_id, &cm_event); if (ret) nes_debug(NES_DBG_QP, "OFA CM event_handler returned, ret=%d\n", ret); @@ -3178,7 +3177,6 @@ int nes_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, if (nesqp->cm_id) { /* These two are for the timer thread */ if (atomic_inc_return(&nesqp->close_timer_started) == 1) { - nesqp->cm_id->add_ref(nesqp->cm_id); nes_debug(NES_DBG_MOD_QP, "QP%u Not decrementing QP refcount (%d)," " need ae to finish up, original_last_aeq = 0x%04X." 
" last_aeq = 0x%04X, scheduling timer.\n", -- 1.5.3.3 From YJia at tmriusa.com Tue Apr 14 16:49:44 2009 From: YJia at tmriusa.com (Yicheng Jia) Date: Tue, 14 Apr 2009 18:49:44 -0500 Subject: [ofa-general] change port's physical state without reset HCA In-Reply-To: <49D641B2.5010104@morey-chaisemartin.com> Message-ID: Hi Nicolas, I tried with "ibportstate reset" but I got the following error: ibwarn: [19660] mad_rpc: _do_madrpc failed; dport (Lid 7) ibportstate: iberror: failed: smp set portinfo failed Have you ever run it successfully on Qlogic 9024 switch? Thanks! Yicheng Jia Nicolas Morey-Chaisemartin 04/03/2009 12:05 PM Please respond to devel at morey-chaisemartin.com To Yicheng Jia cc general at lists.openfabrics.org Subject Re: [ofa-general] change port's physical state without reset HCA No you just need to do one or the other. It's not necessaray to do both. Nicolas Yicheng Jia a écrit : > > Hi Nicolas, > > Do I need to restart HCA driver if I just use ibportstate to reset the > cable on the switch side? > > Thanks! > > Yicheng Jia > > > > > *Nicolas Morey-Chaisemartin * > > 04/03/2009 11:47 AM > Please respond to > devel at morey-chaisemartin.com > > _____________________________________________________________________________ Scanned by IBM Email Security Management Services powered by MessageLabs. For more information please visit http://www.ers.ibm.com _____________________________________________________________________________ _____________________________________________________________________________ Scanned by IBM Email Security Management Services powered by MessageLabs. For more information please visit http://www.ers.ibm.com _____________________________________________________________________________ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From acceptany at gmail.com Tue Apr 14 19:39:19 2009 From: acceptany at gmail.com (Jordan) Date: Wed, 15 Apr 2009 10:39:19 +0800 Subject: [ofa-general] ***SPAM*** some problem about the forward tables in up*/down* algorithm Message-ID: <91fe68d50904141939s5da57c7bie6cdef8b929f6e21@mail.gmail.com> Recently I have been reading the source code of the up*/down* routing algorithm. It seems that this algorithm only updates hops[lid_no][port] and does not update the lft (linear forwarding table). So how does the switch forward a packet? Does the switch look up hops[lid_no][port] to forward the packet? Another question: there are two arrays, lft and new_lft. I don't understand the difference between these two tables; can anyone explain? From vlad at lists.openfabrics.org Wed Apr 15 03:24:56 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 15 Apr 2009 03:24:56 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090415-0200 daily build status Message-ID: <20090415102457.15609E613E8@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with
linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From rkrishnakumar at gmail.com Wed Apr 15 05:56:50 2009 From: rkrishnakumar at gmail.com (Krishnakumar R) Date: Wed, 15 Apr 2009 18:26:50 +0530 Subject: [ofa-general] ***SPAM*** Regarding QOS from the HCA side in OFED Message-ID: <89c400ad0904150556n2854b2e3w30344a825edda67e@mail.gmail.com> Hi Folks, Is it possible to mark data sent out from an HCA so that it falls into a certain service level? That is, does OFED currently support features to the extent that one can set up rules on the host side to ensure that traffic from a certain application is assigned to a certain SL (given that the global policy part is set up with the opensm features already available)? -- Thanks and Regards, KK.
From chocapiiic.tiery at gmail.com Wed Apr 15 05:58:42 2009 From: chocapiiic.tiery at gmail.com (Thierry) Date: Wed, 15 Apr 2009 14:58:42 +0200 Subject: [ofa-general] RDMA over infiniband, diffrences between rdam_cm and libmthca-rdmav2 Message-ID: <8d9c773c0904150558n64169bffl8db8445f66c029dc@mail.gmail.com> Hi, I am new to InfiniBand, and I am doing some research on RDMA. I have found two different ways of sending data on InfiniBand products using RDMA. The first one uses the rdma_cm module (from the kernel source code), and the second one uses libmthca-rdmav2/libibverbs. I have tried to monitor each program using strace and systemtap (on kernel 2.6.18, centos52), but the two use different libraries. Could someone explain the differences between these two types of programming? Thierry -------------------------------- using libibverbs: ibv_open_device() ibv_alloc_pd() ibv_reg_mr() ibv_create_cq() ibv_create_qp() ibv_modify_qp() etc.. --------------------------------- using rdma_cm module: /* * build: * cc -o server server.c -lrdmacm * * usage: * server * * waits for client to connect, receives two integers, and sends their * sum back to the client.
*/ #include #include #include #include #include enum { RESOLVE_TIMEOUT_MS = 5000, }; struct pdata { uint64_t buf_va; uint32_t buf_rkey; }; int main(int argc, char *argv[]) { struct pdata rep_pdata; struct rdma_event_channel *cm_channel; struct rdma_cm_id *listen_id; struct rdma_cm_id *cm_id; struct rdma_cm_event *event; struct rdma_conn_param conn_param = { }; struct ibv_pd *pd; struct ibv_comp_channel *comp_chan; struct ibv_cq *cq; struct ibv_cq *evt_cq; struct ibv_mr *mr; struct ibv_qp_init_attr qp_attr = { }; struct ibv_sge sge; struct ibv_send_wr send_wr = { }; struct ibv_send_wr *bad_send_wr; struct ibv_recv_wr recv_wr = { }; struct ibv_recv_wr *bad_recv_wr; struct ibv_wc wc; void *cq_context; struct sockaddr_in sin; uint32_t *buf; int err; /* Set up RDMA CM structures */ cm_channel = rdma_create_event_channel(); if (!cm_channel) return 1; err = rdma_create_id(cm_channel, &listen_id, NULL, RDMA_PS_TCP); if (err) return err; sin.sin_family = AF_INET; sin.sin_port = htons(20079); sin.sin_addr.s_addr = INADDR_ANY; /* Bind to local port and listen for connection request */ err = rdma_bind_addr(listen_id, (struct sockaddr *) &sin); if (err) return 1; err = rdma_listen(listen_id, 1); if (err) return 1; err = rdma_get_cm_event(cm_channel, &event); if (err) return err; if (event->event != RDMA_CM_EVENT_CONNECT_REQUEST) return 1; cm_id = event->id; rdma_ack_cm_event(event); /* Create verbs objects now that we know which device to use */ pd = ibv_alloc_pd(cm_id->verbs); if (!pd) return 1; comp_chan = ibv_create_comp_channel(cm_id->verbs); if (!comp_chan) return 1; cq = ibv_create_cq(cm_id->verbs, 2, NULL, comp_chan, 0); if (!cq) return 1; if (ibv_req_notify_cq(cq, 0)) return 1; buf = calloc(2, sizeof (uint32_t)); if (!buf) return 1; mr = ibv_reg_mr(pd, buf, 2 * sizeof (uint32_t), IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE); if (!mr) return 1; qp_attr.cap.max_send_wr = 1; qp_attr.cap.max_send_sge = 1; qp_attr.cap.max_recv_wr = 1; 
qp_attr.cap.max_recv_sge = 1; qp_attr.send_cq = cq; qp_attr.recv_cq = cq; qp_attr.qp_type = IBV_QPT_RC; err = rdma_create_qp(cm_id, pd, &qp_attr); if (err) return err; /* Post receive before accepting connection */ sge.addr = (uintptr_t) buf + sizeof (uint32_t); sge.length = sizeof (uint32_t); sge.lkey = mr->lkey; recv_wr.sg_list = &sge; recv_wr.num_sge = 1; if (ibv_post_recv(cm_id->qp, &recv_wr, &bad_recv_wr)) return 1; rep_pdata.buf_va = htonll((uintptr_t) buf); rep_pdata.buf_rkey = htonl(mr->rkey); conn_param.responder_resources = 1; conn_param.private_data = &rep_pdata; conn_param.private_data_len = sizeof rep_pdata; /* Accept connection */ err = rdma_accept(cm_id, &conn_param); if (err) return 1; err = rdma_get_cm_event(cm_channel, &event); if (err) return err; if (event->event != RDMA_CM_EVENT_ESTABLISHED) return 1; rdma_ack_cm_event(event); /* Wait for receive completion */ if (ibv_get_cq_event(comp_chan, &evt_cq, &cq_context)) return 1; if (ibv_req_notify_cq(cq, 0)) return 1; if (ibv_poll_cq(cq, 1, &wc) < 1) return 1; if (wc.status != IBV_WC_SUCCESS) return 1; /* Add two integers and send reply back */ buf[0] = htonl(ntohl(buf[0]) + ntohl(buf[1])); sge.addr = (uintptr_t) buf; sge.length = sizeof (uint32_t); sge.lkey = mr->lkey; send_wr.opcode = IBV_WR_SEND; send_wr.send_flags = IBV_SEND_SIGNALED; send_wr.sg_list = &sge; send_wr.num_sge = 1; if (ibv_post_send(cm_id->qp, &send_wr, &bad_send_wr)) return 1; /* Wait for send completion */ if (ibv_get_cq_event(comp_chan, &evt_cq, &cq_context)) return 1; if (ibv_poll_cq(cq, 1, &wc) < 1) return 1; if (wc.status != IBV_WC_SUCCESS) return 1; ibv_ack_cq_events(cq, 2); return 0; } -------------------------------------- From sashak at voltaire.com Wed Apr 15 05:59:25 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 15 Apr 2009 15:59:25 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCHv2] opensm/PerfMgr: Better redirection support In-Reply-To: <20090312202134.GC25024@comcast.net> References: 
<20090312202134.GC25024@comcast.net> Message-ID: <20090415125925.GF7353@sk> Hi Hal, On 15:21 Thu 12 Mar , Hal Rosenstock wrote: > > Handle PKey and QPN redirection information > GID redirection handling remains > > Signed-off-by: Hal Rosenstock > > --- > Changes since v1: > Added include of osm_helper.h to osm_perfmgr.c > > Notes: osm_console redirection patch is unchanged > Also, relies on ib_gid_is_notzero patch > > diff --git a/opensm/include/opensm/osm_perfmgr.h b/opensm/include/opensm/osm_perfmgr.h > index 45dec54..651ea80 100644 > --- a/opensm/include/opensm/osm_perfmgr.h > +++ b/opensm/include/opensm/osm_perfmgr.h > @@ -92,8 +92,14 @@ typedef enum { > > /* Redirection information */ > typedef struct redir { > + boolean_t redirection; > + boolean_t invalid; Why using lid value != 0 is/was bad for redirection invalidation? > + ib_gid_t redir_gid; > ib_net16_t redir_lid; > + ib_net16_t redir_pkey; > ib_net32_t redir_qp; > + uint16_t redir_pkey_ix; Don't need to repeat structure name (redir) in field names - it just makes lines longer - 'redir->pkey_idx' looks pretty clear. 
> + ib_net16_t orig_lid; > } redir_t; > > /* Node to store information about which nodes we are monitoring */ > @@ -134,6 +140,7 @@ typedef struct osm_perfmgr { > uint32_t max_outstanding_queries; > cl_qmap_t monitored_map; /* map the nodes we are tracking */ > __monitored_node_t *remove_list; > + int16_t local_port; > } osm_perfmgr_t; > /* > * FIELDS > diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c > index 4a6f65c..fc1b7cb 100644 > --- a/opensm/opensm/osm_perfmgr.c > +++ b/opensm/opensm/osm_perfmgr.c > @@ -47,7 +47,6 @@ > #endif /* HAVE_CONFIG_H */ > > #ifdef ENABLE_OSM_PERF_MGR > - > #include > #include > #include > @@ -65,8 +64,11 @@ > #include > #include > #include > +#include > > #define OSM_PERFMGR_INITIAL_TID_VALUE 0xcafe > +#define MAX_LOCAL_IBPORTS 64 > +#define MAX_LOCAL_PKEYS 256 > > #if ENABLE_OSM_PERF_MGR_PROFILE > struct { > @@ -118,8 +120,6 @@ static inline void diff_time(struct timeval *before, > } > #endif > > -extern int wait_for_pending_transactions(osm_stats_t * stats); > - > /********************************************************************** > * Internal helper functions. 
> **********************************************************************/ > @@ -203,6 +203,7 @@ osm_perfmgr_mad_send_err_callback(void *bind_context, osm_madw_t * p_madw) > uint8_t port = context->perfmgr_context.port; > cl_map_item_t *p_node; > __monitored_node_t *p_mon_node; > + ib_net16_t orig_lid; > > OSM_LOG_ENTER(pm->log); > > @@ -233,9 +234,10 @@ osm_perfmgr_mad_send_err_callback(void *bind_context, osm_madw_t * p_madw) > p_mon_node->guid, p_mon_node->redir_tbl_size); > goto Exit; > } > - /* Clear redirection info */ > - p_mon_node->redir_port[port].redir_lid = 0; > - p_mon_node->redir_port[port].redir_qp = 0; > + /* Clear redirection info for this port except orig_lid */ > + orig_lid = p_mon_node->redir_port[port].orig_lid; > + memset(&p_mon_node->redir_port[port], 0, sizeof(redir_t)); > + p_mon_node->redir_port[port].orig_lid = orig_lid; Hmm, why should 'orig_lid' be part of redirection structure and not placed on original node/port (below I see that it is used in non-redirected paths)? I think it would be better to use structures like: struct node { .... uint16_t lid; uint16_t pkey_ix; unsigned num_ports; struct port { .... struct redir_info { /* or even same 'struct port' */ ... 
} *ri; } ports[0]; } > cl_plock_release(pm->lock); > } > > @@ -292,6 +294,40 @@ osm_perfmgr_bind(osm_perfmgr_t * const pm, const ib_net64_t port_guid) > goto Exit; > } > > + /* if redirection enabled, determine local port from port GUID */ > + if (pm->subn->opt.perfmgr_redir) { > + ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS]; > + uint32_t num_ports = MAX_LOCAL_IBPORTS; > + int i; > + > + for (i = 0; i < num_ports; i++) { > + attr_array[i].num_pkeys = 0; > + attr_array[i].p_pkey_table = NULL; > + } > + > + /* call transport layer for a list of local port GUIDs */ > + status = osm_vendor_get_all_port_attr(pm->subn->p_osm->p_vendor, > + attr_array, &num_ports); > + if (status != IB_SUCCESS) { > + OSM_LOG(pm->log, OSM_LOG_ERROR, > + "ERR 4C1C: osm_vendor_get_all_port_attr status 0x%x\n", > + status); > + goto Exit; > + } > + if (num_ports == 0) { > + OSM_LOG(pm->log, OSM_LOG_ERROR, > + "ERR 4C1D: No local ports detected!\n"); > + goto Exit; > + } > + > + for (i = 0; i < num_ports; i++) { > + if (port_guid == attr_array[i].port_guid) { > + pm->local_port = attr_array[i].port_num; > + break; > + } > + } > + } > + PerfMgr is always running over discovered fabric so maybe local port number should be detected later at start of PerfMgr process cycle just using OpenSM DB. Also see comment below about pkey validation per redirection request. 
> Exit: > OSM_LOG_EXIT(pm->log); > return (status); > @@ -321,25 +357,16 @@ static ib_net32_t get_qp(__monitored_node_t * mon_node, uint8_t port) > > if (mon_node && mon_node->redir_tbl_size && > port < mon_node->redir_tbl_size && > - mon_node->redir_port[port].redir_lid && > + mon_node->redir_port[port].redirection && > mon_node->redir_port[port].redir_qp) > qp = mon_node->redir_port[port].redir_qp; > > return qp; > } > > -/********************************************************************** > - * Given a node, a port, and an optional monitored node, > - * return the appropriate lid to query that port > - **********************************************************************/ > static ib_net16_t > -get_lid(osm_node_t * p_node, uint8_t port, __monitored_node_t * mon_node) > +get_base_lid(osm_node_t * p_node, uint8_t port) > { > - if (mon_node && mon_node->redir_tbl_size && > - port < mon_node->redir_tbl_size && > - mon_node->redir_port[port].redir_lid) > - return mon_node->redir_port[port].redir_lid; > - > switch (p_node->node_info.node_type) { > case IB_NODE_TYPE_CA: > case IB_NODE_TYPE_ROUTER: > @@ -352,12 +379,27 @@ get_lid(osm_node_t * p_node, uint8_t port, __monitored_node_t * mon_node) > } > > /********************************************************************** > + * Given a node, a port, and an optional monitored node, > + * return the lid appropriate to query that port > + **********************************************************************/ > +static ib_net16_t > +get_lid(osm_node_t * p_node, uint8_t port, __monitored_node_t * mon_node) > +{ > + if (mon_node && mon_node->redir_tbl_size && > + port < mon_node->redir_tbl_size && > + mon_node->redir_port[port].redir_lid) > + return mon_node->redir_port[port].redir_lid; > + > + return get_base_lid(p_node, port); > +} > + > +/********************************************************************** > * Form and send the Port Counters MAD for a single port. 
> **********************************************************************/ > static ib_api_status_t > osm_perfmgr_send_pc_mad(osm_perfmgr_t * perfmgr, ib_net16_t dest_lid, > - ib_net32_t dest_qp, uint8_t port, uint8_t mad_method, > - osm_madw_context_t * const p_context) > + ib_net32_t dest_qp, uint16_t pkey_ix, uint8_t port, > + uint8_t mad_method, osm_madw_context_t * const p_context) > { > ib_api_status_t status = IB_SUCCESS; > ib_port_counters_t *port_counter = NULL; > @@ -396,8 +438,7 @@ osm_perfmgr_send_pc_mad(osm_perfmgr_t * perfmgr, ib_net16_t dest_lid, > p_madw->mad_addr.addr_type.gsi.remote_qp = dest_qp; > p_madw->mad_addr.addr_type.gsi.remote_qkey = > cl_hton32(IB_QP1_WELL_KNOWN_Q_KEY); > - /* FIXME what about other partitions */ > - p_madw->mad_addr.addr_type.gsi.pkey_ix = 0; > + p_madw->mad_addr.addr_type.gsi.pkey_ix = pkey_ix; > p_madw->mad_addr.addr_type.gsi.service_level = 0; > p_madw->mad_addr.addr_type.gsi.global_route = FALSE; > p_madw->resp_expected = TRUE; > @@ -432,28 +473,32 @@ static void __collect_guids(cl_map_item_t * const p_map_item, void *context) > uint64_t node_guid = cl_ntoh64(node->node_info.node_guid); > osm_perfmgr_t *pm = (osm_perfmgr_t *) context; > __monitored_node_t *mon_node = NULL; > - uint32_t size; > + uint32_t num_ports; > + int port; > > OSM_LOG_ENTER(pm->log); > > if (cl_qmap_get(&pm->monitored_map, node_guid) > == cl_qmap_end(&pm->monitored_map)) { > /* if not already in our map add it */ > - size = osm_node_get_num_physp(node); > - mon_node = malloc(sizeof(*mon_node) + sizeof(redir_t) * size); > + num_ports = osm_node_get_num_physp(node); > + mon_node = malloc(sizeof(*mon_node) + sizeof(redir_t) * num_ports); > if (!mon_node) { > OSM_LOG(pm->log, OSM_LOG_ERROR, "PerfMgr: ERR 4C06: " > "malloc failed: not handling node %s" > "(GUID 0x%" PRIx64 ")\n", node->print_desc, node_guid); > goto Exit; > } > - memset(mon_node, 0, sizeof(*mon_node) + sizeof(redir_t) * size); > + memset(mon_node, 0, sizeof(*mon_node) + 
sizeof(redir_t) * num_ports); > mon_node->guid = node_guid; > mon_node->name = strdup(node->print_desc); > - mon_node->redir_tbl_size = size; > + mon_node->redir_tbl_size = num_ports; > /* check for enhanced switch port 0 */ > mon_node->esp0 = (node->sw && > ib_switch_info_is_enhanced_port0(&node->sw->switch_info)); > + for (port = (mon_node->esp0) ? 0 : 1; port < num_ports; port++) > + mon_node->redir_port[port].orig_lid = get_base_lid(node, port); > + > cl_qmap_insert(&(pm->monitored_map), node_guid, > (cl_map_item_t *) mon_node); > } > @@ -511,6 +556,10 @@ __osm_perfmgr_query_counters(cl_map_item_t * const p_map_item, void *context) > if (!osm_node_get_physp_ptr(node, port)) > continue; > > + if (mon_node->redir_port[port].redirection && > + mon_node->redir_port[port].invalid) > + continue; > + Are two flags really needed? Couldn't this be stripped down? Also what about letting "chance" for port to refresh redirection info? > lid = get_lid(node, port, mon_node); > if (lid == 0) { > OSM_LOG(pm->log, OSM_LOG_DEBUG, "WARN: node 0x%" PRIx64 > @@ -532,8 +581,10 @@ __osm_perfmgr_query_counters(cl_map_item_t * const p_map_item, void *context) > PRIx64 " port %d (lid %u) (%s)\n", node_guid, port, > cl_ntoh16(lid), node->print_desc); > status = > - osm_perfmgr_send_pc_mad(pm, lid, remote_qp, port, > - IB_MAD_METHOD_GET, &mad_context); > + osm_perfmgr_send_pc_mad(pm, lid, remote_qp, > + mon_node->redir_port[port].redir_pkey_ix, > + port, IB_MAD_METHOD_GET, > + &mad_context); > if (status != IB_SUCCESS) > OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C09: " > "Failed to issue port counter query for node 0x%" > @@ -550,6 +601,7 @@ Exit: > * Discovery stuff. 
> * Basically this code should not be here, but merged with main OpenSM > **********************************************************************/ > +extern int wait_for_pending_transactions(osm_stats_t * stats); > extern void osm_drop_mgr_process(IN osm_sm_t *sm); > > static int sweep_hop_1(osm_sm_t * sm) > @@ -980,6 +1032,10 @@ osm_perfmgr_check_overflow(osm_perfmgr_t * pm, __monitored_node_t *mon_node, > osm_node_t *p_node = NULL; > ib_net16_t lid = 0; > > + if (mon_node->redir_port[port].redirection && > + mon_node->redir_port[port].invalid) > + goto Exit; > + > osm_log(pm->log, OSM_LOG_VERBOSE, > "PerfMgr: Counter overflow: %s (0x%" PRIx64 > ") port %d; clearing counters\n", > @@ -1004,8 +1060,10 @@ osm_perfmgr_check_overflow(osm_perfmgr_t * pm, __monitored_node_t *mon_node, > mad_context.perfmgr_context.mad_method = IB_MAD_METHOD_SET; > /* clear port counters */ > status = > - osm_perfmgr_send_pc_mad(pm, lid, remote_qp, port, > - IB_MAD_METHOD_SET, &mad_context); > + osm_perfmgr_send_pc_mad(pm, lid, remote_qp, > + mon_node->redir_port[port].redir_pkey_ix, > + port, IB_MAD_METHOD_SET, > + &mad_context); > if (status != IB_SUCCESS) > OSM_LOG(pm->log, OSM_LOG_ERROR, "PerfMgr: ERR 4C11: " > "Failed to send clear counters MAD for %s (0x%" > @@ -1063,6 +1121,73 @@ osm_perfmgr_log_events(osm_perfmgr_t * pm, __monitored_node_t *mon_node, uint8_t > time_diff, mon_node->name, mon_node->guid, port); > } > > +static boolean_t validate_redir_pkey(osm_perfmgr_t *pm, ib_net16_t pkey, > + uint16_t *pkey_ix) > +{ This function can just return pkey index value (or negative in case of failure). 
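
A minimal sketch of that suggestion, assuming the local P_Key table is available as a plain array and that all values are compared in the same byte order (as the patch does). find_pkey_index() is a hypothetical name used only for illustration; it is not an existing OpenSM function:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of OpenSM): search a P_Key table for an
 * exact match. Returns the index (>= 0) when found, -1 otherwise, so the
 * caller needs neither a boolean result nor an output parameter.
 */
int find_pkey_index(const uint16_t *pkey_table, size_t num_pkeys,
		    uint16_t pkey)
{
	size_t i;

	if (!pkey_table)
		return -1;
	for (i = 0; i < num_pkeys; i++)
		if (pkey_table[i] == pkey)
			return (int)i;
	return -1;
}
```

A caller would then test for a negative return value instead of checking a separate status flag.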
> + ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS]; > + uint32_t num_ports = MAX_LOCAL_IBPORTS; > + ib_api_status_t status; > + boolean_t sts = FALSE; > + uint16_t i = 0; > + > + OSM_LOG_ENTER(pm->log); > + > + for (i = 0; i < num_ports; i++) { > + attr_array[i].num_pkeys = 0; > + attr_array[i].p_pkey_table = NULL; > + } > + > + /* If local port couldn't be determined previously */ > + if (pm->local_port == -1) > + goto not_found; > + > + attr_array[pm->local_port].num_pkeys = MAX_LOCAL_PKEYS; > + attr_array[pm->local_port].p_pkey_table = > + malloc(MAX_LOCAL_PKEYS * sizeof(ib_net16_t)); > + if (!attr_array[pm->local_port].p_pkey_table) { > + OSM_LOG(pm->log, OSM_LOG_ERROR, > + "ERR 4C20: No memory for port %d pkey table\n", > + pm->local_port); > + goto not_found; > + } > + > + /* call the transport layer for a list of local port pkeys */ > + status = osm_vendor_get_all_port_attr(pm->subn->p_osm->p_vendor, > + attr_array, &num_ports); This heavy stuff is performed per redirection request, but it actually uses same data (local port's pkey table). This looks very inefficient. Instead of doing this you can get local port's pkey table only once at PerfMgr process cycle start and do all checks against this already initialized table. Of course only in case when redirection is enabled at all. 
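
Roughly what is meant, as a sketch only: the table is copied once at the start of a cycle and every redirection check is a cheap lookup against that copy. struct pkey_cache, pkey_cache_fill() and pkey_cache_lookup() are hypothetical names, and how the table is obtained from the vendor layer is deliberately left out:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PKEY_CACHE_MAX 256	/* assumed bound, same value as MAX_LOCAL_PKEYS in the patch */

/* Hypothetical per-cycle copy of the local port's P_Key table. */
struct pkey_cache {
	uint16_t pkeys[PKEY_CACHE_MAX];
	size_t num_pkeys;
	int valid;		/* non-zero once filled for the current cycle */
};

/* Fill the cache once per cycle; entries beyond the capacity are dropped. */
void pkey_cache_fill(struct pkey_cache *c, const uint16_t *table, size_t n)
{
	if (!table)
		n = 0;
	if (n > PKEY_CACHE_MAX)
		n = PKEY_CACHE_MAX;
	if (n)
		memcpy(c->pkeys, table, n * sizeof(c->pkeys[0]));
	c->num_pkeys = n;
	c->valid = 1;
}

/* Per-redirection check: index of pkey, or -1 if absent or cache not filled. */
int pkey_cache_lookup(const struct pkey_cache *c, uint16_t pkey)
{
	size_t i;

	if (!c->valid)
		return -1;
	for (i = 0; i < c->num_pkeys; i++)
		if (c->pkeys[i] == pkey)
			return (int)i;
	return -1;
}
```

With this shape the expensive port attribute query happens at most once per cycle, and only when redirection is enabled at all.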
> + if (status != IB_SUCCESS) { > + OSM_LOG(pm->log, OSM_LOG_ERROR, > + "ERR 4C1E: osm_vendor_get_all_port_attr status 0x%x\n", > + status); > + goto not_found; > + } > + if (num_ports == 0 || pm->local_port > num_ports) { > + OSM_LOG(pm->log, OSM_LOG_ERROR, > + "ERR 4C1F: No local ports detected or local port out of range!\n"); > + goto not_found; > + } > + ib_net16_t *pkey_table = attr_array[pm->local_port].p_pkey_table; > + for (i = 0; i < attr_array[pm->local_port].num_pkeys; i++) > + if (pkey_table[i] == pkey) > + break; > + if (i == attr_array[pm->local_port].num_pkeys) { > + i = 0; > + goto not_found; > + } > + free(attr_array[pm->local_port].p_pkey_table); > + sts = TRUE; > + goto Exit; > + > +not_found: > + if (attr_array[pm->local_port].p_pkey_table) > + free(attr_array[pm->local_port].p_pkey_table); > + sts = FALSE; > +Exit: > + if (pkey_ix) > + *pkey_ix = i; > + OSM_LOG_EXIT(pm->log); > + return sts; > +} > + > /********************************************************************** > * The dispatcher uses a thread pool which will call this function when > * we have a thread available to process our mad received from the wire. 
> @@ -1082,6 +1207,8 @@ static void osm_pc_rcv_process(void *context, void *data) > perfmgr_db_data_cnt_reading_t data_reading; > cl_map_item_t *p_node; > __monitored_node_t *p_mon_node; > + uint16_t pkey_ix; > + boolean_t invalid = FALSE; > > OSM_LOG_ENTER(pm->log); > > @@ -1105,7 +1232,8 @@ static void osm_pc_rcv_process(void *context, void *data) > p_mad->attr_id == IB_MAD_ATTR_CLASS_PORT_INFO); > > /* Response could also be redirection (IBM eHCA PMA does this) */ > - if (p_mad->attr_id == IB_MAD_ATTR_CLASS_PORT_INFO) { > + if (p_mad->status & IB_MAD_STATUS_REDIRECT && > + p_mad->attr_id == IB_MAD_ATTR_CLASS_PORT_INFO) { > char gid_str[INET6_ADDRSTRLEN]; > ib_class_port_info_t *cpi = > (ib_class_port_info_t *) & > @@ -1119,17 +1247,48 @@ static void osm_pc_rcv_process(void *context, void *data) > sizeof gid_str), > cl_ntoh32(cpi->redir_qp)); > > - /* LID or GID redirection ? */ > - /* For GID redirection, need to get PathRecord from SA */ > + /* valid redirection ? */ > if (cpi->redir_lid == 0) { > - OSM_LOG(pm->log, OSM_LOG_VERBOSE, > - "GID redirection not currently implemented!\n"); > - goto Exit; > + if (!ib_gid_is_notzero(&cpi->redir_gid)) { > + OSM_LOG(pm->log, OSM_LOG_ERROR, > + "ERR 4C17: Invalid redirection " > + "(both redirect LID and GID are zero)\n"); > + invalid = TRUE; > + } > + } > + if (cpi->redir_qp == 0) { > + OSM_LOG(pm->log, OSM_LOG_ERROR, > + "ERR 4C18: Invalid RedirectQP\n"); > + invalid = TRUE; > + } > + if (cpi->redir_pkey == 0) { > + OSM_LOG(pm->log, OSM_LOG_ERROR, > + "ERR 4C19: Invalid RedirectP_Key\n"); > + invalid = TRUE; > + } > + if (cpi->redir_qkey != IB_QP1_WELL_KNOWN_Q_KEY) { > + OSM_LOG(pm->log, OSM_LOG_ERROR, > + "ERR 4C1A: Invalid RedirectQ_Key\n"); > + invalid = TRUE; > + } > + > + if (!validate_redir_pkey(pm, cpi->redir_pkey, &pkey_ix)) { > + OSM_LOG(pm->log, OSM_LOG_ERROR, > + "ERR 4C1B: Index for Pkey 0x%x not found\n", > + cl_ntoh16(cpi->redir_pkey)); > + invalid = TRUE; > } All above are not OpenSM errors, but wrong 
external data. I think it should be logged as VERBOSE messages. > > if (!pm->subn->opt.perfmgr_redir) { > - OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C16: " > - "redirection requested but disabled\n"); > + OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C16: " > + "redirection requested but disabled\n"); > + invalid = TRUE; > + } This is not an error. BTW, why to bother with verifying redirection info when redirection support is disabled anyway? > + > + if (cpi->redir_lid == 0) { > + /* GID redirection: get PathRecord information */ > + OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C21: " > + "GID redirection not currently supported\n"); > goto Exit; > } > > @@ -1144,14 +1303,23 @@ static void osm_pc_rcv_process(void *context, void *data) > p_mon_node->redir_tbl_size); > goto Exit; > } > + p_mon_node->redir_port[port].redirection = TRUE; > + p_mon_node->redir_port[port].invalid = invalid; > + memcpy(&p_mon_node->redir_port[port].redir_gid, > + &cpi->redir_gid, sizeof(ib_gid_t)); > p_mon_node->redir_port[port].redir_lid = cpi->redir_lid; > p_mon_node->redir_port[port].redir_qp = cpi->redir_qp; > + p_mon_node->redir_port[port].redir_pkey = cpi->redir_pkey; > + p_mon_node->redir_port[port].redir_pkey_ix = pkey_ix; > cl_plock_release(pm->lock); > > + if (invalid) > + goto Exit; > + > /* Finally, reissue the query to the redirected location */ > status = > osm_perfmgr_send_pc_mad(pm, cpi->redir_lid, cpi->redir_qp, > - port, > + pkey_ix, port, > mad_context->perfmgr_context. > mad_method, mad_context); > if (status != IB_SUCCESS) > @@ -1234,6 +1402,7 @@ osm_perfmgr_init(osm_perfmgr_t * const pm, osm_opensm_t *osm, > pm->sweep_time_s = p_opt->perfmgr_sweep_time_s; > pm->max_outstanding_queries = p_opt->perfmgr_max_outstanding_queries; > pm->osm = osm; > + pm->local_port = -1; > > status = cl_timer_init(&pm->sweep_timer, perfmgr_sweep, pm); > if (status != IB_SUCCESS) In general I would suggest to not mix redirection case with main flow (by using better data structures). 
The Redirection is not something PerfMgr specific and ideally we could have separate redirection handling module. I'm not requesting to do this now (in this implementation), but at least flow separation is very desirable. Sasha From hnrose at comcast.net Wed Apr 15 06:05:18 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 15 Apr 2009 09:05:18 -0400 Subject: [ofa-general] ***SPAM*** [PATCH] infiniband-diags/ibsendtrap.c: Local link integrity is an "urgent" trap Message-ID: <20090415130518.GA23981@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c index 51f2327..d0afca0 100644 --- a/infiniband-diags/src/ibsendtrap.c +++ b/infiniband-diags/src/ibsendtrap.c @@ -67,7 +67,7 @@ static void build_trap144(ib_mad_notice_attr_t * n, uint16_t lid) static void build_trap129(ib_mad_notice_attr_t * n, uint16_t lid) { - n->generic_type = 0x80 | IB_NOTICE_TYPE_INFO; + n->generic_type = 0x80 | IB_NOTICE_TYPE_URGENT; n->g_or_v.generic.prod_type_lsb = cl_hton16(IB_NODE_TYPE_CA); n->g_or_v.generic.trap_num = cl_hton16(129); n->issuer_lid = cl_hton16(lid); From hnrose at comcast.net Wed Apr 15 06:06:08 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 15 Apr 2009 09:06:08 -0400 Subject: [ofa-general] ***SPAM*** [PATCH][TRIVIAL] opensm/osm_sa.c: Cosmetic change to a few log messages Message-ID: <20090415130608.GB23981@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_sa.c b/opensm/opensm/osm_sa.c index 50c4d45..3521132 100644 --- a/opensm/opensm/osm_sa.c +++ b/opensm/opensm/osm_sa.c @@ -709,7 +709,7 @@ static void sa_dump_all_sa(osm_opensm_t * p_osm, FILE * file) dump_context.p_osm = p_osm; dump_context.file = file; - OSM_LOG(&p_osm->log, OSM_LOG_DEBUG, "Dump multicast:\n"); + OSM_LOG(&p_osm->log, OSM_LOG_DEBUG, "Dump multicast\n"); cl_plock_acquire(&p_osm->lock); for (i = 0; i <= p_osm->subn.max_mcast_lid_ho - IB_LID_MCAST_START_HO; i++) { @@ -717,10 +717,10 
@@ static void sa_dump_all_sa(osm_opensm_t * p_osm, FILE * file)
 		if (p_mgrp)
 			sa_dump_one_mgrp(p_mgrp, &dump_context);
 	}
-	OSM_LOG(&p_osm->log, OSM_LOG_DEBUG, "Dump inform:\n");
+	OSM_LOG(&p_osm->log, OSM_LOG_DEBUG, "Dump inform\n");
 	cl_qlist_apply_func(&p_osm->subn.sa_infr_list, sa_dump_one_inform, &dump_context);
-	OSM_LOG(&p_osm->log, OSM_LOG_DEBUG, "Dump services:\n");
+	OSM_LOG(&p_osm->log, OSM_LOG_DEBUG, "Dump services\n");
 	cl_qlist_apply_func(&p_osm->subn.sa_sr_list, sa_dump_one_service, &dump_context);
 	cl_plock_release(&p_osm->lock);

From sashak at voltaire.com Wed Apr 15 06:04:54 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 15 Apr 2009 16:04:54 +0300
Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** some problem about the forward tables in up*/down* algorithm
In-Reply-To: <91fe68d50904141939s5da57c7bie6cdef8b929f6e21@mail.gmail.com>
References: <91fe68d50904141939s5da57c7bie6cdef8b929f6e21@mail.gmail.com>
Message-ID: <20090415130454.GG7353@sk>

On 10:39 Wed 15 Apr , Jordan wrote:
> Recently I have read the source code of the up*/down* routing algorithm. It
> seems that this algorithm only updates the
> hops[lid_no][port], does not update the lft (linear forward tables). So ,
> how does the switch forward the packet ?

The routing engine in OpenSM has two methods - build_lid_matrices() and
ucast_build_fwd_tables(). build_lid_matrices() generates the min hops tables
(lid matrices), ucast_build_fwd_tables() creates the LFTs. You looked only at
the build_lid_matrices() implementation.

> Does the switch look up the
> hops[lid_no][port] to forward the packet?

No.

> Another problem is that there are two arrays , lft and new_lft. I don't know
> the difference between these two tables, can anyone tell me ?

new_lft is the LFT as generated by OpenSM; lft keeps the real LFT state as it
was received in LFT block set responses.
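
To illustrate the two-step split described above (this is a simplified sketch, not the actual ucast_build_fwd_tables() code, which among other things also balances load across equal-cost ports): given a min hops table, a forwarding table can be derived by picking, per destination LID, the port with the fewest hops. build_lft_from_hops(), the flattened table layout and the NO_PATH marker are assumptions made for this example:

```c
#include <stdint.h>

#define NO_PATH 0xff	/* assumed marker: "unreachable" in hops, "no port" in lft */

/*
 * Hypothetical sketch: hops is a flattened [num_lids][num_ports] table,
 * hops[lid * num_ports + port] being the hop count to lid via port.
 * For each lid, lft[lid] is set to the port with the smallest hop count
 * (lowest port number wins ties), or NO_PATH if no port reaches it.
 */
void build_lft_from_hops(const uint8_t *hops, unsigned num_lids,
			 unsigned num_ports, uint8_t *lft)
{
	unsigned lid, port;

	for (lid = 0; lid < num_lids; lid++) {
		uint8_t best_hops = NO_PATH;
		uint8_t best_port = NO_PATH;

		for (port = 0; port < num_ports; port++) {
			uint8_t h = hops[lid * num_ports + port];

			if (h < best_hops) {
				best_hops = h;
				best_port = (uint8_t)port;
			}
		}
		lft[lid] = best_port;
	}
}
```

The switch itself only ever consults the resulting table (one output port per LID); the hops table stays inside the subnet manager.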
Sasha From tmtalpey at gmail.com Wed Apr 15 06:15:19 2009 From: tmtalpey at gmail.com (Tom Talpey) Date: Wed, 15 Apr 2009 09:15:19 -0400 Subject: [ofa-general] ***SPAM*** SPAM eggs SPAM bacon and SPAM Message-ID: <49e5ddfa.0136640a.2bf4.0fc7@mx.google.com> The openfabrics mail server is flagging every message from domains such as gmail.com and yahoo.com with ***SPAM***, and as a result every message on my screen this morning was advertising the lovely canned meat. Going back a few days and 100 messages, in fact, only 30 of them *weren't* decorated with SPAM. This has to stop. Tom. From hal.rosenstock at gmail.com Wed Apr 15 06:17:19 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 15 Apr 2009 09:17:19 -0400 Subject: [ofa-general] ***SPAM*** Regarding QOS from the HCA side in OFED In-Reply-To: <89c400ad0904150556n2854b2e3w30344a825edda67e@mail.gmail.com> References: <89c400ad0904150556n2854b2e3w30344a825edda67e@mail.gmail.com> Message-ID: On Wed, Apr 15, 2009 at 8:56 AM, Krishnakumar R wrote: > Hi Folks, > > Is there possibility of marking data sent out from an HCA to fall into > certain service level, ie. does OFED currently > support features to the extend that one can setup rules on the host > side to ensure that traffic from a > certain application will be considered to fall into certain SL (given > that global policy part is setup with the opensm features already > available) ? If the ULP uses the SL returned in the PathRecord (or in the case of IPoIB it can use MCMemberRecord for some things), then this will be done. I think RDMA CM based apps get this for "free". IPoIB supports SL. I'm not sure about SRP and SDP. Is this standard ULP or custom based app ? -- Hal > -- > Thanks and Regards, > KK. 
> _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From hal.rosenstock at gmail.com Wed Apr 15 06:19:06 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 15 Apr 2009 09:19:06 -0400 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** SPAM eggs SPAM bacon and SPAM In-Reply-To: <49e5ddfa.0136640a.2bf4.0fc7@mx.google.com> References: <49e5ddfa.0136640a.2bf4.0fc7@mx.google.com> Message-ID: On Wed, Apr 15, 2009 at 9:15 AM, Tom Talpey wrote: > The openfabrics mail server is flagging every message from domains > such as gmail.com and yahoo.com with ***SPAM***, Some other domains as well :-( > and as a result > every message on my screen this morning was advertising the lovely > canned meat. > > Going back a few days and 100 messages, in fact, only 30 of them > *weren't* decorated with SPAM. > > This has to stop. FWIW I'd also like to see this stop. -- Hal > > Tom. > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Wed Apr 15 06:17:46 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 15 Apr 2009 16:17:46 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] infiniband-diags/ibsendtrap.c: Local link integrity is an "urgent" trap In-Reply-To: <20090415130518.GA23981@comcast.net> References: <20090415130518.GA23981@comcast.net> Message-ID: <20090415131746.GH7353@sk> On 09:05 Wed 15 Apr , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Makes sense. Applied. Thanks. 
Sasha

From sashak at voltaire.com Wed Apr 15 06:19:40 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 15 Apr 2009 16:19:40 +0300
Subject: [ofa-general] ***SPAM*** Re: [PATCH][TRIVIAL] opensm/osm_sa.c: Cosmetic change to a few log messages
In-Reply-To: <20090415130608.GB23981@comcast.net>
References: <20090415130608.GB23981@comcast.net>
Message-ID: <20090415131940.GI7353@sk>

On 09:06 Wed 15 Apr , Hal Rosenstock wrote:
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.

Sasha

From sashak at voltaire.com Wed Apr 15 06:40:33 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 15 Apr 2009 16:40:33 +0300
Subject: [ofa-general] ***SPAM*** Re: [PATCH] opensm: Add Dell to known vendor list
In-Reply-To: <20090414185319.GA4413@comcast.net>
References: <20090414185319.GA4413@comcast.net>
Message-ID: <20090415134033.GK7353@sk>

On 14:53 Tue 14 Apr , Hal Rosenstock wrote:
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.

Sasha

From sashak at voltaire.com Wed Apr 15 07:12:21 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 15 Apr 2009 17:12:21 +0300
Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Re: [PATCHv3] libibmad/rpc.c: Handle redirection status
In-Reply-To:
References: <20090311113059.GB12004@comcast.net> <20090414135600.GD5519@sk> <20090414142002.GF5519@sk> <20090414143254.GH5519@sk>
Message-ID: <20090415141221.GM7353@sk>

On 13:12 Tue 14 Apr , Hal Rosenstock wrote:
>
> Huh ? What we agreed on was there was no known use right now

Yes, this is what I meant.

> but that
> doesn't mean a compliance test couldn't detect this. That's what
> compliance means to me.

In this patch it is not "for free", but comes at the cost of slowing down the
fast execution path. I prefer to keep it as it is now (and of course to
remember the hypothetical issue).

And the status string message would be better handled generically. Such a
solution is cleaner, not much harder, and should be beneficial and useful in
general (and for redirection cases too).
Sasha From hnrose at comcast.net Wed Apr 15 07:29:56 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 15 Apr 2009 10:29:56 -0400 Subject: [ofa-general] ***SPAM*** [PATCH] opensm: Improve some snprintf uses Message-ID: <20090415142956.GA28988@comcast.net> Use sizeof rather than hardcoded constant in case buffer size changes in future Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c index 00264e5..d351261 100644 --- a/opensm/opensm/osm_console.c +++ b/opensm/opensm/osm_console.c @@ -1426,14 +1426,17 @@ int osm_console(osm_opensm_t * p_osm) if (inet_ntop (AF_INET, &sin.sin_addr, p_oct->client_ip, sizeof(p_oct->client_ip)) == NULL) { - snprintf(p_oct->client_ip, 64, "STRING_UNKNOWN"); + snprintf(p_oct->client_ip, sizeof(p_oct->client_ip), + "STRING_UNKNOWN"); } if ((hent = gethostbyaddr((const char *)&sin.sin_addr, sizeof(struct in_addr), AF_INET)) == NULL) { - snprintf(p_oct->client_hn, 128, "STRING_UNKNOWN"); + snprintf(p_oct->client_hn, sizeof(p_oct->client_hn), + "STRING_UNKNOWN"); } else { - snprintf(p_oct->client_hn, 128, "%s", hent->h_name); + snprintf(p_oct->client_hn, sizeof(p_oct->client_hn), + "%s", hent->h_name); } if (is_authorized(p_oct)) { cio_open(p_oct, new_fd, &p_osm->log); diff --git a/opensm/opensm/osm_event_plugin.c b/opensm/opensm/osm_event_plugin.c index b0dc549..c77494e 100644 --- a/opensm/opensm/osm_event_plugin.c +++ b/opensm/opensm/osm_event_plugin.c @@ -73,7 +73,7 @@ osm_epi_plugin_t *osm_epi_construct(osm_opensm_t *osm, char *plugin_name) return (NULL); /* find the plugin */ - snprintf(lib_name, OSM_PATH_MAX, "lib%s.so", plugin_name); + snprintf(lib_name, sizeof(lib_name), "lib%s.so", plugin_name); rc = malloc(sizeof(*rc)); if (!rc) diff --git a/opensm/opensm/osm_perfmgr_db.c b/opensm/opensm/osm_perfmgr_db.c index b0b2e4a..8be0b6f 100644 --- a/opensm/opensm/osm_perfmgr_db.c +++ b/opensm/opensm/osm_perfmgr_db.c @@ -120,7 +120,7 @@ static _db_node_t *__malloc_node(uint64_t 
guid, boolean_t esp0, rc->ports[i].err_previous.time = cur_time; rc->ports[i].dc_previous.time = cur_time; } - snprintf(rc->node_name, NODE_NAME_SIZE, "%s", name); + snprintf(rc->node_name, sizeof(rc->node_name), "%s", name); return (rc); From sashak at voltaire.com Wed Apr 15 08:30:03 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 15 Apr 2009 18:30:03 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] Update mad formatting functions. In-Reply-To: <20090311144404.bf15ba8b.weiny2@llnl.gov> References: <20090311144404.bf15ba8b.weiny2@llnl.gov> Message-ID: <20090415153003.GA20857@sk> Hi Ira, On 14:44 Wed 11 Mar , Ira Weiny wrote: > > From: Ira Weiny > Date: Wed, 11 Mar 2009 10:45:25 -0700 > Subject: [PATCH] Update mad formatting functions. > > Add mad_snprintf w/ man page > Add mad_fprintf w/ man page > Add comments to document current functions. > Rename parameters to avoid confusion with other functions which take > "buf" > Mark mad_print_field as deprecated > > Signed-off-by: Ira Weiny Nice stuff! I have some implementation comments below. 
> --- > libibmad/Makefile.am | 2 + > libibmad/include/infiniband/mad.h | 28 +++- > libibmad/man/mad_fprintf.3 | 82 ++++++++++ > libibmad/man/mad_snprintf.3 | 2 + > libibmad/src/fields.c | 319 ++++++++++++++++++++++++++++++++++++- > libibmad/src/libibmad.map | 2 + > 6 files changed, 428 insertions(+), 7 deletions(-) > create mode 100644 libibmad/man/mad_fprintf.3 > create mode 100644 libibmad/man/mad_snprintf.3 > > diff --git a/libibmad/Makefile.am b/libibmad/Makefile.am > index 4f3ba98..da32899 100644 > --- a/libibmad/Makefile.am > +++ b/libibmad/Makefile.am > @@ -5,6 +5,8 @@ INCLUDES = -I$(srcdir)/include -I$(includedir) > > lib_LTLIBRARIES = libibmad.la > > +man_MANS = man/mad_fprintf.3 man/mad_snprintf.3 > + > libibmad_la_CFLAGS = -Wall > > if HAVE_LD_VERSION_SCRIPT > diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h > index 064cbb7..897b91b 100644 > --- a/libibmad/include/infiniband/mad.h > +++ b/libibmad/include/infiniband/mad.h > @@ -719,9 +719,31 @@ MAD_EXPORT void mad_set_array(void *buf, int base_offs, enum MAD_FIELDS field, v > MAD_EXPORT void mad_get_array(void *buf, int base_offs, enum MAD_FIELDS field, void *val); > MAD_EXPORT void mad_decode_field(uint8_t * buf, enum MAD_FIELDS field, void *val); > MAD_EXPORT void mad_encode_field(uint8_t * buf, enum MAD_FIELDS field, void *val); > -MAD_EXPORT int mad_print_field(enum MAD_FIELDS field, const char *name, void *val); > -MAD_EXPORT char *mad_dump_field(enum MAD_FIELDS field, char *buf, int bufsz, void *val); > -MAD_EXPORT char *mad_dump_val(enum MAD_FIELDS field, char *buf, int bufsz, void *val); > +MAD_EXPORT int mad_print_field(enum MAD_FIELDS field, const char *name, void *val) > + DEPRECATED; > + > +/** > + * The following functions print fields to "s" in various ways > + * > + * mad_dump_[val|field] take a value "val" and use "field" to format it > + * > + * mad_snprint_field takes a data buffer "buf" and uses field to extract and > + * format it. 
> + * > + * RETURN "s" or NULL on failure > + */ > +MAD_EXPORT char *mad_dump_field(enum MAD_FIELDS field, char *s, int n, void *val); > + /* outputs string ":........" */ > +MAD_EXPORT char *mad_dump_val(enum MAD_FIELDS field, char *s, int n, void *val); > + /* outputs string "" */ > + > +/** > + * printf functions > + * input's "standard" printf parameters except for "buf" which is a mad buffer > + * return the number of actual chars written to "s" or "stream" > + */ > +MAD_EXPORT int mad_snprintf(char *s, size_t n, uint8_t *buf, const char *format, ...); > +MAD_EXPORT int mad_fprintf(FILE *stream, uint8_t *buf, const char *format, ...); > > /* mad.c */ > MAD_EXPORT void *mad_encode(void *buf, ib_rpc_t * rpc, ib_dr_path_t * drpath, > diff --git a/libibmad/man/mad_fprintf.3 b/libibmad/man/mad_fprintf.3 > new file mode 100644 > index 0000000..e69bd44 > --- /dev/null > +++ b/libibmad/man/mad_fprintf.3 > @@ -0,0 +1,82 @@ > +.\" -*- nroff -*- > +.\" > +.TH MAD_FPRINTF 3 "Feb 26, 2009" "OpenIB" "OpenIB Programmer\'s Manual" > +.SH "NAME" > +mad_fprintf, mad_snprintf \- formatted output conversion for mad packets > +.SH "SYNOPSIS" > +.nf > +.B #include > +.sp > +.BI "MAD_EXPORT int mad_snprintf(char " "*s" ", size_t "n ", uint8_t " "*buf" ", const char " "*format" ", ...); > +.BI "MAD_EXPORT int mad_fprintf(FILE " "*stream" ", uint8_t " "*buf" ", const char " "*format" ", ...); > +.fi > +.SH "DESCRIPTION" > +Similar to the printf family of functions. The exception being they do > +.B not > +accept all conversion specifiers and they accept a "buf" parameter which > +represents a mad data buffer. This buffer is used to extract and print fields > +as specified with the > +.B %F > +format specifier. > +.PP > +The following conversion specifiers are > +.B not > +supported. > +.B e, E, f, g, G, a, A, C, S, m > +and > +.B n > +.PP > +.B F > +The %F specifier is used to print out fields decoded from the "buf" data > +buffer. ib_mad_f table also has a name string field. 
I think it can be useful too - will help to unify outputs. Of course this can be done as subsequent patch. > +.I enum MAD_FIELDS\fR > +values should be used to specify the field to be decoded. > +.PP > +.SH "EXAMPLES" > +.nf > +char portinfo[64]; > +void *pi = portinfo; > +.PP > +if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout)) > +.in +8 > + return -1; > +.in -16 > +.PP > +mad_fprintf(stdout, pi, "Port info (%s):\\n" > +.in +16 > +" %-10s: %F\\n" > +" %-10s: %F\\n" > +" %-10s: %F\\n" > +" %-10s: %F\\n" > +" %-10s: %F\\n" > +" %-10s: %F\\n", > +portid2str(portid), > +"LID", IB_PORT_LID_F, > +"LMC", IB_PORT_LMC_F, > +"state", IB_PORT_STATE_F, > +"physstate", IB_PORT_PHYS_STATE_F, > +"linkwidth", IB_PORT_LINK_WIDTH_ACTIVE_F, > +"linkspeed", IB_PORT_LINK_SPEED_ACTIVE_F > +); > +.in -16 > +.PP > +Results in the output. > +.PP > +Port info (DR path slid 0; dlid 0; 0,1,14): > +.in +3 > +LID : 0x0016 Lids are printed as decimals. > +LMC : 0 > +state : Active > +physstate : LinkUp > +linkwidth : 4X > +linkspeed : 5.0 Gbps > +.in -3 > + > +.SH "RETURN VALUE" > +.B return the number of characters printed. > + > +.SH "SEE ALSO" > +.BR printf (3) > +.SH "AUTHOR" > +.TP > +Ira Weiny > diff --git a/libibmad/man/mad_snprintf.3 b/libibmad/man/mad_snprintf.3 > new file mode 100644 > index 0000000..c004ab9 > --- /dev/null > +++ b/libibmad/man/mad_snprintf.3 > @@ -0,0 +1,2 @@ > +.TH MAD_SNPRINTF 3 "Feb 26, 2009" "OpenIB" "OpenIB Programmer\'s Manual" > +.so man3/mad_fprintf.3 > diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c > index 19c8fc1..acb1180 100644 > --- a/libibmad/src/fields.c > +++ b/libibmad/src/fields.c > @@ -38,7 +38,9 @@ > > #include > #include > +#define _GNU_SOURCE Where is _GNU_SOURCE really used (I didn't find)? 
> #include > +#include > > #include > > @@ -442,6 +444,9 @@ static const ib_field_t ib_mad_f[] = { > > }; > > +#define MAD_FIELD_MAX_BYTE_LEN (256) > + /* currently "Vendor2Data" increased to the next power of 2 */ > + > static void _set_field64(void *buf, int base_offs, const ib_field_t * f, > uint64_t val) > { > @@ -666,6 +671,7 @@ static int _mad_print_field(const ib_field_t * f, const char *name, void *val, > valsz ? valsz : ALIGN(f->bitlen, 8) / 8); > } > > +/* This function is deprecated use mad_snprint_field or mad_dump_* instead */ > int mad_print_field(enum MAD_FIELDS field, const char *name, void *val) > { > if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) > @@ -673,16 +679,321 @@ int mad_print_field(enum MAD_FIELDS field, const char *name, void *val) > return _mad_print_field(ib_mad_f + field, name, val, 0); > } > > -char *mad_dump_field(enum MAD_FIELDS field, char *buf, int bufsz, void *val) > +char *mad_dump_field(enum MAD_FIELDS field, char *s, int n, void *val) > { > if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) > return 0; > - return _mad_dump_field(ib_mad_f + field, 0, buf, bufsz, val); > + return _mad_dump_field(ib_mad_f + field, 0, s, n, val); > } > > -char *mad_dump_val(enum MAD_FIELDS field, char *buf, int bufsz, void *val) > +char *mad_dump_val(enum MAD_FIELDS field, char *s, int n, void *val) > { > if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) > return 0; > - return _mad_dump_val(ib_mad_f + field, buf, bufsz, val); > + return _mad_dump_val(ib_mad_f + field, s, n, val); > } > + > +#define ZEROPAD 1 /* pad with zero */ > +#define SIGN 2 /* unsigned/signed long */ > +#define PLUS 4 /* show plus */ > +#define SPACE 8 /* space if plus */ > +#define LEFT 16 /* left justified */ > +#define SPECIAL 32 /* 0x */ > +#define LARGE 64 /* use 'ABCDEF' instead of 'abcdef' */ > + > +static char * number(char * str, size_t n, int *rc, > + unsigned long long num, int base, int size, > + int precision, int type) > +{ > + char c,sign,tmp[66]; > 
+ const char *digits="0123456789abcdefghijklmnopqrstuvwxyz"; > + int i; > + > +/* Macro allows for checking length > + * remove 1 to allow for \0 char */ > +#define WRITE_CHAR_RET(c) do { \ > + *str++ = c; \ > + if (++(*rc) >= (n-1)) { \ > + return (str); \ > + } \ > +} while(0) > + > + if (type & LARGE) > + digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"; > + if (type & LEFT) > + type &= ~ZEROPAD; > + if (base < 2 || base > 36) > + return 0; > + c = (type & ZEROPAD) ? '0' : ' '; > + sign = 0; > + if (type & SIGN) { > + if ((signed long long)num < 0) { > + sign = '-'; > + num = - (signed long long)num; > + size--; > + } else if (type & PLUS) { > + sign = '+'; > + size--; > + } else if (type & SPACE) { > + sign = ' '; > + size--; > + } > + } > + if (type & SPECIAL) { > + if (base == 16) > + size -= 2; > + else if (base == 8) > + size--; > + } > + i = 0; > + if (num == 0) > + tmp[i++]='0'; > + else { > + while (num >= base) { > + tmp[i++] = digits[num % base]; > + num /= base; > + } > + tmp[i++] = digits[num]; > + } > + if (i > precision) > + precision = i; > + size -= precision; > + if (!(type&(ZEROPAD+LEFT))) > + while(size-->0) > + //*str++ = ' '; > + WRITE_CHAR_RET(' '); > + if (sign) > + //*str++ = sign; > + WRITE_CHAR_RET(sign); > + if (type & SPECIAL) { > + if (base==8) > + WRITE_CHAR_RET('0'); > + else if (base==16) { > + WRITE_CHAR_RET('0'); > + WRITE_CHAR_RET(digits[33]); > + } > + } > + if (!(type & LEFT)) > + while (size-- > 0) > + WRITE_CHAR_RET(c); > + while (i < precision--) > + WRITE_CHAR_RET('0'); > + while (i-- > 0) > + WRITE_CHAR_RET(tmp[i]); > + while (size-- > 0) > + WRITE_CHAR_RET(' '); > + return str; > +} > + > +static int mad_vsnprintf(char *s, size_t n, void *buf, const char *fmt, va_list args) > +{ > + int rc = 0; > + int len; > + unsigned long long num; > + int i, base; > + char *str; > + > +/* Macros allows for bounding length of print to provided buffer > + * remove 1 to allow for \0 char */ > +#define WRITE_CHAR(c) do { \ > + *str++ = 
c; \ > + if (++rc >= (n-1)) { \ > + goto max_len_hit; \ > + } \ > +} while(0) > +#define WRITE_STR(STR) do { \ > + const char *ls = STR; \ > + len = strlen(ls); \ > + if (precision > 0 && len > precision) \ > + len = precision; \ > + if (!(flags & LEFT)) \ > + while (len < field_width--) \ > + WRITE_CHAR(' '); \ > + for (i = 0; i < len; ++i) \ > + WRITE_CHAR(*ls++); \ > + while (len < field_width--) \ > + WRITE_CHAR(' '); \ > +} while (0); > + > + int flags; > + int field_width; > + int precision; > + int qualifier; > + > + for (str=s ; *fmt ; ++fmt) { > + if (*fmt != '%') { > + //*str++ = *fmt; > + WRITE_CHAR(*fmt); > + continue; > + } > + > + /* process flags */ > + flags = 0; > +repeat: > + ++fmt; /* this also skips first '%' */ > + switch (*fmt) { > + case '-': flags |= LEFT; goto repeat; > + case '+': flags |= PLUS; goto repeat; > + case ' ': flags |= SPACE; goto repeat; > + case '#': flags |= SPECIAL; goto repeat; > + case '0': flags |= ZEROPAD; goto repeat; > + } > + > + /* get field width */ > + field_width = -1; > + if ('0' <= *fmt && *fmt <= '9') { > + int c = 0; > + for (field_width = 0; '0' <= (c = *fmt) && c <= '9'; ++fmt) > + field_width = field_width*10 + c - '0'; > + } else if (*fmt == '*') { > + ++fmt; > + /* it's the next argument */ > + field_width = va_arg(args, int); > + if (field_width < 0) { > + field_width = -field_width; > + flags |= LEFT; > + } > + } > + > + /* get the precision */ > + precision = -1; > + if (*fmt == '.') { > + ++fmt; > + if ('0' <= *fmt && *fmt <= '9') { > + int c = 0; > + for (precision = 0; '0' <= (c = *fmt) && c <= '9'; ++fmt) > + precision = precision*10 + c - '0'; > + } else if (*fmt == '*') { > + ++fmt; > + /* it's the next argument */ > + precision = va_arg(args, int); > + } > + if (precision < 0) > + precision = 0; > + } > + > + /* get the conversion qualifier */ > + qualifier = -1; > + if (*fmt == 'h' || *fmt == 'l' || *fmt == 'L' || *fmt =='z') { > + qualifier = *fmt; > + ++fmt; > + } > + > + /* default base */ 
> + base = 10; > + > + switch (*fmt) { > + case 'F': > + { > + char s[256]; > + uint8_t val[MAD_FIELD_MAX_BYTE_LEN]; > + int field = va_arg(args, int); > + const ib_field_t *f = ib_mad_f + field; > + > + if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) > + continue; > + > + mad_decode_field(buf, field, val); > + f->def_dump_fn(s, n, val, ALIGN(f->bitlen, 8) / 8); > + WRITE_STR(s); > + continue; > + } > + case 'c': > + if (!(flags & LEFT)) > + while (--field_width > 0) > + WRITE_CHAR(' '); > + WRITE_CHAR((unsigned char) va_arg(args, int)); > + while (--field_width > 0) > + WRITE_CHAR(' '); > + continue; > + > + case 's': > + { > + const char *string = va_arg(args, char *); > + if (!string) > + string = ""; > + > + WRITE_STR(string); > + continue; > + } > + > + case '%': > + WRITE_CHAR('%'); > + continue; > + > + /* integer number formats - set up the flags and "break" */ > + case 'o': > + base = 8; > + break; > + > + case 'p': > + case 'X': > + flags |= LARGE; > + case 'x': > + base = 16; > + break; > + > + case 'd': > + case 'i': > + flags |= SIGN; > + case 'u': > + break; > + > + default: > + WRITE_CHAR('%'); > + if (*fmt) > + WRITE_CHAR(*fmt); > + else > + --fmt; > + continue; > + } > + if (qualifier == 'l') { > + num = va_arg(args, unsigned long); > + if (flags & SIGN) > + num = (signed long) num; > + } else if (qualifier == 'z') { > + num = va_arg(args, size_t); > + } else if (qualifier == 'h') { > + num = (unsigned short) va_arg(args, int); > + if (flags & SIGN) > + num = (signed short) num; > + } else { > + num = va_arg(args, unsigned int); > + if (flags & SIGN) > + num = (signed int) num; > + } > + str = number(str, n-rc, &rc, num, base, field_width, precision, flags); > + if (rc <= 1) > + break; > + } > +max_len_hit: > + *str = '\0'; > + return str-s; > +} Now instead of reimplementing *printf() functions with potential need to follow their extensions/conventions/update/etc wouldn't it be easier (and in long term safer) to just rebuild format string by 
resolving known %X conversions and then to pass it with rest parameters to standard libc's *printf()? In this way we will support all what *printf()s know + our conversions. Sasha > + > +int mad_snprintf(char *s, size_t n, uint8_t *buf, const char *format, ...) > +{ > + va_list args; > + int i; > + > + va_start(args, format); > + i = mad_vsnprintf(s, n, buf, format, args); > + va_end(args); > + return (i); > +} > + > +int mad_fprintf(FILE *stream, uint8_t *buf, const char *format, ...) > +{ > + char str_buf[1024]; > + va_list args; > + int i,j; > + > + va_start(args, format); > + i = mad_vsnprintf(str_buf, 1024, buf, format, args); > + va_end(args); > + j = fprintf(stream, "%s", str_buf); > + if (i != j) > + IBWARN("mad_vsnprintf and fprintf don't match???\n"); > + return (i); > +} > + > diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map > index 0be7a92..2265b12 100644 > --- a/libibmad/src/libibmad.map > +++ b/libibmad/src/libibmad.map > @@ -4,6 +4,8 @@ IBMAD_1.3 { > mad_dump_field; > mad_dump_val; > mad_print_field; > + mad_snprintf; > + mad_fprintf; > mad_dump_array; > mad_dump_bitfield; > mad_dump_hex; > -- > 1.5.4.5 > From sashak at voltaire.com Wed Apr 15 08:34:41 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 15 Apr 2009 18:34:41 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] opensm: Improve some snprintf uses In-Reply-To: <20090415142956.GA28988@comcast.net> References: <20090415142956.GA28988@comcast.net> Message-ID: <20090415153441.GB20857@sk> On 10:29 Wed 15 Apr , Hal Rosenstock wrote: > > Use sizeof rather than hardcoded constant in case buffer size changes in future > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
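The change described in the patch quoted above (sizeof in place of a hardcoded length) can be sketched as follows. This is only an illustration, not the opensm code: `LABEL_LEN` and `format_label` are hypothetical names, and the idiom applies only when the destination is a true array in scope, not a pointer.

```c
#include <stdio.h>

#define LABEL_LEN 16	/* hypothetical buffer size; free to change later */

/*
 * Format "port <n>" into a local array, then copy it to the caller's buffer.
 * Returns the length snprintf() needed for the label (excluding the NUL).
 */
int format_label(unsigned port, char *out, size_t out_len)
{
	char label[LABEL_LEN];
	int n;

	/* sizeof(label) follows the declaration; a literal 16 here would not. */
	n = snprintf(label, sizeof(label), "port %u", port);

	/* "out" is a pointer, so sizeof(out) would be wrong; pass the length. */
	snprintf(out, out_len, "%s", label);
	return n;
}
```

If `LABEL_LEN` changes, the first snprintf() bound changes with it and no call site needs to be touched.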
Sasha

From sashak at voltaire.com Wed Apr 15 08:38:10 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 15 Apr 2009 18:38:10 +0300
Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH] ibsim: Add SMSL support to PortInfo attribute
In-Reply-To:
References: <20090324182510.GA18072@comcast.net> <20090324182213.GF20085@sashak.voltaire.com> <20090412092502.GG7664@sk>
Message-ID: <20090415153810.GC20857@sk>

On 15:28 Tue 14 Apr , Hal Rosenstock wrote:
>
> Perhaps but that seems like a separate cleanup to me.

This solves this and all other related issues and eliminates the need to introduce useless fields. I would propose this way - it shouldn't be much harder.

Sasha

From sashak at voltaire.com Wed Apr 15 09:26:04 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 15 Apr 2009 19:26:04 +0300
Subject: ***SPAM*** Re: [ofa-general] [PATCH v3 0/3] Create a new library libibnetdisc and convert iblinkinfo and ibnetdiscover to that library.
In-Reply-To: <20090403170114.dfebcb38.weiny2@llnl.gov>
References: <20090403154244.a65227b5.weiny2@llnl.gov> <74883F6A2E3C44958EDA28C22519016E@amr.corp.intel.com> <20090403160807.d185979e.weiny2@llnl.gov> <20090403170114.dfebcb38.weiny2@llnl.gov>
Message-ID: <20090415162604.GD20857@sk>

On 17:01 Fri 03 Apr , Ira Weiny wrote:
> >
> > If it's possible, I'd like for Sasha to add these to a branch in his
> > management.git tree until I can setup the windows build and verify that
> > everything compiles. I should only need a few days to do this.
>
> Sounds fine to me,

I pushed this out as 'pq/ibn3' branch (sorry about delay - I missed this thread somehow).
Sasha

From sashak at voltaire.com Wed Apr 15 09:40:12 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 15 Apr 2009 19:40:12 +0300
Subject: [ofa-general] ***SPAM*** Re: [PATCH] Fix ibidsverify.pl to use the correct cache file
In-Reply-To: <20090403160249.f67e29dd.weiny2@llnl.gov>
References: <20090403160249.f67e29dd.weiny2@llnl.gov>
Message-ID: <20090415164012.GA9800@sk>

On 16:02 Fri 03 Apr , Ira Weiny wrote:
> Sasha,
>
> I found this bug when I was testing the libibnetdisc stuff.
>
> This applies to the master.
>
> Ira
>
> From 656ad88a1f3ca6bcd7601b03da1b3822e4091156 Mon Sep 17 00:00:00 2001
> From: Ira Weiny
> Date: Fri, 3 Apr 2009 16:00:46 -0700
> Subject: [PATCH] Fix ibidsverify.pl to use the correct cache file
>
> In addition add the -C and -P options for specifying a different HCA and port
>
> Signed-off-by: Ira Weiny

Applied (processed with perltidy.sh too). Thanks.

Sasha

From weiny2 at llnl.gov Wed Apr 15 14:03:41 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 15 Apr 2009 14:03:41 -0700
Subject: [ofa-general] Re: [PATCH] Update mad formatting functions.
In-Reply-To: <20090415153003.GA20857@sk>
References: <20090311144404.bf15ba8b.weiny2@llnl.gov> <20090415153003.GA20857@sk>
Message-ID: <20090415140341.dd26d8dc.weiny2@llnl.gov>

On Wed, 15 Apr 2009 18:30:03 +0300 Sasha Khapyorsky wrote:
> Hi Ira,
>
> On 14:44 Wed 11 Mar , Ira Weiny wrote:
> >
> > From: Ira Weiny
> > Date: Wed, 11 Mar 2009 10:45:25 -0700
> > Subject: [PATCH] Update mad formatting functions.
> >
> > Add mad_snprintf w/ man page
> > Add mad_fprintf w/ man page
> > Add comments to document current functions.
> > Rename parameters to avoid confusion with other functions which take
> > "buf"
> > Mark mad_print_field as deprecated
> >
> > Signed-off-by: Ira Weiny
>
> Nice stuff! I have some implementation comments below.
> [snip] > > diff --git a/libibmad/man/mad_fprintf.3 b/libibmad/man/mad_fprintf.3 > > new file mode 100644 > > index 0000000..e69bd44 > > --- /dev/null > > +++ b/libibmad/man/mad_fprintf.3 > > @@ -0,0 +1,82 @@ > > +.\" -*- nroff -*- > > +.\" > > +.TH MAD_FPRINTF 3 "Feb 26, 2009" "OpenIB" "OpenIB Programmer\'s Manual" > > +.SH "NAME" > > +mad_fprintf, mad_snprintf \- formatted output conversion for mad packets > > +.SH "SYNOPSIS" > > +.nf > > +.B #include > > +.sp > > +.BI "MAD_EXPORT int mad_snprintf(char " "*s" ", size_t "n ", uint8_t " "*buf" ", const char " "*format" ", ...); > > +.BI "MAD_EXPORT int mad_fprintf(FILE " "*stream" ", uint8_t " "*buf" ", const char " "*format" ", ...); > > +.fi > > +.SH "DESCRIPTION" > > +Similar to the printf family of functions. The exception being they do > > +.B not > > +accept all conversion specifiers and they accept a "buf" parameter which > > +represents a mad data buffer. This buffer is used to extract and print fields > > +as specified with the > > +.B %F > > +format specifier. > > +.PP > > +The following conversion specifiers are > > +.B not > > +supported. > > +.B e, E, f, g, G, a, A, C, S, m > > +and > > +.B n > > +.PP > > +.B F > > +The %F specifier is used to print out fields decoded from the "buf" data > > +buffer. > > ib_mad_f table also has a name string field. I think it can be useful > too - will help to unify outputs. Of course this can be done as > subsequent patch. Yes but I don't think we should force users to use any specific output. If they want to print the "name" of a field that should be a separate specifier __not__ automatic. Is this what you mean? > > > +.I enum MAD_FIELDS\fR > > +values should be used to specify the field to be decoded. 
> > +.PP > > +.SH "EXAMPLES" > > +.nf > > +char portinfo[64]; > > +void *pi = portinfo; > > +.PP > > +if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout)) > > +.in +8 > > + return -1; > > +.in -16 > > +.PP > > +mad_fprintf(stdout, pi, "Port info (%s):\\n" > > +.in +16 > > +" %-10s: %F\\n" > > +" %-10s: %F\\n" > > +" %-10s: %F\\n" > > +" %-10s: %F\\n" > > +" %-10s: %F\\n" > > +" %-10s: %F\\n", > > +portid2str(portid), > > +"LID", IB_PORT_LID_F, > > +"LMC", IB_PORT_LMC_F, > > +"state", IB_PORT_STATE_F, > > +"physstate", IB_PORT_PHYS_STATE_F, > > +"linkwidth", IB_PORT_LINK_WIDTH_ACTIVE_F, > > +"linkspeed", IB_PORT_LINK_SPEED_ACTIVE_F > > +); > > +.in -16 > > +.PP > > +Results in the output. > > +.PP > > +Port info (DR path slid 0; dlid 0; 0,1,14): > > +.in +3 > > +LID : 0x0016 > > Lids are printed as decimals. Well I thought I copied the output from the example but I see that it is printing decimal. So?? :-/ I fixed it. As an aside, not all LID's are decimal. Should we change this? from fields.c ... {BE_OFFS(256, 16), "DrSmpDLID", mad_dump_hex}, {BE_OFFS(272, 16), "DrSmpSLID", mad_dump_hex}, ... {BITSOFFS(224, 16), "RedirectLID", mad_dump_hex}, {BITSOFFS(480, 16), "TrapLID", mad_dump_hex}, ... {BITSOFFS(320, 16), "PathRecDLid", mad_dump_hex}, {BITSOFFS(336, 16), "PathRecSLid", mad_dump_hex}, ... {BITSOFFS(288, 16), "McastMemMLid", mad_dump_hex}, [snip] > > diff --git a/libibmad/man/mad_snprintf.3 b/libibmad/man/mad_snprintf.3 > > new file mode 100644 > > index 0000000..c004ab9 > > --- /dev/null > > +++ b/libibmad/man/mad_snprintf.3 > > @@ -0,0 +1,2 @@ > > +.TH MAD_SNPRINTF 3 "Feb 26, 2009" "OpenIB" "OpenIB Programmer\'s Manual" > > +.so man3/mad_fprintf.3 > > diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c > > index 19c8fc1..acb1180 100644 > > --- a/libibmad/src/fields.c > > +++ b/libibmad/src/fields.c > > @@ -38,7 +38,9 @@ > > > > #include > > #include > > +#define _GNU_SOURCE > > Where is _GNU_SOURCE really used (I didn't find)? 
Yep you are right, I don't need this. [snip] > > + } > > + str = number(str, n-rc, &rc, num, base, field_width, precision, flags); > > + if (rc <= 1) > > + break; > > + } > > +max_len_hit: > > + *str = '\0'; > > + return str-s; > > +} > > Now instead of reimplementing *printf() functions with potential need > to follow their extensions/conventions/update/etc wouldn't it be easier > (and in long term safer) to just rebuild format string by resolving > known %X conversions and then to pass it with rest parameters to > standard libc's *printf()? > > In this way we will support all what *printf()s know + our conversions. > I thought about that but decided not to do it. I can't remember why though... ;-) So maybe I agree with you, let me try and remember and if I can't I will change it. [snip] Ira From sashak at voltaire.com Wed Apr 15 17:59:44 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 16 Apr 2009 03:59:44 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH 1/2] opensm/iba/ib_types.h: Add MaxCreditHint and LinkRoundTripLatency to PortInfo attribute In-Reply-To: <20090415184400.GA10166@comcast.net> References: <20090415184400.GA10166@comcast.net> Message-ID: <20090416005944.GD10146@sk> On 14:44 Wed 15 Apr , Hal Rosenstock wrote: > > Also, add comment on error thresholds > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Wed Apr 15 18:00:02 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 16 Apr 2009 04:00:02 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH 2/2] opensm/osm_helper.c: Add support for MaxCreditHint and LinkRoundTripLatency to osm_dump_port_info In-Reply-To: <20090415184510.GB10166@comcast.net> References: <20090415184510.GB10166@comcast.net> Message-ID: <20090416010002.GE10146@sk> On 14:45 Wed 15 Apr , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
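The last option mentioned above, exchanging the connection data over a standard socket, can be sketched roughly as below. This is a minimal sketch under stated assumptions: `struct qp_info`, its wire layout, and the `qp_info_*` helpers are hypothetical names invented for illustration and are not part of libibverbs or librdmacm. In a real program the values would come from the verbs objects (for example `qp->qp_num` after `ibv_create_qp()` and `port_attr.lid` after `ibv_query_port()`), and the socket would be a TCP connection between the two hosts rather than a local socketpair.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical record of what each side must learn about its peer. */
struct qp_info {
	uint32_t qp_num;	/* queue pair number */
	uint32_t psn;		/* initial packet sequence number */
	uint16_t lid;		/* local identifier of the port */
};

#define QP_INFO_WIRE_LEN 10	/* 4 + 4 + 2 bytes, no padding on the wire */

/* Send one record in network byte order; returns 0 on success, -1 on error. */
int qp_info_send(int fd, const struct qp_info *info)
{
	uint8_t buf[QP_INFO_WIRE_LEN];
	uint32_t qpn = htonl(info->qp_num);
	uint32_t psn = htonl(info->psn);
	uint16_t lid = htons(info->lid);
	size_t done = 0;

	memcpy(buf, &qpn, sizeof(qpn));
	memcpy(buf + 4, &psn, sizeof(psn));
	memcpy(buf + 8, &lid, sizeof(lid));

	while (done < sizeof(buf)) {
		ssize_t n = write(fd, buf + done, sizeof(buf) - done);
		if (n <= 0)
			return -1;
		done += (size_t)n;
	}
	return 0;
}

/* Receive one record and convert it to host byte order; 0 on success. */
int qp_info_recv(int fd, struct qp_info *info)
{
	uint8_t buf[QP_INFO_WIRE_LEN];
	uint32_t qpn, psn;
	uint16_t lid;
	size_t done = 0;

	while (done < sizeof(buf)) {
		ssize_t n = read(fd, buf + done, sizeof(buf) - done);
		if (n <= 0)
			return -1;
		done += (size_t)n;
	}

	memcpy(&qpn, buf, sizeof(qpn));
	memcpy(&psn, buf + 4, sizeof(psn));
	memcpy(&lid, buf + 8, sizeof(lid));
	info->qp_num = ntohl(qpn);
	info->psn = ntohl(psn);
	info->lid = ntohs(lid);
	return 0;
}

/* Demo with both "hosts" in one process, connected by a socketpair. */
int qp_info_demo(void)
{
	struct qp_info a = { 0x1234, 0xabcdef, 0x16 };	/* made-up values */
	struct qp_info b = { 0x5678, 0x123456, 0x17 };
	struct qp_info seen_by_a, seen_by_b;
	int sv[2], rc = -1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	/* Each side sends its own record, then reads the peer's. */
	if (qp_info_send(sv[0], &a) == 0 && qp_info_send(sv[1], &b) == 0 &&
	    qp_info_recv(sv[0], &seen_by_a) == 0 &&
	    qp_info_recv(sv[1], &seen_by_b) == 0 &&
	    seen_by_a.qp_num == b.qp_num && seen_by_a.psn == b.psn &&
	    seen_by_a.lid == b.lid &&
	    seen_by_b.qp_num == a.qp_num && seen_by_b.psn == a.psn &&
	    seen_by_b.lid == a.lid)
		rc = 0;

	close(sv[0]);
	close(sv[1]);
	return rc;
}
```

Once each side holds the peer's record, it would move its QP through the usual state transitions with `ibv_modify_qp()`; librdmacm performs the equivalent exchange and transitions on the application's behalf.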
Sasha From sean.hefty at intel.com Wed Apr 15 16:43:27 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Wed, 15 Apr 2009 16:43:27 -0700 Subject: [ofa-general] RDMA over infiniband, diffrences between rdam_cm and libmthca-rdmav2 In-Reply-To: <8d9c773c0904150558n64169bffl8db8445f66c029dc@mail.gmail.com> References: <8d9c773c0904150558n64169bffl8db8445f66c029dc@mail.gmail.com> Message-ID: >I am new in infiniband, and I am doing some research on rdma. >I have found two diffrents way of sending data on infiniband protucts >using rdma. >The first one use rdam_cm module (from kernel source code), and second >one use libmthca-rdmav2/libibverbs. > >If someone can explain me the diffrences between this two types of programming. The library to send data is libibverbs. The rdma_cm (or librdmacm) is one method that can be used to setup the QPs for communication. I.e. exchange the QP numbers, LIDs, etc. You could also setup the QPs using the libibcm or just exchange the data over a standard socket. If you look at the librdmacm code, you will see that it calls the libibverbs functions to allocate and modify the QP. 
- Sean From hnrose at comcast.net Wed Apr 15 11:45:10 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 15 Apr 2009 14:45:10 -0400 Subject: [ofa-general] ***SPAM*** [PATCH 2/2] opensm/osm_helper.c: Add support for MaxCreditHint and LinkRoundTripLatency to osm_dump_port_info Message-ID: <20090415184510.GB10166@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c index ac4b372..0dc8055 100644 --- a/opensm/opensm/osm_helper.c +++ b/opensm/opensm/osm_helper.c @@ -827,7 +827,9 @@ void osm_dump_port_info(IN osm_log_t * p_log, IN const ib_net64_t node_guid, "\t\t\t\tclient_reregister.......0x%X\n" "\t\t\t\tsubnet_timeout..........0x%X\n" "\t\t\t\tresp_time_value.........0x%X\n" - "\t\t\t\terror_threshold.........0x%X\n", + "\t\t\t\terror_threshold.........0x%X\n" + "\t\t\t\tmax_credit_hint.........0x%X\n" + "\t\t\t\tlink_round_trip_latency.0x%X\n", port_num, cl_ntoh64(node_guid), cl_ntoh64(port_guid), @@ -855,7 +857,8 @@ void osm_dump_port_info(IN osm_log_t * p_log, IN const ib_net64_t node_guid, cl_ntoh16(p_pi->q_key_violations), p_pi->guid_cap, ib_port_info_get_client_rereg(p_pi), ib_port_info_get_timeout(p_pi), p_pi->resp_time_value, - p_pi->error_threshold); + p_pi->error_threshold, cl_ntoh16(p_pi->max_credit_hint), + cl_ntoh32(p_pi->link_rt_latency)); /* show the capabilities mask */ if (p_pi->capability_mask) { From helight.xu at gmail.com Wed Apr 15 18:11:02 2009 From: helight.xu at gmail.com (Zhenwen Xu) Date: Thu, 16 Apr 2009 09:11:02 +0800 Subject: [ofa-general] Re: [PATCH] fix a warning on drivers/infiniband/hw/nes/nes_cm.c:862: In-Reply-To: <20090415143228.d57a0201.akpm@linux-foundation.org> References: <20090412122317.GA4787@helight> <20090415143228.d57a0201.akpm@linux-foundation.org> Message-ID: <20090416011102.GA3419@helight> On Wed, Apr 15, 2009 at 02:32:28PM -0700, Andrew Morton wrote: > On Sun, 12 Apr 2009 20:23:17 +0800 > Zhenwen Xu wrote: > > > Fix this warning: > > 
drivers/infiniband/hw/nes/nes_cm.c:862: warning: unused variable ___tmp_addr___ > > > > the 'tmp_addr' is defined for debug, so it should be defined in > > CONFIG_INFINIBAND_NES_DEBUG > > > > > > >From 5f67884bcda5450807dcd080378d829628e4db1c Mon Sep 17 00:00:00 2001 > > From: Zhenwen Xu > > Date: Sun, 12 Apr 2009 20:12:18 +0800 > > Subject: [PATCH] fix a warning on drivers/infiniband/hw/nes/nes_cm.c:862: > > > > Signed-off-by: Zhenwen Xu > > --- > > drivers/infiniband/hw/nes/nes_cm.c | 3 ++- > > 1 files changed, 2 insertions(+), 1 deletions(-) > > > > diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c > > index dbd9a75..1bad93b 100644 > > --- a/drivers/infiniband/hw/nes/nes_cm.c > > +++ b/drivers/infiniband/hw/nes/nes_cm.c > > @@ -854,8 +854,9 @@ static struct nes_cm_listener *find_listener(struct nes_cm_core *cm_core, > > { > > unsigned long flags; > > struct nes_cm_listener *listen_node; > > +#ifdef CONFIG_INFINIBAND_NES_DEBUG > > __be32 tmp_addr = cpu_to_be32(dst_addr); > > - > > +#endif > > /* walk list and find cm_node associated with this session ID */ > > spin_lock_irqsave(&cm_core->listen_list_lock, flags); > > list_for_each_entry(listen_node, &cm_core->listen_list.list, list) { > > eek, an ugly ifdef. And we can't just remove tmp_addr because > printk(%p) wants to be passed an address rather than a value. > > It'd be nice if we had a handy macro to squish the warning, like > uninitialized_var. 
>
> As it happens, uninitialized_var() _does_ suppress the unused-var warning:
>
> --- a/drivers/infiniband/hw/nes/nes_cm.c~drivers-infiniband-hw-nes-nes_cmc-fix-unused-var-warning-cleanup
> +++ a/drivers/infiniband/hw/nes/nes_cm.c
> @@ -854,9 +854,8 @@ static struct nes_cm_listener *find_list
>  {
>  	unsigned long flags;
>  	struct nes_cm_listener *listen_node;
> -#ifdef CONFIG_INFINIBAND_NES_DEBUG
> -	__be32 tmp_addr = cpu_to_be32(dst_addr);
> -#endif
> +	__be32 uninitialized_var(tmp_addr) = cpu_to_be32(dst_addr);
> +
>  	/* walk list and find cm_node associated with this session ID */
>  	spin_lock_irqsave(&cm_core->listen_list_lock, flags);
>  	list_for_each_entry(listen_node, &cm_core->listen_list.list, list) {
>
>
> but that seems a bit abusive ;)

Thanks! I got it.

--
---------------------------------
Zhenwen Xu - Open and Free
Home Page: http://zhwen.org
My Studio: http://dim4.cn

From jsquyres at cisco.com Wed Apr 15 13:14:43 2009
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 15 Apr 2009 16:14:43 -0400
Subject: [ofa-general] ofed_info reports wrong information
Message-ID: <4F7866A6-271A-4302-AF0E-04703FDF8E9C@cisco.com>

(bugzilla is impossible to use right now)

In OFED 1.4.1rc3, I see that ofed_info always outputs the MPI information. Indeed, ofed_info is simply a script that invokes

cat << EOF
...all the information, including MPI versions...
EOF

Not all the information is valid if I elected to only install the core OpenFabrics RPMs (i.e., option 1 in the installer).

--
Jeff Squyres
Cisco Systems

From worleys at gmail.com Wed Apr 15 14:12:12 2009
From: worleys at gmail.com (Chris Worley)
Date: Wed, 15 Apr 2009 15:12:12 -0600
Subject: [ofa-general] ***SPAM*** Any per-port counter a user can look at?
Message-ID:

Like RX and TX on an Ethernet interface, are there any IB port I/O counters a user can query, and if so, how?
Thanks, Chris From hal.rosenstock at gmail.com Wed Apr 15 13:29:31 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Wed, 15 Apr 2009 16:29:31 -0400 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Re: [PATCHv2] opensm/PerfMgr: Better redirection support In-Reply-To: <20090415125925.GF7353@sk> References: <20090312202134.GC25024@comcast.net> <20090415125925.GF7353@sk> Message-ID: Hi Sasha, On Wed, Apr 15, 2009 at 8:59 AM, Sasha Khapyorsky wrote: > Hi Hal, > > On 15:21 Thu 12 Mar     , Hal Rosenstock wrote: >> >> Handle PKey and QPN redirection information >> GID redirection handling remains >> >> Signed-off-by: Hal Rosenstock >> >> --- >> Changes since v1: >> Added include of osm_helper.h to osm_perfmgr.c >> >> Notes: osm_console redirection patch is unchanged >> Also, relies on ib_gid_is_notzero patch >> >> diff --git a/opensm/include/opensm/osm_perfmgr.h b/opensm/include/opensm/osm_perfmgr.h >> index 45dec54..651ea80 100644 >> --- a/opensm/include/opensm/osm_perfmgr.h >> +++ b/opensm/include/opensm/osm_perfmgr.h >> @@ -92,8 +92,14 @@ typedef enum { >> >>  /* Redirection information */ >>  typedef struct redir { >> +     boolean_t redirection; >> +     boolean_t invalid; > > Why using lid value != 0 is/was bad for redirection invalidation? b/c there are other fields supplied in the redirection which also could be invalid so this one flag summarizes that instead of overloading one of the fields. Previously the only thing looked at was lid (and qp) so using lid 0 as invalid was used. >> +     ib_gid_t redir_gid; >>       ib_net16_t redir_lid; >> +     ib_net16_t redir_pkey; >>       ib_net32_t redir_qp; >> +     uint16_t redir_pkey_ix; > > Don't need to repeat structure name (redir) in field names - it just > makes lines longer - 'redir->pkey_idx' looks pretty clear. I was following the naming already present (redir_lid/qp). I'll change this in the next version. 
>> +     ib_net16_t orig_lid; >>  } redir_t; >> >>  /* Node to store information about which nodes we are monitoring */ >> @@ -134,6 +140,7 @@ typedef struct osm_perfmgr { >>       uint32_t max_outstanding_queries; >>       cl_qmap_t monitored_map;        /* map the nodes we are tracking */ >>       __monitored_node_t *remove_list; >> +     int16_t local_port; >>  } osm_perfmgr_t; >>  /* >>  * FIELDS >> diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c >> index 4a6f65c..fc1b7cb 100644 >> --- a/opensm/opensm/osm_perfmgr.c >> +++ b/opensm/opensm/osm_perfmgr.c >> @@ -47,7 +47,6 @@ >>  #endif                               /* HAVE_CONFIG_H */ >> >>  #ifdef ENABLE_OSM_PERF_MGR >> - >>  #include >>  #include >>  #include >> @@ -65,8 +64,11 @@ >>  #include >>  #include >>  #include >> +#include >> >>  #define OSM_PERFMGR_INITIAL_TID_VALUE 0xcafe >> +#define MAX_LOCAL_IBPORTS 64 >> +#define MAX_LOCAL_PKEYS   256 >> >>  #if ENABLE_OSM_PERF_MGR_PROFILE >>  struct { >> @@ -118,8 +120,6 @@ static inline void diff_time(struct timeval *before, >>  } >>  #endif >> >> -extern int wait_for_pending_transactions(osm_stats_t * stats); >> - >>  /********************************************************************** >>   * Internal helper functions. 
>>   **********************************************************************/ >> @@ -203,6 +203,7 @@ osm_perfmgr_mad_send_err_callback(void *bind_context, osm_madw_t * p_madw) >>       uint8_t port = context->perfmgr_context.port; >>       cl_map_item_t *p_node; >>       __monitored_node_t *p_mon_node; >> +     ib_net16_t orig_lid; >> >>       OSM_LOG_ENTER(pm->log); >> >> @@ -233,9 +234,10 @@ osm_perfmgr_mad_send_err_callback(void *bind_context, osm_madw_t * p_madw) >>                               p_mon_node->guid, p_mon_node->redir_tbl_size); >>                       goto Exit; >>               } >> -             /* Clear redirection info */ >> -             p_mon_node->redir_port[port].redir_lid = 0; >> -             p_mon_node->redir_port[port].redir_qp = 0; >> +             /* Clear redirection info for this port except orig_lid */ >> +             orig_lid = p_mon_node->redir_port[port].orig_lid; >> +             memset(&p_mon_node->redir_port[port], 0, sizeof(redir_t)); >> +             p_mon_node->redir_port[port].orig_lid = orig_lid; > > Hmm, why should 'orig_lid' be part of redirection structure and not > placed on original node/port (below I see that it is used in > non-redirected paths)? What are you referring to here ? > I think it would be better to use structures like: > > struct node { >        .... >        uint16_t lid; >        uint16_t pkey_ix; Why would lid and pkey_ix be part of node ? >        unsigned num_ports; >        struct port { >                .... >                struct redir_info { /* or even same 'struct port' */ >                        ... >                } *ri; >        } ports[0]; > } It's like this with redir_tbl_size r.t. num_ports and redir_t redir_port[1] instead of your struct port {} ports[0]. Why would what you propose be better ? 
>>               cl_plock_release(pm->lock); >>       } >> >> @@ -292,6 +294,40 @@ osm_perfmgr_bind(osm_perfmgr_t * const pm, const ib_net64_t port_guid) >>               goto Exit; >>       } >> >> +     /* if redirection enabled, determine local port from port GUID */ >> +     if (pm->subn->opt.perfmgr_redir) { >> +             ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS]; >> +             uint32_t num_ports = MAX_LOCAL_IBPORTS; >> +             int i; >> + >> +             for (i = 0; i < num_ports; i++) { >> +                     attr_array[i].num_pkeys = 0; >> +                     attr_array[i].p_pkey_table = NULL; >> +             } >> + >> +             /* call transport layer for a list of local port GUIDs */ >> +             status = osm_vendor_get_all_port_attr(pm->subn->p_osm->p_vendor, >> +                                                   attr_array, &num_ports); >> +             if (status != IB_SUCCESS) { >> +                     OSM_LOG(pm->log, OSM_LOG_ERROR, >> +                             "ERR 4C1C: osm_vendor_get_all_port_attr status 0x%x\n", >> +                             status); >> +                     goto Exit; >> +             } >> +             if (num_ports == 0) { >> +                     OSM_LOG(pm->log, OSM_LOG_ERROR, >> +                             "ERR 4C1D: No local ports detected!\n"); >> +                     goto Exit; >> +             } >> + >> +             for (i = 0; i < num_ports; i++) { >> +                     if (port_guid == attr_array[i].port_guid) { >> +                             pm->local_port = attr_array[i].port_num; >> +                             break; >> +                     } >> +             } >> +     } >> + > > PerfMgr is always running over discovered fabric so maybe local port > number should be detected later at start of PerfMgr process cycle just > using OpenSM DB. Why is that better than doing this at bind time of PerfMgr ? 
> Also see comment below about pkey validation per redirection request. > >>  Exit: >>       OSM_LOG_EXIT(pm->log); >>       return (status); >> @@ -321,25 +357,16 @@ static ib_net32_t get_qp(__monitored_node_t * mon_node, uint8_t port) >> >>       if (mon_node && mon_node->redir_tbl_size && >>           port < mon_node->redir_tbl_size && >> -         mon_node->redir_port[port].redir_lid && >> +         mon_node->redir_port[port].redirection && >>           mon_node->redir_port[port].redir_qp) >>               qp = mon_node->redir_port[port].redir_qp; >> >>       return qp; >>  } >> >> -/********************************************************************** >> - * Given a node, a port, and an optional monitored node, >> - * return the appropriate lid to query that port >> - **********************************************************************/ >>  static ib_net16_t >> -get_lid(osm_node_t * p_node, uint8_t port, __monitored_node_t * mon_node) >> +get_base_lid(osm_node_t * p_node, uint8_t port) >>  { >> -     if (mon_node && mon_node->redir_tbl_size && >> -         port < mon_node->redir_tbl_size && >> -         mon_node->redir_port[port].redir_lid) >> -             return mon_node->redir_port[port].redir_lid; >> - >>       switch (p_node->node_info.node_type) { >>       case IB_NODE_TYPE_CA: >>       case IB_NODE_TYPE_ROUTER: >> @@ -352,12 +379,27 @@ get_lid(osm_node_t * p_node, uint8_t port, __monitored_node_t * mon_node) >>  } >> >>  /********************************************************************** >> + * Given a node, a port, and an optional monitored node, >> + * return the lid appropriate to query that port >> + **********************************************************************/ >> +static ib_net16_t >> +get_lid(osm_node_t * p_node, uint8_t port, __monitored_node_t * mon_node) >> +{ >> +     if (mon_node && mon_node->redir_tbl_size && >> +         port < mon_node->redir_tbl_size && >> +         mon_node->redir_port[port].redir_lid) >> +             
return mon_node->redir_port[port].redir_lid; >> + >> +     return get_base_lid(p_node, port); >> +} >> + >> +/********************************************************************** >>   * Form and send the Port Counters MAD for a single port. >>   **********************************************************************/ >>  static ib_api_status_t >>  osm_perfmgr_send_pc_mad(osm_perfmgr_t * perfmgr, ib_net16_t dest_lid, >> -                     ib_net32_t dest_qp, uint8_t port, uint8_t mad_method, >> -                     osm_madw_context_t * const p_context) >> +                     ib_net32_t dest_qp, uint16_t pkey_ix, uint8_t port, >> +                     uint8_t mad_method, osm_madw_context_t * const p_context) >>  { >>       ib_api_status_t status = IB_SUCCESS; >>       ib_port_counters_t *port_counter = NULL; >> @@ -396,8 +438,7 @@ osm_perfmgr_send_pc_mad(osm_perfmgr_t * perfmgr, ib_net16_t dest_lid, >>       p_madw->mad_addr.addr_type.gsi.remote_qp = dest_qp; >>       p_madw->mad_addr.addr_type.gsi.remote_qkey = >>           cl_hton32(IB_QP1_WELL_KNOWN_Q_KEY); >> -     /* FIXME what about other partitions */ >> -     p_madw->mad_addr.addr_type.gsi.pkey_ix = 0; >> +     p_madw->mad_addr.addr_type.gsi.pkey_ix = pkey_ix; >>       p_madw->mad_addr.addr_type.gsi.service_level = 0; >>       p_madw->mad_addr.addr_type.gsi.global_route = FALSE; >>       p_madw->resp_expected = TRUE; >> @@ -432,28 +473,32 @@ static void __collect_guids(cl_map_item_t * const p_map_item, void *context) >>       uint64_t node_guid = cl_ntoh64(node->node_info.node_guid); >>       osm_perfmgr_t *pm = (osm_perfmgr_t *) context; >>       __monitored_node_t *mon_node = NULL; >> -     uint32_t size; >> +     uint32_t num_ports; >> +     int port; >> >>       OSM_LOG_ENTER(pm->log); >> >>       if (cl_qmap_get(&pm->monitored_map, node_guid) >>           == cl_qmap_end(&pm->monitored_map)) { >>               /* if not already in our map add it */ >> -             size = 
osm_node_get_num_physp(node); >> -             mon_node = malloc(sizeof(*mon_node) + sizeof(redir_t) * size); >> +             num_ports = osm_node_get_num_physp(node); >> +             mon_node = malloc(sizeof(*mon_node) + sizeof(redir_t) * num_ports); >>               if (!mon_node) { >>                       OSM_LOG(pm->log, OSM_LOG_ERROR, "PerfMgr: ERR 4C06: " >>                               "malloc failed: not handling node %s" >>                               "(GUID 0x%" PRIx64 ")\n", node->print_desc, node_guid); >>                       goto Exit; >>               } >> -             memset(mon_node, 0, sizeof(*mon_node) + sizeof(redir_t) * size); >> +             memset(mon_node, 0, sizeof(*mon_node) + sizeof(redir_t) * num_ports); >>               mon_node->guid = node_guid; >>               mon_node->name = strdup(node->print_desc); >> -             mon_node->redir_tbl_size = size; >> +             mon_node->redir_tbl_size = num_ports; >>               /* check for enhanced switch port 0 */ >>               mon_node->esp0 = (node->sw && >>                   ib_switch_info_is_enhanced_port0(&node->sw->switch_info)); >> +             for (port = (mon_node->esp0) ? 0 : 1; port < num_ports; port++) >> +                     mon_node->redir_port[port].orig_lid = get_base_lid(node, port); >> + >>               cl_qmap_insert(&(pm->monitored_map), node_guid, >>                              (cl_map_item_t *) mon_node); >>       } >> @@ -511,6 +556,10 @@ __osm_perfmgr_query_counters(cl_map_item_t * const p_map_item, void *context) >>               if (!osm_node_get_physp_ptr(node, port)) >>                       continue; >> >> +             if (mon_node->redir_port[port].redirection && >> +                 mon_node->redir_port[port].invalid) >> +                     continue; >> + > > Are two flags really needed? Couldn't this be stripped down? Just invalid should be sufficient. I'll change this in the next version. 
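A sketch of the single-flag variant agreed on above, assuming a trimmed-down stand-in for redir_t. The struct layout and the skip_port()/query_lid() helpers are hypothetical; the next version of the patch may well look different.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down version of the patch's redir_t. */
struct redir {
	uint16_t redir_lid;	/* 0 means no LID redirection recorded */
	uint32_t redir_qp;
	bool invalid;		/* set when the redirection info failed validation */
};

/* A port is skipped only when its recorded redirection was rejected. */
static bool skip_port(const struct redir *r)
{
	return r->invalid;
}

/* Pick the redirected LID when one is recorded, else the port's base LID. */
static uint16_t query_lid(const struct redir *r, uint16_t base_lid)
{
	return r->redir_lid ? r->redir_lid : base_lid;
}
```

Here a non-zero redir_lid already says that a redirection is recorded, so a separate 'redirection' boolean carries no extra information and only 'invalid' is kept.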
> Also what about letting "chance" for port to refresh redirection info? What do you mean ? >>               lid = get_lid(node, port, mon_node); >>               if (lid == 0) { >>                       OSM_LOG(pm->log, OSM_LOG_DEBUG, "WARN: node 0x%" PRIx64 >> @@ -532,8 +581,10 @@ __osm_perfmgr_query_counters(cl_map_item_t * const p_map_item, void *context) >>                       PRIx64 " port %d (lid %u) (%s)\n", node_guid, port, >>                       cl_ntoh16(lid), node->print_desc); >>               status = >> -                 osm_perfmgr_send_pc_mad(pm, lid, remote_qp, port, >> -                                         IB_MAD_METHOD_GET, &mad_context); >> +                 osm_perfmgr_send_pc_mad(pm, lid, remote_qp, >> +                                         mon_node->redir_port[port].redir_pkey_ix, >> +                                         port, IB_MAD_METHOD_GET, >> +                                         &mad_context); >>               if (status != IB_SUCCESS) >>                       OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C09: " >>                               "Failed to issue port counter query for node 0x%" >> @@ -550,6 +601,7 @@ Exit: >>   * Discovery stuff. 
>>   * Basically this code should not be here, but merged with main OpenSM >>   **********************************************************************/ >> +extern int wait_for_pending_transactions(osm_stats_t * stats); >>  extern void osm_drop_mgr_process(IN osm_sm_t *sm); >> >>  static int sweep_hop_1(osm_sm_t * sm) >> @@ -980,6 +1032,10 @@ osm_perfmgr_check_overflow(osm_perfmgr_t * pm, __monitored_node_t *mon_node, >>               osm_node_t *p_node = NULL; >>               ib_net16_t lid = 0; >> >> +             if (mon_node->redir_port[port].redirection && >> +                 mon_node->redir_port[port].invalid) >> +                     goto Exit; >> + >>               osm_log(pm->log, OSM_LOG_VERBOSE, >>                       "PerfMgr: Counter overflow: %s (0x%" PRIx64 >>                       ") port %d; clearing counters\n", >> @@ -1004,8 +1060,10 @@ osm_perfmgr_check_overflow(osm_perfmgr_t * pm, __monitored_node_t *mon_node, >>               mad_context.perfmgr_context.mad_method = IB_MAD_METHOD_SET; >>               /* clear port counters */ >>               status = >> -                 osm_perfmgr_send_pc_mad(pm, lid, remote_qp, port, >> -                                         IB_MAD_METHOD_SET, &mad_context); >> +                 osm_perfmgr_send_pc_mad(pm, lid, remote_qp, >> +                                         mon_node->redir_port[port].redir_pkey_ix, >> +                                         port, IB_MAD_METHOD_SET, >> +                                         &mad_context); >>               if (status != IB_SUCCESS) >>                       OSM_LOG(pm->log, OSM_LOG_ERROR, "PerfMgr: ERR 4C11: " >>                               "Failed to send clear counters MAD for %s (0x%" >> @@ -1063,6 +1121,73 @@ osm_perfmgr_log_events(osm_perfmgr_t * pm, __monitored_node_t *mon_node, uint8_t >>                       time_diff, mon_node->name, mon_node->guid, port); >>  } >> >> +static boolean_t validate_redir_pkey(osm_perfmgr_t *pm, ib_net16_t pkey, >> 
+                                  uint16_t *pkey_ix) >> +{ > > This function can just return pkey index value (or negative in case of > failure). Architecturally that's not true but practically it is as noone implements many pkeys. I'll change in next version. >> +     ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS]; >> +     uint32_t num_ports = MAX_LOCAL_IBPORTS; >> +     ib_api_status_t status; >> +     boolean_t sts = FALSE; >> +     uint16_t i = 0; >> + >> +     OSM_LOG_ENTER(pm->log); >> + >> +     for (i = 0; i < num_ports; i++) { >> +             attr_array[i].num_pkeys = 0; >> +             attr_array[i].p_pkey_table = NULL; >> +     } >> + >> +     /* If local port couldn't be determined previously */ >> +     if (pm->local_port == -1) >> +             goto not_found; >> + >> +     attr_array[pm->local_port].num_pkeys = MAX_LOCAL_PKEYS; >> +     attr_array[pm->local_port].p_pkey_table = >> +                             malloc(MAX_LOCAL_PKEYS * sizeof(ib_net16_t)); >> +     if (!attr_array[pm->local_port].p_pkey_table) { >> +             OSM_LOG(pm->log, OSM_LOG_ERROR, >> +                     "ERR 4C20: No memory for port %d pkey table\n", >> +                     pm->local_port); >> +             goto not_found; >> +     } >> + >> +     /* call the transport layer for a list of local port pkeys */ >> +     status = osm_vendor_get_all_port_attr(pm->subn->p_osm->p_vendor, >> +                                           attr_array, &num_ports); > > This heavy stuff is performed per redirection request, but it actually > uses same data (local port's pkey table). This looks very inefficient. > Instead of doing this you can get local port's pkey table only once at > PerfMgr process cycle start and do all checks against this already > initialized table. Of course only in case when redirection is enabled at > all. Redirection does not occur frequently. 
Also, the pkey table could change in between and there's no local event support in OpenSM so I don't see a way around this other than polling. >> +     if (status != IB_SUCCESS) { >> +             OSM_LOG(pm->log, OSM_LOG_ERROR, >> +                     "ERR 4C1E: osm_vendor_get_all_port_attr status 0x%x\n", >> +                     status); >> +             goto not_found; >> +     } >> +     if (num_ports == 0 || pm->local_port > num_ports) { >> +             OSM_LOG(pm->log, OSM_LOG_ERROR, >> +                     "ERR 4C1F: No local ports detected or local port out of range!\n"); >> +             goto not_found; >> +     } >> +     ib_net16_t *pkey_table = attr_array[pm->local_port].p_pkey_table; >> +     for (i = 0; i < attr_array[pm->local_port].num_pkeys; i++) >> +             if (pkey_table[i] == pkey) >> +                     break; >> +     if (i == attr_array[pm->local_port].num_pkeys) { >> +             i = 0; >> +             goto not_found; >> +     } >> +     free(attr_array[pm->local_port].p_pkey_table); >> +     sts = TRUE; >> +     goto Exit; >> + >> +not_found: >> +     if (attr_array[pm->local_port].p_pkey_table) >> +             free(attr_array[pm->local_port].p_pkey_table); >> +     sts = FALSE; >> +Exit: >> +     if (pkey_ix) >> +             *pkey_ix = i; >> +     OSM_LOG_EXIT(pm->log); >> +     return sts; >> +} >> + >>  /********************************************************************** >>   * The dispatcher uses a thread pool which will call this function when >>   * we have a thread available to process our mad received from the wire. 
>> @@ -1082,6 +1207,8 @@ static void osm_pc_rcv_process(void *context, void *data) >>       perfmgr_db_data_cnt_reading_t data_reading; >>       cl_map_item_t *p_node; >>       __monitored_node_t *p_mon_node; >> +     uint16_t pkey_ix; >> +     boolean_t invalid = FALSE; >> >>       OSM_LOG_ENTER(pm->log); >> >> @@ -1105,7 +1232,8 @@ static void osm_pc_rcv_process(void *context, void *data) >>                 p_mad->attr_id == IB_MAD_ATTR_CLASS_PORT_INFO); >> >>       /* Response could also be redirection (IBM eHCA PMA does this) */ >> -     if (p_mad->attr_id == IB_MAD_ATTR_CLASS_PORT_INFO) { >> +     if (p_mad->status & IB_MAD_STATUS_REDIRECT && >> +         p_mad->attr_id == IB_MAD_ATTR_CLASS_PORT_INFO) { >>               char gid_str[INET6_ADDRSTRLEN]; >>               ib_class_port_info_t *cpi = >>                   (ib_class_port_info_t *) & >> @@ -1119,17 +1247,48 @@ static void osm_pc_rcv_process(void *context, void *data) >>                                 sizeof gid_str), >>                       cl_ntoh32(cpi->redir_qp)); >> >> -             /* LID or GID redirection ? */ >> -             /* For GID redirection, need to get PathRecord from SA */ >> +             /* valid redirection ? 
*/ >>               if (cpi->redir_lid == 0) { >> -                     OSM_LOG(pm->log, OSM_LOG_VERBOSE, >> -                             "GID redirection not currently implemented!\n"); >> -                     goto Exit; >> +                     if (!ib_gid_is_notzero(&cpi->redir_gid)) { >> +                             OSM_LOG(pm->log, OSM_LOG_ERROR, >> +                                     "ERR 4C17: Invalid redirection " >> +                                     "(both redirect LID and GID are zero)\n"); >> +                             invalid = TRUE; >> +                     } >> +             } >> +             if (cpi->redir_qp == 0) { >> +                     OSM_LOG(pm->log, OSM_LOG_ERROR, >> +                             "ERR 4C18: Invalid RedirectQP\n"); >> +                     invalid = TRUE; >> +             } >> +             if (cpi->redir_pkey == 0) { >> +                     OSM_LOG(pm->log, OSM_LOG_ERROR, >> +                             "ERR 4C19: Invalid RedirectP_Key\n"); >> +                     invalid = TRUE; >> +             } >> +             if (cpi->redir_qkey != IB_QP1_WELL_KNOWN_Q_KEY) { >> +                     OSM_LOG(pm->log, OSM_LOG_ERROR, >> +                             "ERR 4C1A: Invalid RedirectQ_Key\n"); >> +                     invalid = TRUE; >> +             } >> + >> +             if (!validate_redir_pkey(pm, cpi->redir_pkey, &pkey_ix)) { >> +                     OSM_LOG(pm->log, OSM_LOG_ERROR, >> +                             "ERR 4C1B: Index for Pkey 0x%x not found\n", >> +                             cl_ntoh16(cpi->redir_pkey)); >> +                     invalid = TRUE; >>               } > > All above are not OpenSM errors, but wrong external data. I think it > should be logged as VERBOSE messages. I agree it's wrong external data but it seems serious enough to me to treat as an error. If not, at least INFO rather than VERBOSE so nothing special needs to be done to see these. 
>> >>               if (!pm->subn->opt.perfmgr_redir) { >> -                             OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C16: " >> -                                    "redirection requested but disabled\n"); >> +                     OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C16: " >> +                             "redirection requested but disabled\n"); >> +                     invalid = TRUE; >> +             } > > This is not an error. Seems like some sort of configuration error to me if this is disabled at the manager but the PMA wants to use it. Other local configuration errors are treated as errors. > BTW, why to bother with verifying redirection info when redirection > support is disabled anyway? I thought it was useful to know the redirection info was invalid rather than getting the disabled notification and then enabling and finding out. It can easily be moved up earlier in the flow if that's better. >> + >> +             if (cpi->redir_lid == 0) { >> +                     /* GID redirection: get PathRecord information */ >> +                     OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C21: " >> +                             "GID redirection not currently supported\n"); >>                       goto Exit; >>               } >> >> @@ -1144,14 +1303,23 @@ static void osm_pc_rcv_process(void *context, void *data) >>                               p_mon_node->redir_tbl_size); >>                       goto Exit; >>               } >> +             p_mon_node->redir_port[port].redirection = TRUE; >> +             p_mon_node->redir_port[port].invalid = invalid; >> +             memcpy(&p_mon_node->redir_port[port].redir_gid, >> +                    &cpi->redir_gid, sizeof(ib_gid_t)); >>               p_mon_node->redir_port[port].redir_lid = cpi->redir_lid; >>               p_mon_node->redir_port[port].redir_qp = cpi->redir_qp; >> +             p_mon_node->redir_port[port].redir_pkey = cpi->redir_pkey; >> +             p_mon_node->redir_port[port].redir_pkey_ix = 
pkey_ix; >>               cl_plock_release(pm->lock); >> >> +             if (invalid) >> +                     goto Exit; >> + >>               /* Finally, reissue the query to the redirected location */ >>               status = >>                   osm_perfmgr_send_pc_mad(pm, cpi->redir_lid, cpi->redir_qp, >> -                                         port, >> +                                         pkey_ix, port, >>                                           mad_context->perfmgr_context. >>                                           mad_method, mad_context); >>               if (status != IB_SUCCESS) >> @@ -1234,6 +1402,7 @@ osm_perfmgr_init(osm_perfmgr_t * const pm, osm_opensm_t *osm, >>       pm->sweep_time_s = p_opt->perfmgr_sweep_time_s; >>       pm->max_outstanding_queries = p_opt->perfmgr_max_outstanding_queries; >>       pm->osm = osm; >> +     pm->local_port = -1; >> >>       status = cl_timer_init(&pm->sweep_timer, perfmgr_sweep, pm); >>       if (status != IB_SUCCESS) > > In general I would suggest to not mix redirection case with main flow > (by using better data structures). The Redirection is not something > PerfMgr specific and ideally we could have separate redirection handling > module. I'm not requesting to do this now (in this implementation), but > at least flow separation is very desirable. I've done some more of this in subsequent patches not yet submitted. However, there is no other GS manager (and if it did exist would it use redirection ? there are known cases for PerfMgr currently). In any case, this is "poor man's" redirection support given what is currently available in the OpenFabrics IB stack. 
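Coming back to the pkey lookup discussed earlier in this reply, returning the index directly (or a negative value on failure) could look roughly like the sketch below. It assumes int is wider than 16 bits so that every valid index fits; find_pkey_index() is a hypothetical name, and the table is compared as raw values, as in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of pkey in pkey_table, or -1 if it is not present.
 * Values are compared exactly as stored, with no byte-order conversion.
 */
static int find_pkey_index(const uint16_t *pkey_table, size_t num_pkeys,
			   uint16_t pkey)
{
	size_t i;

	for (i = 0; i < num_pkeys; i++)
		if (pkey_table[i] == pkey)
			return (int)i;
	return -1;
}
```

A caller would then test for a negative result instead of checking a boolean and reading an out parameter.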
-- Hal > Sasha > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From jsquyres at cisco.com Wed Apr 15 13:12:44 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 15 Apr 2009 16:12:44 -0400 Subject: [ofa-general] Fwd: [ewg] /etc/init.d/openibd is tailing /var/log/messages References: Message-ID: Re-sending to general list (bugzilla is impossible to use right now): > In OFED 1.4.1rc3, I see the following in /etc/init.d/openibd: > > tail -50 /var/log/messages >> $DEBUG_INFO > > That is not a valid assumption; syslogging may be entirely remote. > For example, I get "tail: cannot open `/var/log/messages' for reading: > No such file or directory" when I run openibd (it failed for another > reason). > -- Jeff Squyres Cisco Systems From hnrose at comcast.net Wed Apr 15 11:44:00 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Wed, 15 Apr 2009 14:44:00 -0400 Subject: [ofa-general] ***SPAM*** [PATCH 1/2] opensm/iba/ib_types.h: Add MaxCreditHint and LinkRoundTripLatency to PortInfo attribute Message-ID: <20090415184400.GA10166@comcast.net> Also, add comment on error thresholds Signed-off-by: Hal Rosenstock --- diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h index 71443bd..1be2109 100644 --- a/opensm/include/iba/ib_types.h +++ b/opensm/include/iba/ib_types.h @@ -4433,7 +4433,9 @@ typedef struct _ib_port_info { uint8_t guid_cap; uint8_t subnet_timeout; /* cli_rereg(1b), resrv(2b), timeout(5b) */ uint8_t resp_time_value; - uint8_t error_threshold; + uint8_t error_threshold; /* local phy errors(4b), overrun errors(4b) */ + ib_net16_t max_credit_hint; + ib_net32_t link_rt_latency; /* reserv(8b), link round trip lat(24b) */ } PACK_SUFFIX ib_port_info_t; #include /************/ From YJia at tmriusa.com Wed Apr 15 15:45:30 2009 From: YJia 
at tmriusa.com (Yicheng Jia) Date: Wed, 15 Apr 2009 17:45:30 -0500 Subject: [ofa-general] link width problem of Qlogic 9024 unmanaged switch In-Reply-To: <20090305055538.0A496F204CF@openfabrics.org> Message-ID: Hello Randy, I am trying to run "ibportstate reset" to reset the switch port on the other side in order to get 4x link. However I get the following error: ibwarn: [19660] mad_rpc: _do_madrpc failed; dport (Lid 7) ibportstate: iberror: failed: smp set portinfo failed And the port status changes to DOWN after this. Have you ever tried to run "ibportstate" to reset the switch port? Thanks! Yicheng Jia ------------------------------ Message: 2 Date: Wed, 4 Mar 2009 18:39:54 -0600 From: Randy Halverson Subject: [ofa-general] link width problem of Qlogic 9024 unmanaged switch To: "'general at lists.openfabrics.org'" Message-ID: <88EC963376E93B4DB0F2A69D932F786903D89CAF at MNEXMB2.qlogic.org> Content-Type: text/plain; charset="us-ascii" Hello Yicheng, After checking internally, this appears to be a known problem with older firmware for the 9024FC switches. It appears that you or another person at 'tmriusa.com' has recently opened a case with QLogic Tech Support for this issue. Please continue to work with QLogic Tech Support on firmware upgrade resolution since you probably don't have our FastFabric Tools to manage the 9024FC switches. Regards, Randy Technical Support QLogic Corporation -------------- next part -------------- _____________________________________________________________________________ Scanned by IBM Email Security Management Services powered by MessageLabs. For more information please visit http://www.ers.ibm.com _____________________________________________________________________________
From jsquyres at cisco.com Wed Apr 15 17:49:06 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 15 Apr 2009 20:49:06 -0400 Subject: [ofa-general] Purpose of install.pl install option #1? Message-ID: I notice that when selecting to install the OFED 1.4.1rc3 software, option 1 is "build/install the core OpenFabrics stuff". But librdmacm is not included in this set. Given that you need librdmacm to make any iWARP connections, shouldn't it be part of this set? (and/or any other necessary CM-like packages) Indeed, it seems like the only HPC-level packages should really be the MPI packages -- *everything* else is basic functionality for OpenFabrics... Right? -- Jeff Squyres Cisco Systems From sashak at voltaire.com Wed Apr 15 17:54:22 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 16 Apr 2009 03:54:22 +0300 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Re: [PATCHv2] opensm/PerfMgr: Better redirection support In-Reply-To: References: <20090312202134.GC25024@comcast.net> <20090415125925.GF7353@sk> Message-ID: <20090416005422.GC10146@sk> On 16:29 Wed 15 Apr , Hal Rosenstock wrote: > >> > >>  /* Redirection information */ > >>  typedef struct redir { > >> +     boolean_t redirection; > >> +     boolean_t invalid; > > > > Why using lid value != 0 is/was bad for redirection invalidation? > > b/c there are other fields supplied in the redirection which also > could be invalid so this one flag summarizes that instead of > overloading one of the fields. Yes, and you can use lid value as such flag - just simpler. > >> -             p_mon_node->redir_port[port].redir_lid = 0; > >> -             p_mon_node->redir_port[port].redir_qp = 0; > >> +             
/* Clear redirection info for this port except orig_lid */ > >> +             orig_lid = p_mon_node->redir_port[port].orig_lid; > >> +             memset(&p_mon_node->redir_port[port], 0, sizeof(redir_t)); > >> +             p_mon_node->redir_port[port].orig_lid = orig_lid; > > > > Hmm, why should 'orig_lid' be part of redirection structure and not > > placed on original node/port (below I see that it is used in > > non-redirected paths)? > > What are you referring to here ? Actually I was wrong - I don't see where it is used at all. The comment about using in non-redirected path was about pkey_ix. > > I think it would be better to use structures like: > > > > struct node { > >        .... > >        uint16_t lid; > >        uint16_t pkey_ix; > > Why would lid and pkey_ix be part of node ? You are using this with perfmgr_send_pc_mad() in main flow. > It's like this with redir_tbl_size r.t. num_ports and redir_t > redir_port[1] instead of your struct port {} ports[0]. Why would what > you propose be better ? My point was different - to separate redirection related data from main flow. > > PerfMgr is always running over discovered fabric so maybe local port > > number should be detected later at start of PerfMgr process cycle just > > using OpenSM DB. > > Why is that better than doing this at bind time of PerfMgr ? At least two reasons: faster and less code. > >>       } > >> @@ -511,6 +556,10 @@ __osm_perfmgr_query_counters(cl_map_item_t * const p_map_item, void *context) > >>               if (!osm_node_get_physp_ptr(node, port)) > >>                       continue; > >> > >> +             if (mon_node->redir_port[port].redirection && > >> +                 mon_node->redir_port[port].invalid) > >> +                     continue; > >> + > > > > Are two flags really needed? Couldn't this be stripped down? > > Just invalid should be sufficient. I'll change this in the next version. 
> > > Also what about letting "chance" for port to refresh redirection info? > > What do you mean ? When port has invalid redirection data, should you care about attempting to refresh this? > >> +     /* call the transport layer for a list of local port pkeys */ > >> +     status = osm_vendor_get_all_port_attr(pm->subn->p_osm->p_vendor, > >> +                                           attr_array, &num_ports); > > > > This heavy stuff is performed per redirection request, but it actually > > uses same data (local port's pkey table). This looks very inefficient. > > Instead of doing this you can get local port's pkey table only once at > > PerfMgr process cycle start and do all checks against this already > > initialized table. Of course only in case when redirection is enabled at > > all. > > Redirection does not occur frequently. How could we know:) > Also, the pkey table could > change in between When OpenSM is in master mode it cannot change (PerfMgr is synchronized with heavy sweep). It is possible with standby OpenSM, so what - this single request will fail once. > and there's no local event support in OpenSM so I > don't see a way around this other than polling. Polling is not needed there. > >> +                     OSM_LOG(pm->log, OSM_LOG_ERROR, > >> +                             "ERR 4C1B: Index for Pkey 0x%x not found\n", > >> +                             cl_ntoh16(cpi->redir_pkey)); > >> +                     invalid = TRUE; > >>               } > > > > All above are not OpenSM errors, but wrong external data. I think it > > should be logged as VERBOSE messages. > > I agree it's wrong external data but it seems serious enough to me to > treat as an error. And some stupid port will be able to put OpenSM in endless error printing. I don't think it is a good idea. > If not, at least INFO rather than VERBOSE so > nothing special needs to be done to see these. 
IMO VERBOSE level is most appropriate for such things, INFO is something else and it is on by default. > >> -                                    "redirection requested but disabled\n"); > >> +                     OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C16: " > >> +                             "redirection requested but disabled\n"); > >> +                     invalid = TRUE; > >> +             } > > > > This is not an error. > > Seems like some sort of configuration error to me if this is disabled > at the manager but the PMA wants to use it. PMA shouldn't dictate here. > Other local configuration > errors are treated as errors. This is not a configuration error - if admin decided to not bother with redirection support that is fine. > > BTW, why to bother with verifying redirection info when redirection > > support is disabled anyway? > > I thought it was useful to know the redirection info was invalid > rather than getting the disabled notification and then enabling and > finding out. For PMAs debug purposes redirection support should be switched "on" obviously. > It can easily be moved up earlier in the flow if that's > better. Would be better IMO. > > In general I would suggest to not mix redirection case with main flow > > (by using better data structures). The Redirection is not something > > PerfMgr specific and ideally we could have separate redirection handling > > module. I'm not requesting to do this now (in this implementation), but > > at least flow separation is very desirable. > > I've done some more of this in subsequent patches not yet submitted. > However, there is no other GS manager (and if it did exist would it > use redirection ? there are known cases for PerfMgr currently). > > In any case, this is "poor man's" redirection support given what is > currently available in the OpenFabrics IB stack. Right. 
So I'm not requesting "generic redirection module", just to not mix the main and redirected flows. Sasha From akpm at linux-foundation.org Wed Apr 15 14:32:28 2009 From: akpm at linux-foundation.org (Andrew Morton) Date: Wed, 15 Apr 2009 14:32:28 -0700 Subject: [ofa-general] Re: [PATCH] fix a warning on drivers/infiniband/hw/nes/nes_cm.c:862: In-Reply-To: <20090412122317.GA4787@helight> References: <20090412122317.GA4787@helight> Message-ID: <20090415143228.d57a0201.akpm@linux-foundation.org> On Sun, 12 Apr 2009 20:23:17 +0800 Zhenwen Xu wrote: > Fix this warning: > drivers/infiniband/hw/nes/nes_cm.c:862: warning: unused variable 'tmp_addr' > > the 'tmp_addr' is defined for debug, so it should be defined in > CONFIG_INFINIBAND_NES_DEBUG > > > >From 5f67884bcda5450807dcd080378d829628e4db1c Mon Sep 17 00:00:00 2001 > From: Zhenwen Xu > Date: Sun, 12 Apr 2009 20:12:18 +0800 > Subject: [PATCH] fix a warning on drivers/infiniband/hw/nes/nes_cm.c:862: > > Signed-off-by: Zhenwen Xu > --- > drivers/infiniband/hw/nes/nes_cm.c | 3 ++- > 1 files changed, 2 insertions(+), 1 deletions(-) > > diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c > index dbd9a75..1bad93b 100644 > --- a/drivers/infiniband/hw/nes/nes_cm.c > +++ b/drivers/infiniband/hw/nes/nes_cm.c > @@ -854,8 +854,9 @@ static struct nes_cm_listener *find_listener(struct nes_cm_core *cm_core, > { > unsigned long flags; > struct nes_cm_listener *listen_node; > +#ifdef CONFIG_INFINIBAND_NES_DEBUG > __be32 tmp_addr = cpu_to_be32(dst_addr); > - > +#endif > /* walk list and find cm_node associated with this session ID */ > spin_lock_irqsave(&cm_core->listen_list_lock, flags); > list_for_each_entry(listen_node, &cm_core->listen_list.list, list) { eek, an ugly ifdef. And we can't just remove tmp_addr because printk(%p) wants to be passed an address rather than a value. It'd be nice if we had a handy macro to squish the warning, like uninitialized_var. 
As it happens, uninitialized_var() _does_ suppress the unused-var warning: --- a/drivers/infiniband/hw/nes/nes_cm.c~drivers-infiniband-hw-nes-nes_cmc-fix-unused-var-warning-cleanup +++ a/drivers/infiniband/hw/nes/nes_cm.c @@ -854,9 +854,8 @@ static struct nes_cm_listener *find_list { unsigned long flags; struct nes_cm_listener *listen_node; -#ifdef CONFIG_INFINIBAND_NES_DEBUG - __be32 tmp_addr = cpu_to_be32(dst_addr); -#endif + __be32 uninitialized_var(tmp_addr) = cpu_to_be32(dst_addr); + /* walk list and find cm_node associated with this session ID */ spin_lock_irqsave(&cm_core->listen_list_lock, flags); list_for_each_entry(listen_node, &cm_core->listen_list.list, list) { but that seems a bit abusive ;) From sashak at voltaire.com Wed Apr 15 17:09:34 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 16 Apr 2009 03:09:34 +0300 Subject: [ofa-general] Re: [PATCH] Update mad formatting functions. In-Reply-To: <20090415140341.dd26d8dc.weiny2@llnl.gov> References: <20090311144404.bf15ba8b.weiny2@llnl.gov> <20090415153003.GA20857@sk> <20090415140341.dd26d8dc.weiny2@llnl.gov> Message-ID: <20090416000934.GB10146@sk> On 14:03 Wed 15 Apr , Ira Weiny wrote: > > > > ib_mad_f table also has a name string field. I think it can be useful > > too - will help to unify outputs. Of course this can be done as > > subsequent patch. > > Yes but I don't think we should force users to use any specific output. If > they want to print the "name" of a field that should be a separate specifier > __not__ automatic. Is this what you mean? Yes, exactly - using a separate conversion symbol, not both in %F. > > Lids are printed as decimals. > > Well I thought I copied the output from the example but I see that it is > printing decimal. So?? :-/ I fixed it. > > As an aside, not all LID's are decimal. Should we change this? Yes, it is better to have things unified. > from fields.c > ... > {BE_OFFS(256, 16), "DrSmpDLID", mad_dump_hex}, > {BE_OFFS(272, 16), "DrSmpSLID", mad_dump_hex}, > ...
> {BITSOFFS(224, 16), "RedirectLID", mad_dump_hex}, > {BITSOFFS(480, 16), "TrapLID", mad_dump_hex}, > ... > {BITSOFFS(320, 16), "PathRecDLid", mad_dump_hex}, > {BITSOFFS(336, 16), "PathRecSLid", mad_dump_hex}, > ... > {BITSOFFS(288, 16), "McastMemMLid", mad_dump_hex}, Old stuff... > > Now instead of reimplementing *printf() functions with potential need > > to follow their extensions/conventions/update/etc wouldn't it be easier > > (and in long term safer) to just rebuild format string by resolving > > known %X conversions and then to pass it with rest parameters to > > standard libc's *printf()? > > > > In this way we will support all what *printf()s know + our conversions. > > > > I thought about that but decided not to do it. I can't remember why though... > ;-) So maybe I agree with you, let me try and remember and if I can't I will > change it. Thanks. Sasha From jsquyres at cisco.com Wed Apr 15 18:48:52 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 15 Apr 2009 21:48:52 -0400 Subject: [ofa-general] ***SPAM*** SPAM eggs SPAM bacon and SPAM In-Reply-To: <49e5ddfa.0136640a.2bf4.0fc7@mx.google.com> References: <49e5ddfa.0136640a.2bf4.0fc7@mx.google.com> Message-ID: Please note: - Jeff Becker is a volunteer. He sysadmins the OF server in his spare time (in addition to his real job; e.g., he just called me at 9:30pm to help rescue the mail server) - Specifically: Jeff B. does not have the time/cycles/expertise to fix the SPAM label problem (neither do I) - If someone wants to go in and fix the SPAM label problem, please do so I feel compelled to point out: - OpenFabrics could pay someone to sysadmin the box (and get real SSL certificates and ...) 
- The lists could all be set to only-members-can-post and the SPAM label problem goes away On Apr 15, 2009, at 9:15 AM, Tom Talpey wrote: > The openfabrics mail server is flagging every message from domains > such as gmail.com and yahoo.com with ***SPAM***, and as a result > every message on my screen this morning was advertising the lovely > canned meat. > > Going back a few days and 100 messages, in fact, only 30 of them > *weren't* decorated with SPAM. > > This has to stop. > > Tom. > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Jeff Squyres Cisco Systems From bschubert at ddn.com Tue Apr 14 09:57:48 2009 From: bschubert at ddn.com (Bernd Schubert) Date: Tue, 14 Apr 2009 18:57:48 +0200 Subject: ***SPAM*** Re: [ofa-general] mlx4: errors and failures on OOM In-Reply-To: <20090414091223.c7911402.weiny2@llnl.gov> References: <200904112233.51105.bs_lists@aakef.fastmail.fm> <20090414091223.c7911402.weiny2@llnl.gov> Message-ID: <20090414185748.5ea98ae7@beno.local.bs> Hello Ira, please see my answer below. On Tue, 14 Apr 2009 09:12:23 -0700 Ira Weiny wrote: > On Mon, 13 Apr 2009 07:40:33 -0400 > Hal Rosenstock wrote: > > > On Sat, Apr 11, 2009 at 4:33 PM, Bernd Schubert > > wrote: > > > Hello, > > > > > > last week I had issues with Lustre failures, which turned out to be > > > failures of many clients, which run into out-of-memory due to bad user space jobs > > > (and no protection again that by the queuing system). > > > > > > Anyway, I don't think IB is supposed to fail, when the oom killer activates. 
> > > > > > Errors for 0x001b0d0000008ede "Cisco Switch" > > >   5: [XmtDiscards == 270] > > >         Link info:     38    5[  ]  ==( 4X 5.0 Gbps)==>  0x00188b9097fe2a81    1[  ] "eul0605 HCA-1" > > >   16: [XmtDiscards == 132] > > >         Link info:     38   16[  ]  ==( 4X 5.0 Gbps)==>  0x00188b9097fe2a01    1[  ] "eul0616 HCA-1" > > > > > > I used a script to monitor the fabric for failures every 5 min and just when the oom > > > killer activated on the clients the messages above came up. > > > > XmtDiscards are the total number of outbound packets discarded by the port > > because the port is down or congested. Reasons for this include: > > • Output port is not in the active state > > • Packet length exceeded NeighborMTU > > • Switch Lifetime Limit exceeded > > • Switch HOQ Lifetime Limit exceeded > > This may also include packets discarded while in VLStalled State. > > > For what you are describing this is "normal". "Normal" in the sense that the > HCA is no longer accepting inbound packets and the switch discards them. > > > > > > Below are syslogs from one of these clients > > > > > > Apr  4 08:50:38 eul0605 kernel: Lustre: Request x50173 sent from MGC172.17.31.247 at o2ib to NID 172.17.31.247 at o2ib 51s ago has timed out (limit > > > 300s). > > > Apr  4 08:50:38 eul0605 kernel: Lustre: Skipped 30 previous similar messages > > > Apr  4 08:50:38 eul0605 kernel: LustreError: 166-1: MGC172.17.31.247 at o2ib: Connection to service MGS via nid 172.17.31.247 at o2ib was lost; in > > > progress operations using this service will fail. > > > Apr  4 08:50:38 eul0605 kernel: Lustre: home1-MDT0000-mdc-0000010430fa0800: Connection to service home1-MDT0000 via nid 172.17.31.247 at o2ib was > > > lost; in progress operations using this service will wait for recovery to complete. 
> > > Apr  4 08:50:38 eul0605 kernel: Lustre: Skipped 7 previous similar messages > > > Apr  4 08:50:38 eul0605 kernel: Lustre: tmp-OST0003-osc-0000010423750000: Connection to service tmp-OST0003 via nid 172.17.31.231 at o2ib was lost; in > > > progress operations using this service will wait for recovery to complete. > > > Apr  4 08:50:38 eul0605 kernel: Lustre: Skipped 29 previous similar messages > > > Apr  4 08:50:38 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  req at 000001041bcbb800 x50205/t0 > > > o250->MGS at 172.17.31.247@o2ib:26/25 lens 304/456 e 0 to 1 dl 1238828031 ref 2 fl Rpc:N/0/0 rc 0/0 > > > Apr  4 08:50:38 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) Skipped 31 previous similar messages > > > Apr  4 08:50:38 eul0605 kernel: Lustre: Request x50205 sent from MGC172.17.31.247 at o2ib to NID 172.17.31.247 at o2ib 51s ago has timed out (limit > > > 300s). > > > > > > ===> So somehow lustre lost the network connection. On the server side the > > > logs simply show this node didn't answer to pings anymore. > > > > > > > > > Apr  4 08:52:58 eul0605 kernel: Lustre: Skipped 31 previous similar messages > > > Apr  4 08:52:59 eul0605 kernel: Lustre: Changing connection for MGC172.17.31.247 at o2ib to MGC172.17.31.247 at o2ib_1/172.17.31.246 at o2ib > > > Apr  4 08:52:59 eul0605 kernel: Lustre: Skipped 61 previous similar messages > > > Apr  4 08:53:00 eul0605 kernel: oom-killer: gfp_mask=0xd2 > > > > > > [...] > > > > > > Apr  4 08:53:05 eul0605 kernel: Out of Memory: Killed process 10612 (gamos). > > > Apr  4 08:53:10 eul0605 kernel: 3212 pages swap cached > > > Apr  4 08:53:10 eul0605 kernel: Out of Memory: Killed process 10292 (tcsh). > > > > > > ===> And here we see, gamos consumed all memory again. 
> > > > > > Apr  4 08:53:10 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  req at 0000010430f8f800 x50237/t0 > > > o250->MGS at MGC172.17.31.247@o2ib_1:26/25 lens 304/456 e 0 to 1 dl 1238828107 ref 2 fl Rpc:N/0/0 rc 0/0 > > > Apr  4 08:53:10 eul0605 kernel: LustreError: 3799:0:(events.c:66:request_out_callback()) Skipped 31 previous similar messages > > > Apr  4 08:53:10 eul0605 kernel: Lustre: Request x50237 sent from MGC172.17.31.247 at o2ib to NID 172.17.31.246 at o2ib 50s ago has timed out (limit > > > 300s). > > > Apr  4 08:53:10 eul0605 kernel: Lustre: Skipped 31 previous similar messages > > > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > > > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > > > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > > > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > > > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > > > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > > > Apr  4 08:53:10 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > > > Apr  4 08:53:11 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > > > Apr  4 08:53:11 eul0605 kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 > > > > That multicast group looks like the IPv4 broadcast group; -11 is > > EAGAIN. I'm not sure what's causing IPoIB to indicate this but I > > wonder if this is a second level failure due to the previous (Lustre) > > error detected. 
> > > > -- Hal > > > > > ===> So we see the reason why Lustre lost network connection - infiniband is down. > > > > > > > > > In most cases IB recovers from that situation, not always. If it then entirely > > > fails, ibnetdiscover or ibclearerrors will report that can't resolve the route > > > to these nodes. > > > > > > > > > This with drivers from ofed-1.3.1. Any ideas why OOM causes issues with IB? > > Are you getting any errors on the console from the kernel on these nodes? > Specifically from the HCA (I think it was mlx4) driver? If the nodes > recover I assume that means the ib0 errors go away and lustre reconnects? I get only the logs I already posted above. Most nodes recover, but some do not: I can't log in to these anymore, ping on ethX still works, but not on IBX, and ibnetdiscover et al. report errors about not being able to resolve the route to these nodes. When I first got in touch with this cluster I immediately complained that there is no serial console; unfortunately this is not solved yet, so I can't check the status of the 'zombie' nodes. Thanks, Bernd From devel-ofed at morey-chaisemartin.com Wed Apr 15 22:43:11 2009 From: devel-ofed at morey-chaisemartin.com (Nicolas Morey-Chaisemartin) Date: Thu, 16 Apr 2009 07:43:11 +0200 Subject: [ofa-general] link width problem of Qlogic 9024 unmanaged switch In-Reply-To: References: Message-ID: <49E6C56F.4000502@morey-chaisemartin.com> By any chance, have you reset the port you're on? Have you tried using another node to enable the port again? Nicolas On 16/04/2009 00:45, Yicheng Jia wrote: > > Hello Randy, > > I am trying to run "ibportstate reset" to reset the switch port on the > other side in order to get 4x link. However I get the following error: > ibwarn: [19660] mad_rpc: _do_madrpc failed; dport (Lid 7) > ibportstate: iberror: failed: smp set portinfo failed > > And the port status changes to DOWN after this. Have you ever tried to > run "ibportstate" to reset the switch port?
> > Thanks! > > Yicheng Jia > > > > > > ------------------------------ > > Message: 2 > Date: Wed, 4 Mar 2009 18:39:54 -0600 > From: Randy Halverson > Subject: [ofa-general] link width problem of Qlogic 9024 unmanaged > switch > To: "'general at lists.openfabrics.org'" > Message-ID: > <88EC963376E93B4DB0F2A69D932F786903D89CAF at MNEXMB2.qlogic.org> > Content-Type: text/plain; charset="us-ascii" > > Hello Yicheng, > > After checking internally, this appears to be a known problem with older > firmware for the 9024FC switches. > > It appears that you or another person at 'tmriusa.com' has recently > opened a case with QLogic Tech Support for this issue. Please continue > to work with QLogic Tech Support on firmware upgrade resolution since > you probably don't have our FastFabric Tools to manage the 9024FC switches.. > > Regards, > > Randy > Technical Support > QLogic Corporation > -------------- next part -------------- > > > _____________________________________________________________________________ > Scanned by IBM Email Security Management Services powered by > MessageLabs. For more information please visit http://www.ers.ibm.com > _____________________________________________________________________________ > > > _____________________________________________________________________________ > Scanned by IBM Email Security Management Services powered by > MessageLabs. 
For more information please visit http://www.ers.ibm.com > _____________________________________________________________________________ > > > ------------------------------------------------------------------------ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From bart.vanassche at gmail.com Wed Apr 15 23:04:36 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Thu, 16 Apr 2009 08:04:36 +0200 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** SPAM eggs SPAM bacon and SPAM In-Reply-To: References: <49e5ddfa.0136640a.2bf4.0fc7@mx.google.com> Message-ID: On Thu, Apr 16, 2009 at 3:48 AM, Jeff Squyres wrote: > I feel compelled to point out: > > - OpenFabrics could pay someone to sysadmin the box (and get real SSL > certificates and ...) > - The lists could all be set to only-members-can-post and the SPAM label > problem goes away None of the other mailing lists I know of does any attempt to find out which message is a spam message and which one is not. The simplest measure to take is to configure the software that runs the mailing list such that it does not insert the annoying "SPAM" label. Bart. From bart.vanassche at gmail.com Wed Apr 15 23:06:23 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Thu, 16 Apr 2009 08:06:23 +0200 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Any per-port counter a user can look at? In-Reply-To: References: Message-ID: On Wed, Apr 15, 2009 at 11:12 PM, Chris Worley wrote: > Like RX and TX on an Ethernet interface, are there any IB port I/O > counters a user can query, and if so, how? Something like /sys/class/net/*/statistics/* ? Bart. 
From ramachandra.kuchimanchi at qlogic.com Wed Apr 15 23:44:40 2009 From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K) Date: Thu, 16 Apr 2009 12:14:40 +0530 Subject: ***SPAM*** Re: ***SPAM*** Re: [ofa-general] ***SPAM*** Any per-port counter a user can look at? In-Reply-To: References: Message-ID: <71d336490904152344u267a65e7t63979bbd0a3360dc@mail.gmail.com> On Thu, Apr 16, 2009 at 11:36 AM, Bart Van Assche wrote: > On Wed, Apr 15, 2009 at 11:12 PM, Chris Worley wrote: >> Like RX and TX on an Ethernet interface, are there any IB port I/O >> counters a user can query, and if so, how? > > Something like /sys/class/net/*/statistics/* ? The above will give IPoIB details. From /sys/class/infiniband/*/ports/*/counters/ you can get link level details. Regards, Ram From weiny2 at llnl.gov Thu Apr 16 00:16:43 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 16 Apr 2009 00:16:43 -0700 Subject: [ofa-general] Re: [PATCH v2] Update mad formatting functions. In-Reply-To: <20090416000934.GB10146@sk> References: <20090311144404.bf15ba8b.weiny2@llnl.gov> <20090415153003.GA20857@sk> <20090415140341.dd26d8dc.weiny2@llnl.gov> <20090416000934.GB10146@sk> Message-ID: <20090416001643.21f63cd6.weiny2@llnl.gov> Ok, v2 has a couple of changes. 1) implements the mad_vsnprintf with vsnprintf. 2) change formatting char to 'm' since "F" is floating point 3) add 'M' for printing the "name" of the field specified. The reason I did not use vsnprintf before was because of this statement in the vsnprintf man page. The functions vprintf(), vfprintf(), vsprintf(), vsnprintf() are equiv- alent to the functions printf(), fprintf(), sprintf(), snprintf(), respectively, except that they are called with a va_list instead of a variable number of arguments. These functions do not call the va_end macro. Consequently, the value of ap is undefined after the call. The ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ application should call va_end(ap) itself afterwards. 
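To make the man page warning quoted above concrete, here is a minimal sketch, assuming a C99 compiler that provides va_copy. It shows the portable way to format the same arguments twice: take a copy before the first v*printf() call, because the original va_list is indeterminate afterwards. The helpers format_alloc() and format_string() are hypothetical and not part of libibmad. The sketch does not cover the other pattern, where each call is expected to advance the caller's list to the next argument; as far as I know that depends on how the platform defines va_list and is not guaranteed by the standard.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format into a freshly allocated string.
 * vsnprintf() runs twice over the same arguments, so the second
 * pass uses a private copy made before the first pass. */
static char *format_alloc(const char *fmt, va_list ap)
{
	va_list ap2;
	char *out;
	int len;

	va_copy(ap2, ap);                   /* copy while ap is still valid */
	len = vsnprintf(NULL, 0, fmt, ap);  /* first pass: measure only */
	if (len < 0) {
		va_end(ap2);
		return NULL;
	}

	out = malloc((size_t)len + 1);
	if (out)
		vsnprintf(out, (size_t)len + 1, fmt, ap2); /* second pass */
	va_end(ap2);                        /* every va_copy needs a va_end */
	return out;
}

/* Hypothetical variadic front end; the caller frees the result. */
char *format_string(const char *fmt, ...)
{
	va_list ap;
	char *out;

	va_start(ap, fmt);
	out = format_alloc(fmt, ap);
	va_end(ap);                         /* the v*printf calls never do this */
	return out;
}
```

A call such as format_string("%s-%d", "lid", 15) returns a heap string that the caller must free().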
I have made a comment in the patch where I am unsure of the call. This seems to work just fine on my Linux systems with gcc. Will this work on other systems/compilers? Ira From: Ira Weiny Date: Thu, 16 Apr 2009 00:07:03 -0700 Subject: [PATCH] Update mad formatting functions. Add mad_snprintf w/ man page Add mad_fprintf w/ man page Add comments to document current functions. Rename parameters to avoid confusion with other functions which take "buf" Mark mad_print_field as deprecated Signed-off-by: Ira Weiny --- libibmad/Makefile.am | 2 + libibmad/include/infiniband/mad.h | 28 ++++++++- libibmad/man/mad_fprintf.3 | 76 +++++++++++++++++++++++ libibmad/man/mad_snprintf.3 | 2 + libibmad/src/fields.c | 121 +++++++++++++++++++++++++++++++++++- libibmad/src/libibmad.map | 2 + 6 files changed, 224 insertions(+), 7 deletions(-) create mode 100644 libibmad/man/mad_fprintf.3 create mode 100644 libibmad/man/mad_snprintf.3 diff --git a/libibmad/Makefile.am b/libibmad/Makefile.am index 4f3ba98..da32899 100644 --- a/libibmad/Makefile.am +++ b/libibmad/Makefile.am @@ -5,6 +5,8 @@ INCLUDES = -I$(srcdir)/include -I$(includedir) lib_LTLIBRARIES = libibmad.la +man_MANS = man/mad_fprintf.3 man/mad_snprintf.3 + libibmad_la_CFLAGS = -Wall if HAVE_LD_VERSION_SCRIPT diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index b8290a7..ce840ac 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -719,9 +719,31 @@ MAD_EXPORT void mad_set_array(void *buf, int base_offs, enum MAD_FIELDS field, v MAD_EXPORT void mad_get_array(void *buf, int base_offs, enum MAD_FIELDS field, void *val); MAD_EXPORT void mad_decode_field(uint8_t * buf, enum MAD_FIELDS field, void *val); MAD_EXPORT void mad_encode_field(uint8_t * buf, enum MAD_FIELDS field, void *val); -MAD_EXPORT int mad_print_field(enum MAD_FIELDS field, const char *name, void *val); -MAD_EXPORT char *mad_dump_field(enum MAD_FIELDS field, char *buf, int bufsz, void *val); -MAD_EXPORT 
char *mad_dump_val(enum MAD_FIELDS field, char *buf, int bufsz, void *val); +MAD_EXPORT int mad_print_field(enum MAD_FIELDS field, const char *name, void *val) + DEPRECATED; + +/** + * The following functions print fields to "s" in various ways + * + * mad_dump_[val|field] take a value "val" and use "field" to format it + * + * mad_snprint_field takes a data buffer "buf" and uses field to extract and + * format it. + * + * RETURN "s" or NULL on failure + */ +MAD_EXPORT char *mad_dump_field(enum MAD_FIELDS field, char *s, int n, void *val); + /* outputs string ":........" */ +MAD_EXPORT char *mad_dump_val(enum MAD_FIELDS field, char *s, int n, void *val); + /* outputs string "" */ + +/** + * printf functions + * input's "standard" printf parameters except for "buf" which is a mad buffer + * return the number of actual chars written to "s" or "stream" + */ +MAD_EXPORT int mad_snprintf(char *s, size_t n, uint8_t *buf, const char *format, ...); +MAD_EXPORT int mad_fprintf(FILE *stream, uint8_t *buf, const char *format, ...); /* mad.c */ MAD_EXPORT void *mad_encode(void *buf, ib_rpc_t * rpc, ib_dr_path_t * drpath, diff --git a/libibmad/man/mad_fprintf.3 b/libibmad/man/mad_fprintf.3 new file mode 100644 index 0000000..f95f799 --- /dev/null +++ b/libibmad/man/mad_fprintf.3 @@ -0,0 +1,76 @@ +.\" -*- nroff -*- +.\" +.TH MAD_FPRINTF 3 "Feb 26, 2009" "OpenIB" "OpenIB Programmer\'s Manual" +.SH "NAME" +mad_fprintf, mad_snprintf \- formatted output conversion for mad packets +.SH "SYNOPSIS" +.nf +.B #include +.sp +.BI "MAD_EXPORT int mad_snprintf(char " "*s" ", size_t "n ", uint8_t " "*buf" ", const char " "*format" ", ...); +.BI "MAD_EXPORT int mad_fprintf(FILE " "*stream" ", uint8_t " "*buf" ", const char " "*format" ", ...); +.fi +.SH "DESCRIPTION" +Similar to the printf family of functions. The exception being they accept a +"buf" parameter which represents a mad data buffer. 
This buffer is used to +extract and print fields as specified with the +.B %m +and +.B %M +format specifiers. +.PP +.B m +The %m specifier is used to print out fields decoded from the "buf" data +buffer. +.B M +The %M specifier is used to print the name of the field specified. +.I enum MAD_FIELDS\fR +values should be used to specify the field to be decoded. +.PP +.SH "EXAMPLES" +.nf +char portinfo[64]; +void *pi = portinfo; +.PP +if (!smp_query(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout)) +.in +8 + return -1; +.in -16 +.PP +mad_fprintf(stdout, pi, "Port info (%s):\\n" +.in +16 +" %-10s (%M): %m\\n" +" %-10s (%M): %m\\n" +" %-10s (%M): %m\\n" +" %-10s (%M): %m\\n" +" %-10s (%M): %m\\n" +" %-10s (%M): %m\\n", +portid2str(portid), +"LID", IB_PORT_LID_F, IB_PORT_LID_F, +"LMC", IB_PORT_LMC_F, IB_PORT_LMC_F, +"state", IB_PORT_STATE_F, IB_PORT_STATE_F, +"physstate", IB_PORT_PHYS_STATE_F, IB_PORT_PHYS_STATE_F, +"linkwidth", IB_PORT_LINK_WIDTH_ACTIVE_F, IB_PORT_LINK_WIDTH_ACTIVE_F, +"linkspeed", IB_PORT_LINK_SPEED_ACTIVE_F, IB_PORT_LINK_SPEED_ACTIVE_F); +.in -16 +.PP +Results in the output. +.PP +Port info (DR path slid 0; dlid 0; 0,1,8,22): +.in +3 + LID (Lid): 15 + LMC (LMC): 0 + state (LinkState): Active + physstate (PhysLinkState): LinkUp + linkwidth (LinkWidthActive): 4X + linkspeed (LinkSpeedActive): 2.5 Gbps +.in -3 + +.SH "RETURN VALUE" +.B return the number of characters printed. 
+ +.SH "SEE ALSO" +.BR printf (3) +.SH "AUTHOR" +.TP +Ira Weiny diff --git a/libibmad/man/mad_snprintf.3 b/libibmad/man/mad_snprintf.3 new file mode 100644 index 0000000..c004ab9 --- /dev/null +++ b/libibmad/man/mad_snprintf.3 @@ -0,0 +1,2 @@ +.TH MAD_SNPRINTF 3 "Feb 26, 2009" "OpenIB" "OpenIB Programmer\'s Manual" +.so man3/mad_fprintf.3 diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c index df43ceb..af36912 100644 --- a/libibmad/src/fields.c +++ b/libibmad/src/fields.c @@ -39,6 +39,7 @@ #include #include #include +#include #include @@ -442,6 +443,9 @@ static const ib_field_t ib_mad_f[] = { }; +#define MAD_FIELD_MAX_BYTE_LEN (256) + /* currently "Vendor2Data" increased to the next power of 2 */ + static void _set_field64(void *buf, int base_offs, const ib_field_t * f, uint64_t val) { @@ -666,6 +670,7 @@ static int _mad_print_field(const ib_field_t * f, const char *name, void *val, valsz ? valsz : ALIGN(f->bitlen, 8) / 8); } +/* This function is deprecated use mad_snprint_field or mad_dump_* instead */ int mad_print_field(enum MAD_FIELDS field, const char *name, void *val) { if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) @@ -673,16 +678,124 @@ int mad_print_field(enum MAD_FIELDS field, const char *name, void *val) return _mad_print_field(ib_mad_f + field, name, val, 0); } -char *mad_dump_field(enum MAD_FIELDS field, char *buf, int bufsz, void *val) +char *mad_dump_field(enum MAD_FIELDS field, char *s, int n, void *val) { if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) return 0; - return _mad_dump_field(ib_mad_f + field, 0, buf, bufsz, val); + return _mad_dump_field(ib_mad_f + field, 0, s, n, val); } -char *mad_dump_val(enum MAD_FIELDS field, char *buf, int bufsz, void *val) +char *mad_dump_val(enum MAD_FIELDS field, char *s, int n, void *val) { if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) return 0; - return _mad_dump_val(ib_mad_f + field, buf, bufsz, val); + return _mad_dump_val(ib_mad_f + field, s, n, val); } + +static int 
mad_vsnprintf(char *s, size_t n, void *buf, const char *fmt, va_list args) +{ + int rc = 0; + char *str; + char tmp[256]; + +/* Macros allows for bounding length of print to provided buffer + * remove 1 to allow for \0 char */ +#define WRITE_CHAR(c) do { \ + *str++ = c; \ + if (++rc >= (n-1)) { \ + goto max_len_hit; \ + } \ +} while(0) + +#define WRITE_STR(STR) do { \ + const char *ls; \ + for (ls = STR; *ls != '\0'; ls++) \ + WRITE_CHAR(*ls); \ +} while (0) + + for (str=s ; *fmt ; /* fmt is incremented in body */) { + if (*fmt != '%') { + WRITE_CHAR(*fmt); + ++fmt; + continue; + } + + ++fmt; /* skip '%' */ + switch (*fmt) { + case 'M': + { + /* print our special mad field name */ + int field = va_arg(args, int); + ++fmt; /* consume 'M' */ + if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) { + WRITE_STR(""); + continue; + } + + WRITE_STR(ib_mad_f[field].name); + break; + } + case 'm': + { + /* print our special mad field */ + uint8_t val[MAD_FIELD_MAX_BYTE_LEN]; + int field = va_arg(args, int); + const ib_field_t *f = ib_mad_f + field; + + ++fmt; /* consume 'm' */ + if (field <= IB_NO_FIELD || field >= IB_FIELD_LAST_) { + WRITE_STR(""); + continue; + } + + mad_decode_field(buf, field, val); + f->def_dump_fn(tmp, n, val, ALIGN(f->bitlen, 8) / 8); + WRITE_STR(tmp); + break; + } + default: + { + /* process all other formatting */ + char tf[256]; + int i; + tf[0] = '%'; + for (i = 1; *fmt != '%' || i == 255; fmt++, i++) + tf[i] = *fmt; + tf[i]='\0'; + vsnprintf(tmp, 256, tf, args); /* <== not sure if this is safe??? */ + WRITE_STR(tmp); + break; + } + } + } +max_len_hit: + *str = '\0'; + return str-s; +} + +int mad_snprintf(char *s, size_t n, uint8_t *buf, const char *format, ...) +{ + va_list args; + int i; + + va_start(args, format); + i = mad_vsnprintf(s, n, buf, format, args); + va_end(args); + return (i); +} + +int mad_fprintf(FILE *stream, uint8_t *buf, const char *format, ...) 
+{ + char str_buf[1024]; + va_list args; + int i,j; + + va_start(args, format); + i = mad_vsnprintf(str_buf, 1024, buf, format, args); + va_end(args); + j = fprintf(stream, "%s", str_buf); + if (i != j) + IBWARN("mad_vsnprintf and fprintf don't match???\n"); + return (i); +} + diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map index 4306dbc..f55027c 100644 --- a/libibmad/src/libibmad.map +++ b/libibmad/src/libibmad.map @@ -4,6 +4,8 @@ IBMAD_1.3 { mad_dump_field; mad_dump_val; mad_print_field; + mad_snprintf; + mad_fprintf; mad_dump_array; mad_dump_bitfield; mad_dump_hex; -- 1.5.4.5 From weiny2 at llnl.gov Thu Apr 16 00:18:10 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 16 Apr 2009 00:18:10 -0700 Subject: [ofa-general] Re: [PATCH v2] NOT For inclusion in master: Simple example of mad_fprintf in action. In-Reply-To: <20090311144406.8cf92ba6.weiny2@llnl.gov> References: <20090311144406.8cf92ba6.weiny2@llnl.gov> Message-ID: <20090416001810.6f874007.weiny2@llnl.gov> Just an example of the use of mad_fprintf against v2 of the patch. Ira From: Ira Weiny Date: Thu, 16 Apr 2009 00:07:04 -0700 Subject: [PATCH] Simple example of mad_fprintf in action. 
Signed-off-by: Ira Weiny --- infiniband-diags/src/ibnetdiscover.c | 23 +++++++++++++++++++++++ 1 files changed, 23 insertions(+), 0 deletions(-) diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index 25c1f7f..71b478c 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -186,6 +186,29 @@ get_node(Node *node, Port *port, ib_portid_t *portid) return -1; decode_port_info(pi, port); + if (ibdebug || ibverbose) { + mad_fprintf(stdout, pi, "Port info (%s):\n" + " %-10s (%M): %m\n" + " %-10s (%M): %m\n" + " %-10s (%M): %m\n" + " %-10s (%M): %m\n" + " %-10s (%M): %m\n" + " %-10s (%M): %m\n", + portid2str(portid), + "LID", IB_PORT_LID_F, IB_PORT_LID_F, + "LMC", IB_PORT_LMC_F, IB_PORT_LMC_F, + "state", IB_PORT_STATE_F, IB_PORT_STATE_F, + "physstate", IB_PORT_PHYS_STATE_F, + IB_PORT_PHYS_STATE_F, + "linkwidth", IB_PORT_LINK_WIDTH_ACTIVE_F, + IB_PORT_LINK_WIDTH_ACTIVE_F, + "linkspeed", IB_PORT_LINK_SPEED_ACTIVE_F, + IB_PORT_LINK_SPEED_ACTIVE_F + ); + mad_fprintf(stdout, pi, "%sd%m%d%m\n", + "hello", IB_PORT_LID_F, 1, IB_PORT_LMC_F); + } + if (node->type != SWITCH_NODE) return 0; -- 1.5.4.5 From ogerlitz at voltaire.com Thu Apr 16 01:26:07 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 16 Apr 2009 11:26:07 +0300 Subject: [ofa-general] Any per-port counter a user can look at? In-Reply-To: References: Message-ID: <49E6EB9F.5000505@voltaire.com> Chris Worley wrote: > Like RX and TX on an Ethernet interface, are there any IB port I/O > counters a user can query, and if so, how? yes, IB L2 port counters can be queried using the perfquery(8) utility which is installed by infiniband-diags, note you would need to modprobe the ib_umad module. You may also want to use ibnetdiscover and/or ibv_devinfo to get the LIDs (IB L2 addresses) whose traffic you want to sample. Or. 
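For the sysfs route mentioned earlier in this thread (/sys/class/infiniband/*/ports/*/counters/), a minimal sketch of reading one counter from C. The function names ib_counter_path() and read_counter_file() are hypothetical, and the device name "mlx4_0" and counter name "port_xmit_data" used below are assumptions about a particular system; list the counters directory on your own node to see what is actually there.

```c
#include <stdio.h>

/* Hypothetical helper: build the sysfs path for one port counter.
 * Returns 0 on success, -1 if the path does not fit in buf. */
int ib_counter_path(char *buf, size_t n, const char *dev, int port,
		    const char *counter)
{
	int len = snprintf(buf, n,
			   "/sys/class/infiniband/%s/ports/%d/counters/%s",
			   dev, port, counter);

	return (len < 0 || (size_t)len >= n) ? -1 : 0;
}

/* Hypothetical helper: read a single unsigned decimal value from a file.
 * Returns 0 on success, -1 if the file is missing or unparsable. */
int read_counter_file(const char *path, unsigned long long *val)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%llu", val) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

Typical use would be to call ib_counter_path(path, sizeof(path), "mlx4_0", 1, "port_xmit_data") and then read_counter_file(path, &value) at two points in time and compare the values. These counters describe the local port only; perfquery is still the tool for querying remote ports by LID.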
From ogerlitz at voltaire.com Thu Apr 16 02:21:42 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 16 Apr 2009 12:21:42 +0300 Subject: [ofa-general] eggs SPAM bacon and SPAM In-Reply-To: References: <49e5ddfa.0136640a.2bf4.0fc7@mx.google.com> Message-ID: <49E6F8A6.2060408@voltaire.com> Jeff Squyres wrote: > - Jeff Becker is a volunteer. He sysadmins the OF server in his spare > time (in addition to his real job; e.g., he just called me at 9:30pm > to help rescue the mail server) > - Specifically: Jeff B. does not have the time/cycles/expertise to fix > the SPAM label problem > - If someone wants to go in and fix the SPAM problem, please do so, I > feel compelled to point out: > - OpenFabrics could pay someone to sysadmin the box (and get real SSL > certificates and ...) > - The lists could all be set to only-members-can-post and the SPAM > label problem goes away Roland, You have raised in the past the idea to move the general list to be @vger.kernel.org server, where among other things an excellent spam related expertise exists, I vote for doing it now. Or. > > > > > On Apr 15, 2009, at 9:15 AM, Tom Talpey wrote: > >> The openfabrics mail server is flagging every message from domains >> such as gmail.com and yahoo.com with ***SPAM***, and as a result >> every message on my screen this morning was advertising the lovely >> canned meat. >> >> Going back a few days and 100 messages, in fact, only 30 of them >> *weren't* decorated with SPAM. >> >> This has to stop. >> >> Tom. 
>> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general > > > --Jeff Squyres > Cisco Systems From vlad at lists.openfabrics.org Thu Apr 16 03:25:12 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 16 Apr 2009 03:25:12 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090416-0200 daily build status Message-ID: <20090416102512.3FDF0E60EE9@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with 
linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From kliteyn at dev.mellanox.co.il Thu Apr 16 03:52:31 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 16 Apr 2009 13:52:31 +0300 Subject: [ofa-general] [PATCH v2] opensm/osm_ucat_ftree.c Enhance min hops counters usage In-Reply-To: <49E44DD1.1050508@ext.bull.net> References: <49E44DD1.1050508@ext.bull.net> Message-ID: <49E70DEF.1000405@dev.mellanox.co.il> Hi Nicolas, Nicolas Morey Chaisemartin wrote: > This patch enhances the use of the min hop table done in the Fat-Tree algorithm. > Before this patch, the algorithm was using the osm_sw hops table to store the minhop values toward any lid (Switch or not). > As this table is allocated as we need it, it required a lot of malloc calls and quite some time to set the hops values on remote ports. 
> > This patch corrects this behaviour: > -The osm_sw hops table is only used for switch lid > -ftree_sw_t struct now has its own hop table (only 1 dimensionnal as we don't need to know which port is used) to store its minhop value > > > Signed-off-by: Nicolas Morey-Chaisemartin > --- > Fixed to work after > commit a10b57a2de9ace61455176ad5e43b7ca3d148cfb opensm/osm_ucast_ftree.c: lids are always handled in host order. > Memory allocation fixed (using right byte order + checking if succesfull) > > > opensm/opensm/osm_ucast_ftree.c | 66 +++++++++++++++++++++++++++++++-------- > 1 files changed, 53 insertions(+), 13 deletions(-) > > diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c > index dfe7009..83c901e 100644 > --- a/opensm/opensm/osm_ucast_ftree.c > +++ b/opensm/opensm/osm_ucast_ftree.c > @@ -172,6 +172,7 @@ typedef struct ftree_sw_t_ { > uint8_t up_port_groups_num; > boolean_t is_leaf; > unsigned down_port_groups_idx; > + uint8_t *hops; > } ftree_sw_t; > > /*************************************************** > @@ -554,6 +555,11 @@ static ftree_sw_t *sw_create(IN ftree_fabric_t * p_ftree, > > /* initialize lft buffer */ > memset(p_osm_sw->new_lft, OSM_NO_PATH, IB_LID_UCAST_END_HO + 1); > + p_sw->hops = > + malloc(p_osm_sw->max_lid_ho * sizeof(*(p_sw->hops))); > + if(p_sw->hops == NULL) > + return NULL; > + memset(p_sw->hops, OSM_NO_PATH, p_osm_sw->max_lid_ho); > > return p_sw; > } /* sw_create() */ > @@ -566,6 +572,7 @@ static void sw_destroy(IN ftree_fabric_t * p_ftree, IN ftree_sw_t * p_sw) > > if (!p_sw) > return; > + free(p_sw->hops); > > for (i = 0; i < p_sw->down_port_groups_num; i++) > port_group_destroy(p_sw->down_port_groups[i]); > @@ -693,32 +700,54 @@ static void sw_add_port(IN ftree_sw_t * p_sw, IN uint8_t port_num, > /***************************************************/ > > static inline cl_status_t sw_set_hops(IN ftree_sw_t * p_sw, IN uint16_t lid, > - IN uint8_t port_num, IN uint8_t hops) > + IN uint8_t port_num, IN 
uint8_t hops, > + IN boolean_t is_target_sw) > { > /* set local min hop table(LID) */ > - return osm_switch_set_hops(p_sw->p_osm_sw, lid, port_num, hops); > + p_sw->hops[lid] = hops; > + if (is_target_sw) > + return osm_switch_set_hops(p_sw->p_osm_sw, lid, port_num, hops); > + return 0; > } > > /***************************************************/ > > static int set_hops_on_remote_sw(IN ftree_port_group_t * p_group, > - IN uint16_t target_lid, IN uint8_t hops) > + IN uint16_t target_lid, IN uint8_t hops, > + IN boolean_t is_target_sw) > { > ftree_port_t *p_port; > uint8_t i, ports_num; > ftree_sw_t *p_remote_sw = p_group->remote_hca_or_sw.p_sw; > > + /* if lid is a switch, we set the min hop table in the osm_switch struct */ > CL_ASSERT(p_group->remote_node_type == IB_NODE_TYPE_SWITCH); > + p_remote_sw->hops[target_lid] = hops; > + > + /* If taget lid is a switch we set the min hop table values > + * for each port on the associated osm_sw struct */ I could be missing something here, but is the following code correct? > + if (!is_target_sw) > + return 0; > + > ports_num = (uint8_t) cl_ptr_vector_get_size(&p_group->ports); > for (i = 0; i < ports_num; i++) { > cl_ptr_vector_at(&p_group->ports, i, (void *)&p_port); > if (sw_set_hops(p_remote_sw, target_lid, > - p_port->remote_port_num, hops)) > + p_port->remote_port_num, hops, is_target_sw)) sw_set_hops() takes care of the hops table - sets local hop count for all types of targets and sets hops on osm_switch_t for switches only, so the "return 0;" above will cause the hops not to be set at all for HCA targets. 
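For readers following the diff, here is a minimal stand-alone C sketch of the control flow in the patched set_hops_on_remote_sw() as quoted above: the per-switch local entry is written first, and only switch targets go on to the per-port update. It is an illustration under simplifying assumptions, not OpenSM code. The names sk_sw, sk_sw_init, sk_set_hops_on_remote_sw and SK_NO_PATH are hypothetical, the per-port osm_switch_set_hops() calls are replaced by a counter, and the table is sized max_lid + 1 purely as a choice of this sketch so that max_lid itself is a valid index.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SK_NO_PATH 0xFF /* hypothetical stand-in for OSM_NO_PATH */

struct sk_sw {
	uint8_t *hops;        /* local table: one min-hop entry per LID */
	uint16_t max_lid;     /* highest LID the table can index */
	unsigned osm_updates; /* counts simulated per-port osm_switch_set_hops() calls */
};

/* Allocate the one-dimensional table and mark every LID as unreachable. */
static inline int sk_sw_init(struct sk_sw *sw, uint16_t max_lid)
{
	size_t n = (size_t)max_lid + 1;

	sw->hops = malloc(n);
	if (!sw->hops)
		return -1;
	memset(sw->hops, SK_NO_PATH, n);
	sw->max_lid = max_lid;
	sw->osm_updates = 0;
	return 0;
}

static inline int sk_set_hops_on_remote_sw(struct sk_sw *sw, uint16_t lid,
					   uint8_t hops, int is_target_sw,
					   unsigned ports_num)
{
	unsigned i;

	assert(lid <= sw->max_lid);

	/* Local table: written once, whatever the target type is. */
	sw->hops[lid] = hops;

	/* Non-switch target: no per-port table to fill in. */
	if (!is_target_sw)
		return 0;

	/* Switch target: one per-port update for each port in the group. */
	for (i = 0; i < ports_num; i++)
		sw->osm_updates++;
	return 0;
}
```

With a group of four ports, a call for a non-switch target leaves osm_updates at 0 while still recording the hop count locally; a call for a switch target records the hop count and performs four per-port updates.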
-- Yevgeny From ogerlitz at voltaire.com Thu Apr 16 04:53:41 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 16 Apr 2009 14:53:41 +0300 Subject: [ofa-general] ***SPAM*** [PATCH] infiniband-diags/vendstat: Update man page and examples for PortXmit/RcvDataSL counter support In-Reply-To: <20090408150444.GA24876@comcast.net> References: <20090408150444.GA24876@comcast.net> Message-ID: <49E71C45.8000903@voltaire.com> Hal Rosenstock wrote: > +vendstat -c 0,1 6,12 # configure IS4 port 12 counter groups for PortXmitDataSL > +vendstat -c 2,8 6,12 # configure IS4 port 12 counter groups for PortRcvDataSL Hal, thanks for working on the patch. The "," between the LID (6) and the port (12) is a bug; please remove it. Or. From monis at Voltaire.COM Thu Apr 16 05:04:38 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Thu, 16 Apr 2009 15:04:38 +0300 Subject: [ofa-general] [PATCH] rdma_cm: Add proc entry to monitor rdma_cm connections In-Reply-To: <49D2417D.9000608@Voltaire.COM> References: <49D2417D.9000608@Voltaire.COM> Message-ID: <49E71ED6.5000806@Voltaire.COM> Moni Shoua wrote: > For each rdma_cm_id that is attached to a device print a set of fields > that describe the connection that this id represents. > > Below is an example of the output of 'cat /proc/rdma_cm' > This example is for a host that runs an rping server and an rping client. > > TP DEV PO NDEV SRC DST PS ST QPN > 0 mthca0 0 0.0.0.0:7174 262 8 0 > 1 mthca0 1 ib0 192.30.3.249:34478 192.30.3.248:7174 262 5 328710 > 1 mthca0 1 ib0 192.30.3.249:7174 192.30.3.248:47625 262 5 328711 > > Signed-off-by: Moni Shoua > After getting no response from netdev to my question (whether using procfs would be accepted), and after a little studying, I'm convinced that the right place to put these files is debugfs, as Roland suggested. I'll send a patch that does so soon. 
thanks MoniS From monis at Voltaire.COM Thu Apr 16 05:09:05 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Thu, 16 Apr 2009 15:09:05 +0300 Subject: [ofa-general] [PATCH] rdma_cm: Add debugfs entries to monitor rdma_cm connections Message-ID: <49E71FE1.90102@Voltaire.COM> Create a virtual file under debugfs for each cma device and use it to print information about each rdma_id that is attached to this device. Here is an example of 'cat /sys/kernel/debug/rdma_cm/mthca0_rdma_id'. This example is for a host that runs a rping server (when a remote client is connected to it) and a rping client to a remote server. TP DEV PO NDEV SRC DST PS ST QPN 0 mthca0 0 0.0.0.0:7174 262 8 0 1 mthca0 1 ib0 192.30.3.249:50113 192.30.3.248:7174 262 5 66566 1 mthca0 1 ib0 192.30.3.249:7174 192.30.3.248:42560 262 5 66567 Signed-off-by: Moni Shoua -- drivers/infiniband/core/cma.c | 161 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 161 insertions(+) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 2a2e508..ea16d44 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -51,6 +51,9 @@ #include #include +#include +#include + MODULE_AUTHOR("Sean Hefty"); MODULE_DESCRIPTION("Generic RDMA CM Agent"); MODULE_LICENSE("Dual BSD/GPL"); @@ -59,6 +62,8 @@ MODULE_LICENSE("Dual BSD/GPL"); #define CMA_MAX_CM_RETRIES 15 #define CMA_CM_MRA_SETTING (IB_CM_MRA_FLAG_DELAY | 24) +static struct dentry *cma_root_dentry; + static void cma_add_one(struct ib_device *device); static void cma_remove_one(struct ib_device *device); @@ -86,6 +91,7 @@ struct cma_device { struct completion comp; atomic_t refcount; struct list_head id_list; + struct dentry *rdma_id_dentry; }; enum cma_state { @@ -2850,6 +2856,150 @@ static struct notifier_block cma_nb = { .notifier_call = cma_netdev_callback }; +static void *cma_rdma_id_seq_start(struct seq_file *file, loff_t *pos) +{ + struct cma_device *cma_dev = file->private; + void *ret; + + mutex_lock(&lock); 
+ if (*pos == 0) + return SEQ_START_TOKEN; + ret = seq_list_start_head(&cma_dev->id_list, *pos); + return ret; +} + +static void *cma_rdma_id_seq_next(struct seq_file *file, void *v, loff_t *pos) +{ + void *ret; + struct cma_device *cma_dev = file->private; + if (v == SEQ_START_TOKEN) { + ++*pos; + if (!list_empty(&cma_dev->id_list)) + ret = cma_dev->id_list.next; + else + ret = NULL; + } else { + ret = seq_list_next(v, &cma_dev->id_list, pos); + } + return ret; +} + +static void cma_rdma_id_seq_stop(struct seq_file *file, void *iter_ptr) +{ + mutex_unlock(&lock); +} + +static void format_addr(struct sockaddr *sa, char* buf) +{ + switch (sa->sa_family) { + case AF_INET: { + struct sockaddr_in *sin = (struct sockaddr_in *)sa; + sprintf(buf, "%pI4:%u", &sin->sin_addr.s_addr, + be16_to_cpu(cma_port(sa))); + break; + } + case AF_INET6: { + struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)sa; + sprintf(buf, "%pI6:%u", &sin6->sin6_addr, + be16_to_cpu(cma_port(sa))); + break; + } + default: + buf[0] = 0; + } +} + +static int cma_rdma_id_seq_show(struct seq_file *file, void *v) +{ + struct rdma_id_private *id_priv; + char local_addr[64], remote_addr[64]; + + if (!v) + return 0; + if (v == SEQ_START_TOKEN) { + seq_printf(file, + "%-3s" + "%-8s" + "%-3s" + "%-5s" + "%-52s" + "%-52s" + "%-5s" + "%-3s" + "%-8s" + "\n", + "TP", "DEV", "PO", "NDEV", "SRC", "DST", "PS", "ST", "QPN"); + } else { + id_priv = list_entry(v, struct rdma_id_private, list); + format_addr((struct sockaddr *)&id_priv->id.route.addr.src_addr, + local_addr); + format_addr((struct sockaddr *)&id_priv->id.route.addr.dst_addr, + remote_addr); + + seq_printf(file, + "%-3d" + "%-8s" + "%-3d" + "%-5s" + "%-52s" + "%-52s" + "%-5d" + "%-3d" + "%-8d" + "\n", + id_priv->id.route.addr.dev_addr.dev_type, + (id_priv->id.device) ? id_priv->id.device->name : "", + id_priv->id.port_num, + (id_priv->id.route.addr.dev_addr.src_dev) ? 
id_priv->id.route.addr.dev_addr.src_dev->name : "", + local_addr, + remote_addr, + id_priv->id.ps, + id_priv->state, + id_priv->qp_num); + } + return 0; +} + +static const struct seq_operations cma_rdma_id_seq_ops = { + .start = cma_rdma_id_seq_start, + .next = cma_rdma_id_seq_next, + .stop = cma_rdma_id_seq_stop, + .show = cma_rdma_id_seq_show, +}; + +static int cma_rdma_id_open(struct inode *inode, struct file *file) +{ + struct seq_file *seq; + int ret; + + ret = seq_open(file, &cma_rdma_id_seq_ops); + if (ret) + return ret; + + seq = file->private_data; + seq->private = inode->i_private; + + return 0; +} + +static const struct file_operations cma_rdma_id_fops = { + .owner = THIS_MODULE, + .open = cma_rdma_id_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release +}; + +void cma_create_debug_files(struct cma_device *cma_dev) +{ + char name[IB_DEVICE_NAME_MAX + sizeof "_rdma_id"]; + snprintf(name, sizeof name, "%s_rdma_id", cma_dev->device->name); + cma_dev->rdma_id_dentry = debugfs_create_file(name, S_IFREG | S_IRUGO, + cma_root_dentry, cma_dev, &cma_rdma_id_fops); + if (!cma_dev->rdma_id_dentry) + printk(KERN_WARNING "RDMA CMA: failed to create debugfs file %s\n", name); +} + static void cma_add_one(struct ib_device *device) { struct cma_device *cma_dev; @@ -2871,6 +3021,7 @@ static void cma_add_one(struct ib_device *device) list_for_each_entry(id_priv, &listen_any_list, list) cma_listen_on_dev(id_priv, cma_dev); mutex_unlock(&lock); + cma_create_debug_files(cma_dev); } static int cma_remove_id_dev(struct rdma_id_private *id_priv) @@ -2905,6 +3056,8 @@ static void cma_process_remove(struct cma_device *cma_dev) int ret; mutex_lock(&lock); + if (cma_dev->rdma_id_dentry) + debugfs_remove(cma_dev->rdma_id_dentry); while (!list_empty(&cma_dev->id_list)) { id_priv = list_entry(cma_dev->id_list.next, struct rdma_id_private, list); @@ -2940,6 +3093,7 @@ static void cma_remove_one(struct ib_device *device) mutex_unlock(&lock); 
cma_process_remove(cma_dev); + kfree(cma_dev); } @@ -2947,6 +3101,12 @@ static int cma_init(void) { int ret, low, high, remaining; + cma_root_dentry = debugfs_create_dir("rdma_cm", NULL); + if (!cma_root_dentry) { + printk(KERN_ERR "RDMA CMA: failed to create debugfs dir\n"); + return -ENOMEM; + } + get_random_bytes(&next_port, sizeof next_port); inet_get_local_port_range(&low, &high); remaining = (high - low) + 1; @@ -2984,6 +3144,7 @@ static void cma_cleanup(void) idr_destroy(&tcp_ps); idr_destroy(&udp_ps); idr_destroy(&ipoib_ps); + debugfs_remove(cma_root_dentry); } module_init(cma_init); From nicolas.morey-chaisemartin at ext.bull.net Thu Apr 16 05:09:27 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey-Chaisemartin) Date: Thu, 16 Apr 2009 14:09:27 +0200 Subject: [ofa-general] [PATCH v2] opensm/osm_ucat_ftree.c Enhance min hops counters usage In-Reply-To: <49E70DEF.1000405@dev.mellanox.co.il> References: <49E44DD1.1050508@ext.bull.net> <49E70DEF.1000405@dev.mellanox.co.il> Message-ID: <49E71FF7.20903@ext.bull.net> Le 16/04/2009 12:52, Yevgeny Kliteynik a écrit : > Hi Nicolas, > > Nicolas Morey Chaisemartin wrote: >> This patch enhances the use of the min hop table done in the Fat-Tree >> algorithm. .... 
>> /***************************************************/ >> >> static inline cl_status_t sw_set_hops(IN ftree_sw_t * p_sw, IN >> uint16_t lid, >> - IN uint8_t port_num, IN uint8_t hops) >> + IN uint8_t port_num, IN uint8_t hops, >> + IN boolean_t is_target_sw) >> { >> /* set local min hop table(LID) */ >> - return osm_switch_set_hops(p_sw->p_osm_sw, lid, port_num, hops); >> + p_sw->hops[lid] = hops; >> + if (is_target_sw) >> + return osm_switch_set_hops(p_sw->p_osm_sw, lid, port_num, hops); >> + return 0; >> } >> >> /***************************************************/ >> >> static int set_hops_on_remote_sw(IN ftree_port_group_t * p_group, >> - IN uint16_t target_lid, IN uint8_t hops) >> + IN uint16_t target_lid, IN uint8_t hops, >> + IN boolean_t is_target_sw) >> { >> ftree_port_t *p_port; >> uint8_t i, ports_num; >> ftree_sw_t *p_remote_sw = p_group->remote_hca_or_sw.p_sw; >> >> + /* if lid is a switch, we set the min hop table in the osm_switch >> struct */ >> CL_ASSERT(p_group->remote_node_type == IB_NODE_TYPE_SWITCH); >> + p_remote_sw->hops[target_lid] = hops; >> + >> + /* If taget lid is a switch we set the min hop table values >> + * for each port on the associated osm_sw struct */ > > I could be missing something here, but is the following code correct? > >> + if (!is_target_sw) >> + return 0; >> + >> ports_num = (uint8_t) cl_ptr_vector_get_size(&p_group->ports); >> for (i = 0; i < ports_num; i++) { >> cl_ptr_vector_at(&p_group->ports, i, (void *)&p_port); >> if (sw_set_hops(p_remote_sw, target_lid, >> - p_port->remote_port_num, hops)) >> + p_port->remote_port_num, hops, is_target_sw)) > > sw_set_hops() takes care of the hops table - sets local hop count for > all types of targets and sets hops on osm_switch_t for switches only, > so the "return 0;" above will cause the hops not to be set at all for > HCA targets. > > -- Yevgeny > > > > Actually hops count to HCA is set in the local hop table, not in the OpenSM one. 
As in the local hop table we have only one entry per switch, we need to set it once only which is done here: >> CL_ASSERT(p_group->remote_node_type == IB_NODE_TYPE_SWITCH); >> + p_remote_sw->hops[target_lid] = hops; Not returning 0 would only do the same thing multiple times in the local table and still not set the OpenSM table. So actual behaviour seems OK to me. Nicolas From hal.rosenstock at gmail.com Thu Apr 16 06:04:59 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 16 Apr 2009 09:04:59 -0400 Subject: [ofa-general] ***SPAM*** [PATCH] infiniband-diags/vendstat: Update man page and examples for PortXmit/RcvDataSL counter support In-Reply-To: <49E71C45.8000903@voltaire.com> References: <20090408150444.GA24876@comcast.net> <49E71C45.8000903@voltaire.com> Message-ID: On Thu, Apr 16, 2009 at 7:53 AM, Or Gerlitz wrote: > Hal Rosenstock wrote: >> >> +vendstat -c 0,1 6,12   # configure IS4 port 12 counter groups for >> PortXmitDataSL >> +vendstat -c 2,8 6,12   # configure IS4 port 12 counter groups for >> PortRcvDataSL > > Hal, thanks for working on the patch, the "," sign between the lid (6) and > the port (12) is buggy, Thanks for catching those typos in the man page examples. > please remove it. I'm curious why not patch it rather than this email ? -- Hal > Or. 
From hnrose at comcast.net Thu Apr 16 06:01:45 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Thu, 16 Apr 2009 09:01:45 -0400 Subject: [ofa-general] ***SPAM*** [PATCH] infiniband-diags/man/vendstat.8: Fix PortXmit/RcvDataSL examples Message-ID: <20090416130145.GA31864@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/infiniband-diags/man/vendstat.8 b/infiniband-diags/man/vendstat.8 index a73dcce..803e3b7 100644 --- a/infiniband-diags/man/vendstat.8 +++ b/infiniband-diags/man/vendstat.8 @@ -1,4 +1,4 @@ -.TH VENDSTAT 8 "April 6, 2009" "OpenIB" "OpenIB Diagnostics" +.TH VENDSTAT 8 "April 16, 2009" "OpenIB" "OpenIB Diagnostics" .SH NAME vendstat \- query InfiniBand vendor specific functions @@ -100,9 +100,9 @@ vendstat -w 6 # read IS3 port xmit wait counters .PP vendstat -i 6 12 # read IS4 port 12 counter group info .PP -vendstat -c 0,1 6,12 # configure IS4 port 12 counter groups for PortXmitDataSL +vendstat -c 0,1 6 12 # configure IS4 port 12 counter groups for PortXmitDataSL .PP -vendstat -c 2,8 6,12 # configure IS4 port 12 counter groups for PortRcvDataSL +vendstat -c 2,8 6 12 # configure IS4 port 12 counter groups for PortRcvDataSL .SH AUTHOR .TP From kliteyn at dev.mellanox.co.il Thu Apr 16 06:26:11 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Thu, 16 Apr 2009 16:26:11 +0300 Subject: [ofa-general] [PATCH v2] opensm/osm_ucat_ftree.c Enhance min hops counters usage In-Reply-To: <49E71FF7.20903@ext.bull.net> References: <49E44DD1.1050508@ext.bull.net> <49E70DEF.1000405@dev.mellanox.co.il> <49E71FF7.20903@ext.bull.net> Message-ID: <49E731F3.6000702@dev.mellanox.co.il> Nicolas Morey-Chaisemartin wrote: > Le 16/04/2009 12:52, Yevgeny Kliteynik a écrit : >> Hi Nicolas, >> >> 
Nicolas Morey Chaisemartin wrote: >>> This patch enhances the use of the min hop table done in the Fat-Tree >>> algorithm. > .... >>> /***************************************************/ >>> >>> static inline cl_status_t sw_set_hops(IN ftree_sw_t * p_sw, IN >>> uint16_t lid, >>> - IN uint8_t port_num, IN uint8_t hops) >>> + IN uint8_t port_num, IN uint8_t hops, >>> + IN boolean_t is_target_sw) >>> { >>> /* set local min hop table(LID) */ >>> - return osm_switch_set_hops(p_sw->p_osm_sw, lid, port_num, hops); >>> + p_sw->hops[lid] = hops; >>> + if (is_target_sw) >>> + return osm_switch_set_hops(p_sw->p_osm_sw, lid, port_num, hops); >>> + return 0; >>> } >>> >>> /***************************************************/ >>> >>> static int set_hops_on_remote_sw(IN ftree_port_group_t * p_group, >>> - IN uint16_t target_lid, IN uint8_t hops) >>> + IN uint16_t target_lid, IN uint8_t hops, >>> + IN boolean_t is_target_sw) >>> { >>> ftree_port_t *p_port; >>> uint8_t i, ports_num; >>> ftree_sw_t *p_remote_sw = p_group->remote_hca_or_sw.p_sw; >>> >>> + /* if lid is a switch, we set the min hop table in the osm_switch >>> struct */ >>> CL_ASSERT(p_group->remote_node_type == IB_NODE_TYPE_SWITCH); >>> + p_remote_sw->hops[target_lid] = hops; >>> + >>> + /* If taget lid is a switch we set the min hop table values >>> + * for each port on the associated osm_sw struct */ >> >> I could be missing something here, but is the following code correct? 
>> >>> + if (!is_target_sw) >>> + return 0; >>> + >>> ports_num = (uint8_t) cl_ptr_vector_get_size(&p_group->ports); >>> for (i = 0; i < ports_num; i++) { >>> cl_ptr_vector_at(&p_group->ports, i, (void *)&p_port); >>> if (sw_set_hops(p_remote_sw, target_lid, >>> - p_port->remote_port_num, hops)) >>> + p_port->remote_port_num, hops, is_target_sw)) >> >> sw_set_hops() takes care of the hops table - sets local hop count for >> all types of targets and sets hops on osm_switch_t for switches only, >> so the "return 0;" above will cause the hops not to be set at all for >> HCA targets. >> >> -- Yevgeny >> > > Actually hops count to HCA is set in the local hop table, not in the > OpenSM one. Yes, I understood that. > As in the local hop table we have only one entry per switch, we need to > set it once only which is done here: >>> CL_ASSERT(p_group->remote_node_type == IB_NODE_TYPE_SWITCH); >>> + p_remote_sw->hops[target_lid] = hops; Right, I missed this line. I thought that you just return from this function for HCA targets w/o setting the local hop table at all. Thanks. -- Yevgeny From hal.rosenstock at gmail.com Thu Apr 16 06:34:56 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Thu, 16 Apr 2009 09:34:56 -0400 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Re: [PATCHv2] opensm/PerfMgr: Better redirection support In-Reply-To: <20090416005422.GC10146@sk> References: <20090312202134.GC25024@comcast.net> <20090415125925.GF7353@sk> <20090416005422.GC10146@sk> Message-ID: On Wed, Apr 15, 2009 at 8:54 PM, Sasha Khapyorsky wrote: > On 16:29 Wed 15 Apr, Hal Rosenstock wrote: >> >> >> >> /* Redirection information */ >> >> typedef struct redir { >> >> + boolean_t redirection; >> >> + boolean_t invalid; >> > >> > Why using lid value != 0 is/was bad for redirection invalidation? >> >> b/c there are other fields supplied in the redirection which also >> could be invalid so this one flag summarizes that instead of >> overloading one of the fields. 
> > Yes, and you can use lid value as such flag - just simpler. When GID redirection is specified by client, LID must be 0 so I don't see this. >> >> - p_mon_node->redir_port[port].redir_lid = 0; >> >> - p_mon_node->redir_port[port].redir_qp = 0; >> >> + /* Clear redirection info for this port except orig_lid */ >> >> + orig_lid = p_mon_node->redir_port[port].orig_lid; >> >> + memset(&p_mon_node->redir_port[port], 0, sizeof(redir_t)); >> >> + p_mon_node->redir_port[port].orig_lid = orig_lid; >> > >> > Hmm, why should 'orig_lid' be part of redirection structure and not >> > placed on original node/port (below I see that it is used in >> > non-redirected paths)? >> >> What are you referring to here ? > > Actually I was wrong - I don't see where it is used at all. > The comment about using in non-redirected path was about pkey_ix. > >> > I think it would be better to use structures like: >> > >> > struct node { >> > .... >> > uint16_t lid; >> > uint16_t pkey_ix; >> >> Why would lid and pkey_ix be part of node ? > > You are using this with perfmgr_send_pc_mad() in main flow. > >> It's like this with redir_tbl_size r.t. num_ports and redir_t >> redir_port[1] instead of your struct port {} ports[0]. Why would what >> you propose be better ? > > My point was different - to separate redirection related data from main > flow. I'm still not sure what you mean by this. Encapsulate the redirection data better so it is obtained by some potentially common routine ? >> > PerfMgr is always running over discovered fabric so maybe local port >> > number should be detected later at start of PerfMgr process cycle just >> > using OpenSM DB. >> >> Why is that better than doing this at bind time of PerfMgr ? > > At least two reasons: faster and less code. Are you sure the OpenSM DB accesses will be faster than the vendor calls here ? 
Is bind performance sensitive anyhow ? The performance comment is clearly relevant to the main flow though. >> >> } >> >> @@ -511,6 +556,10 @@ __osm_perfmgr_query_counters(cl_map_item_t * const p_map_item, void *context) >> >> if (!osm_node_get_physp_ptr(node, port)) >> >> continue; >> >> >> >> + if (mon_node->redir_port[port].redirection && >> >> + mon_node->redir_port[port].invalid) >> >> + continue; >> >> + >> > >> > Are two flags really needed? Couldn't this be stripped down? >> >> Just invalid should be sufficient. I'll change this in the next version. >> >> > Also what about letting "chance" for port to refresh redirection info? >> >> What do you mean ? > > When port has invalid redirection data, should you care about attempting > to refresh this? If the PMA gives bad redirection data (which BTW is noncompliant), it seems likely to do this again so I'm not sure about the value of this. Do you think that's a better thing to do ? >> >> + /* call the transport layer for a list of local port pkeys */ >> >> + status = osm_vendor_get_all_port_attr(pm->subn->p_osm->p_vendor, >> >> + attr_array, &num_ports); >> > >> > This heavy stuff is performed per redirection request, but it actually >> > uses same data (local port's pkey table). This looks very inefficient. >> > Instead of doing this you can get local port's pkey table only once at >> > PerfMgr process cycle start and do all checks against this already >> > initialized table. Of course only in case when redirection is enabled at >> > all. >> >> Redirection does not occur frequently. > > How could we know:) It's the current use case for PerfMgt. >> Also, the pkey table could >> change in between > > When OpenSM is in master mode it cannot change (PerfMgr is synchronized > with heavy sweep). 
> > It is possible with standby OpenSM, so what - this single request will > fail once. Some recovery for such failure would be needed. Also, what about not active ? >> and there's no local event support in OpenSM so I >> don't see a way around this other than polling. > > Polling is not needed there. As opposed to being event driven based on change; it's timer driven (e.g. polled) regardless of whether vendor layer or OpenSM DB is queried. >> >> + OSM_LOG(pm->log, OSM_LOG_ERROR, >> >> + "ERR 4C1B: Index for Pkey 0x%x not found\n", >> >> + cl_ntoh16(cpi->redir_pkey)); >> >> + invalid = TRUE; >> >> } >> > >> > All above are not OpenSM errors, but wrong external data. I think it >> > should be logged as VERBOSE messages. >> >> I agree it's wrong external data but it seems serious enough to me to >> treat as an error. > > And some stupid port will be able to put OpenSM in endless error > printing. I don't think it is a good idea. It would be a non compliant PMA which I would think we'd want to know about sooner rather than later. >> If not, at least INFO rather than VERBOSE so >> nothing special needs to be done to see these. > > IMO VERBOSE level is most appropriate for such things, INFO is something > else and it is on by default. Yes, that's why I suggested INFO so nothing special would need to be done to see that there was a misbehaving PMA. I think this log level is the tradeoff in making it easier to debug problems in the field v. spamming the OpenSM log until the node is found/removed/repaired. Anyhow, it's clear to me that you won't accept it at even INFO, so I'll change these to VERBOSE. >> >> - "redirection requested but disabled\n"); >> >> + OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C16: " >> >> + 
"redirection requested but disabled\n"); >> >> + invalid = TRUE; >> >> + } >> > >> > This is not an error. >> >> Seems like some sort of configuration error to me if this is disabled >> at the manager but the PMA wants to use it. > > PMA shouldn't dictate here. PMA does dictate redirection. Manager has no way to shut it off. If manager turns off its handling of redirection, then it just doesn't work (that port is inaccessible by the manager). This argues for the default to be enabled. The current default is disabled since this code was deemed experimental. >> Other local configuration >> errors are treated as errors. > > This is not configuration error - if admin decided to not bother with > redirection support that is fine. Not really. See above comment. >> > BTW, why to bother with verifying redirection info when redirection >> > support is disabled anyway? >> >> I thought it was useful to know the redirection info was invalid >> rather than getting the disabled notification and then enabling and >> finding out. > > For PMAs debug purposes redirection support should be switched "on" > obviously. Why do you say debug purposes ? Isn't it any purpose ? See above. >> It can easily be moved up earlier in the flow if that's >> better. > > Would be better IMO. I'll do that in the next version. -- Hal >> > In general I would suggest to not mix redirection case with main flow >> > (by using better data structures). The Redirection is not something >> > PerfMgr specific and ideally we could have separate redirection handling >> > module. I'm not requesting to do this now (in this implementation), but >> > at least flow separation is very desirable. >> >> I've done some more of this in subsequent patches not yet submitted. >> However, there is no other GS manager (and if it did exist would it >> use redirection ? there are known cases for PerfMgr currently). 
>>
>> In any case, this is "poor man's" redirection support given what is
>> currently available in the OpenFabrics IB stack.
>
> Right. So I'm not requesting "generic redirection module", just to not
> mix the main and redirected flows.
>
> Sasha
>

From hal.rosenstock at gmail.com Thu Apr 16 06:50:46 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 16 Apr 2009 09:50:46 -0400
Subject: ***SPAM*** Re: [ofa-general] Re: [PATCH] Update mad formatting functions.
In-Reply-To: <20090415140341.dd26d8dc.weiny2@llnl.gov>
References: <20090311144404.bf15ba8b.weiny2@llnl.gov> <20090415153003.GA20857@sk> <20090415140341.dd26d8dc.weiny2@llnl.gov>
Message-ID:

On Wed, Apr 15, 2009 at 5:03 PM, Ira Weiny wrote:
> As an aside, not all LID's are decimal.  Should we change this?
>
> from fields.c
> ...
>        {BE_OFFS(256, 16), "DrSmpDLID", mad_dump_hex},
>        {BE_OFFS(272, 16), "DrSmpSLID", mad_dump_hex},
> ...
>        {BITSOFFS(224, 16), "RedirectLID", mad_dump_hex},
>        {BITSOFFS(480, 16), "TrapLID", mad_dump_hex},
> ...
>        {BITSOFFS(320, 16), "PathRecDLid", mad_dump_hex},
>        {BITSOFFS(336, 16), "PathRecSLid", mad_dump_hex},
> ...
>        {BITSOFFS(288, 16), "McastMemMLid", mad_dump_hex},

The agreement was decimal LIDs for unicast and hex LIDs for multicast.
Permissive LID is unicast so is decimal. This was missed in libibmad
though.

-- Hal

From ogerlitz at voltaire.com Thu Apr 16 07:43:23 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 16 Apr 2009 17:43:23 +0300
Subject: [ofa-general] Re: iSer tuning guide?
In-Reply-To:
References:
Message-ID: <49E7440B.7070109@voltaire.com>

Chris Worley wrote:
> The OFED iSer wiki does say "1.4 includes iSer target support"... so
> rather than using the rpm shown above, or the install ofed.conf also
> referred to by the twiki, I configured OFED w/ tgt (somebody w/
> permission should fix the wiki).
> It created two conflicting RPMs:
> scsi-target-utils-0.1-20080828.x86_64.rpm and
> tgt-0.1-20080828.x86_64.rpm ... both had the same issues /w iSer as
> previously reported (one target max).

Yes, as of RH 5.3 iser is included in the stgt package provided with
the distro - I was thinking that the inclusion of stgt in ofed was
wrong from the first place - but my opinion wasn't taken...

> In looking around the web and at other mailing lists, it looks like
> iSer is still in its infancy and there is no reliable IB
> implementation, which would be exemplified by this one-sided
> conversation.

Not that everything is working and in perfect condition - but you have
jumped too early to conclusions... at least some of the people that
can help you were on their Jewish Passover vacation when this thread
started (April 10th) and only today are back in the office - let's get
this debugged through the stgt target and open-iscsi initiator
maintainers and mailing list - starting with stgt - see you there...

Or.

From YJia at tmriusa.com Thu Apr 16 08:12:34 2009
From: YJia at tmriusa.com (Yicheng Jia)
Date: Thu, 16 Apr 2009 10:12:34 -0500
Subject: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
In-Reply-To: <49E6C56F.4000502@morey-chaisemartin.com>
Message-ID:

Hi Nicolas,

After this "reset" command, both ports are DOWN forever, I can only get
portinfo from local port.

I am sure that the port that has been reset is not the local port,
otherwise it will prompt "node type not switch" error.

I tried to enable this switch port from another port and brought it to
POLLING state, but as long as I use "reset", both ports are DOWN.

Thanks!

Yicheng Jia


Nicolas Morey-Chaisemartin
04/16/2009 12:43 AM

To
Yicheng Jia
cc
general at lists.openfabrics.org
Subject
Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch

By any chances have you not reset the port you're on?
Have you tried using another node to enable the port again?

Nicolas

On 16/04/2009 00:45, Yicheng Jia wrote:
>
> Hello Randy,
>
> I am trying to run "ibportstate reset" to reset the switch port on the
> other side in order to get 4x link. However I get the following error:
> ibwarn: [19660] mad_rpc: _do_madrpc failed; dport (Lid 7)
> ibportstate: iberror: failed: smp set portinfo failed
>
> And the port status change to DOWN after this. Have you ever tried to
> run "ibportstate" to reset the switch port?
>
> Thanks!
>
> Yicheng Jia
>
> ------------------------------
>
> Message: 2
> Date: Wed, 4 Mar 2009 18:39:54 -0600
> From: Randy Halverson
> Subject: [ofa-general] link width problem of Qlogic 9024 unmanaged
> switch
> To: "'general at lists.openfabrics.org'"
> Message-ID:
> <88EC963376E93B4DB0F2A69D932F786903D89CAF at MNEXMB2.qlogic.org>
> Content-Type: text/plain; charset="us-ascii"
>
> Hello Yicheng,
>
> After checking internally, this appears to be a known problem with older
> firmware for the 9024FC switches.
>
> It appears that you or another person at 'tmriusa.com' has recently
> opened a case with QLogic Tech Support for this issue. Please continue
> to work with QLogic Tech Support on firmware upgrade resolution since
> you probably don't have our FastFabric Tools to manage the 9024FC switches.
>
> Regards,
>
> Randy
> Technical Support
> QLogic Corporation
> -------------- next part --------------
>
> _____________________________________________________________________________
> Scanned by IBM Email Security Management Services powered by
> MessageLabs. For more information please visit http://www.ers.ibm.com
> _____________________________________________________________________________
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From weiny2 at llnl.gov Thu Apr 16 09:23:29 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 16 Apr 2009 09:23:29 -0700
Subject: [ofa-general] [PATCH] change missed LID conversion functions from hex to uint (WAS: Re: [PATCH] Update mad formatting functions.)
In-Reply-To:
References: <20090311144404.bf15ba8b.weiny2@llnl.gov> <20090415153003.GA20857@sk> <20090415140341.dd26d8dc.weiny2@llnl.gov>
Message-ID: <20090416092329.53b2628a.weiny2@llnl.gov>

On Thu, 16 Apr 2009 09:50:46 -0400
Hal Rosenstock wrote:

> On Wed, Apr 15, 2009 at 5:03 PM, Ira Weiny wrote:
> >
> >
> > As an aside, not all LID's are decimal.  Should we change this?
> >
> > from fields.c
> > ...
> >        {BE_OFFS(256, 16), "DrSmpDLID", mad_dump_hex},
> >        {BE_OFFS(272, 16), "DrSmpSLID", mad_dump_hex},
> > ...
> >        {BITSOFFS(224, 16), "RedirectLID", mad_dump_hex},
> >        {BITSOFFS(480, 16), "TrapLID", mad_dump_hex},
> > ...
> >        {BITSOFFS(320, 16), "PathRecDLid", mad_dump_hex},
> >        {BITSOFFS(336, 16), "PathRecSLid", mad_dump_hex},
> > ...
> >        {BITSOFFS(288, 16), "McastMemMLid", mad_dump_hex},
>
> The agreement was decimal LIDs for unicast and hex LIDs for multicast.
> Permissive LID is unicast so is decimal. This was missed in libibmad
> though.
>

Thanks Hal,

Patch below.

From: Ira Weiny
Date: Thu, 16 Apr 2009 09:15:20 -0700
Subject: [PATCH] change missed LID conversion functions from hex to uint

Signed-off-by: Ira Weiny
---
 libibmad/src/fields.c |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c
index df43ceb..60faf73 100644
--- a/libibmad/src/fields.c
+++ b/libibmad/src/fields.c
@@ -91,8 +91,8 @@ static const ib_field_t ib_mad_f[] = {
 	{192, 64, "MadMkey", mad_dump_hex},
 
 	/* word 9 (32-37 bytes) */
-	{BE_OFFS(256, 16), "DrSmpDLID", mad_dump_hex},
-	{BE_OFFS(272, 16), "DrSmpSLID", mad_dump_hex},
+	{BE_OFFS(256, 16), "DrSmpDLID", mad_dump_uint},
+	{BE_OFFS(272, 16), "DrSmpSLID", mad_dump_uint},
 
 	/* word 10,11 (36-43 bytes) */
 	{288, 64, "SaSMkey", mad_dump_hex},
@@ -301,8 +301,8 @@ static const ib_field_t ib_mad_f[] = {
 	 */
 	{64, 128, "PathRecDGid", mad_dump_array},
 	{192, 128, "PathRecSGid", mad_dump_array},
-	{BITSOFFS(320, 16), "PathRecDLid", mad_dump_hex},
-	{BITSOFFS(336, 16), "PathRecSLid", mad_dump_hex},
+	{BITSOFFS(320, 16), "PathRecDLid", mad_dump_uint},
+	{BITSOFFS(336, 16), "PathRecSLid", mad_dump_uint},
 	{BITSOFFS(393, 7), "PathRecNumPath", mad_dump_uint},
 
 	/*
@@ -388,7 +388,7 @@ static const ib_field_t ib_mad_f[] = {
 	{BITSOFFS(192, 8), "RedirectTC", mad_dump_hex},
 	{BITSOFFS(200, 4), "RedirectSL", mad_dump_uint},
 	{BITSOFFS(204, 20), "RedirectFL", mad_dump_hex},
-	{BITSOFFS(224, 16), "RedirectLID", mad_dump_hex},
+	{BITSOFFS(224, 16), "RedirectLID", mad_dump_uint},
 	{BITSOFFS(240, 16), "RedirectPKey", mad_dump_hex},
 	{BITSOFFS(264, 24), "RedirectQP", mad_dump_hex},
 	{288, 32, "RedirectQKey", mad_dump_hex},
@@ -396,7 +396,7 @@ static const ib_field_t ib_mad_f[] = {
 	{BITSOFFS(448, 8), "TrapTC", mad_dump_hex},
 	{BITSOFFS(456, 4), "TrapSL", mad_dump_uint},
 	{BITSOFFS(460, 20), "TrapFL", mad_dump_hex},
-	{BITSOFFS(480, 16), "TrapLID", mad_dump_hex},
+	{BITSOFFS(480, 16), "TrapLID", mad_dump_uint},
 	{BITSOFFS(496, 16), "TrapPKey", mad_dump_hex},
 	{BITSOFFS(512, 8), "TrapHL", mad_dump_uint},
 	{BITSOFFS(520, 24), "TrapQP", mad_dump_hex},
-- 
1.5.4.5

From jgunthorpe at obsidianresearch.com Thu Apr 16 09:53:11 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Thu, 16 Apr 2009 10:53:11 -0600
Subject: [ofa-general] Re: [PATCH v2] Update mad formatting functions.
In-Reply-To: <20090416001643.21f63cd6.weiny2@llnl.gov>
References: <20090311144404.bf15ba8b.weiny2@llnl.gov> <20090415153003.GA20857@sk> <20090415140341.dd26d8dc.weiny2@llnl.gov> <20090416000934.GB10146@sk> <20090416001643.21f63cd6.weiny2@llnl.gov>
Message-ID: <20090416165310.GJ9167@obsidianresearch.com>

On Thu, Apr 16, 2009 at 12:16:43AM -0700, Ira Weiny wrote:
> Ok, v2 has a couple of changes.
>
> 1) implements the mad_vsnprintf with vsnprintf.
> 2) change formatting char to 'm' since "F" is floating point
> 3) add 'M' for printing the "name" of the field specified.
>
> The reason I did not use vsnprintf before was because of this statement in the
> vsnprintf man page.
>
>        The functions vprintf(), vfprintf(), vsprintf(), vsnprintf() are
>        equivalent to the functions printf(), fprintf(), sprintf(), snprintf(),
>        respectively, except that they are called with a va_list instead of a
>        variable number of arguments. These functions do not call the va_end
>        macro. Consequently, the value of ap is undefined after the call. The
>               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>        application should call va_end(ap) itself afterwards.
>
> I have made a comment in the patch where I am unsure of the call. This seems
> to work just fine on my Linux systems with gcc. Will this work on other
> systems/compilers?

The rules for va's are funny, what the above is saying is that there
is no guarantee what va_arg(ap) will return after vsnprintf.

So to do what you are trying the proper use is something like:

    va_copy(tmpva,args);
    vsnprintf(tmp,256,tf,tmpva);
    va_end(tmpva);
    va_arg(args,??);

Where it is somewhat challenging to compute ??

Relying on vsnprintf to advance args by exactly one is not portable.

Jason

From hal.rosenstock at gmail.com Thu Apr 16 11:53:46 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 16 Apr 2009 14:53:46 -0400
Subject: ***SPAM*** Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
In-Reply-To:
References: <49E6C56F.4000502@morey-chaisemartin.com>
Message-ID:

On Thu, Apr 16, 2009 at 11:12 AM, Yicheng Jia wrote:
>
> Hi Nicolas,
>
> After this "reset" command, both ports are DOWN forever, I can only get
> portinfo from local port.
>
> I am sure that the port that has been reset is not the local port, otherwise
> it will prompt "node type not switch" error.
>
> I tried to enable this switch port from another port and brought it to
> POLLING state, but as long as I use "reset", both ports are DOWN.

What are the peer port's LinkDownDefaultStates ? Sounds like one or
more must be Sleeping rather than Polling for some reason.

-- Hal

> Thanks!
>
> Yicheng Jia
>
>
> Nicolas Morey-Chaisemartin
>
> 04/16/2009 12:43 AM
>
> To
> Yicheng Jia
> cc
> general at lists.openfabrics.org
> Subject
> Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
>
>
> By any chances have you not reset the port you're on?
> Have you tried using another node to enable the port again?
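
The va_copy point raised in the "Update mad formatting functions" thread above can be sketched as a small standalone C example. This is only an illustration under stated assumptions: format_alloc() and format_alloc_v() are hypothetical helpers, not libibmad functions, and the sketch sidesteps the hard part noted in the thread (advancing the caller's list with va_arg) by never reusing a va_list once vsnprintf() has consumed it.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of libibmad): format into a freshly
 * allocated string. The argument list has to be walked twice, so each
 * vsnprintf() call works on its own copy made with va_copy().
 */
static char *format_alloc_v(const char *fmt, va_list args)
{
	va_list tmp;
	int len;
	char *buf;

	assert(fmt != NULL);

	/* First pass: measure only. 'args' itself is left untouched. */
	va_copy(tmp, args);
	len = vsnprintf(NULL, 0, fmt, tmp);
	va_end(tmp);
	if (len < 0)
		return NULL;

	buf = malloc((size_t)len + 1);
	if (buf == NULL)
		return NULL;

	/* Second pass: format for real, again on a fresh copy. */
	va_copy(tmp, args);
	vsnprintf(buf, (size_t)len + 1, fmt, tmp);
	va_end(tmp);

	return buf;
}

/* Variadic front end; the caller frees the returned string. */
static char *format_alloc(const char *fmt, ...)
{
	va_list args;
	char *s;

	va_start(args, fmt);
	s = format_alloc_v(fmt, args);
	va_end(args);
	return s;
}
```

A call such as format_alloc("lid %u pkey 0x%x", 7u, 0xffffu) returns a heap string that the caller must free(). Every va_copy() is paired with its own va_end(), and nothing reads args after a vsnprintf() call has used it.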

From YJia at tmriusa.com Thu Apr 16 12:18:57 2009
From: YJia at tmriusa.com (Yicheng Jia)
Date: Thu, 16 Apr 2009 14:18:57 -0500
Subject: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
In-Reply-To:
Message-ID:

They both are POLLING before "reset".

Thanks!
Yicheng Jia

From hal.rosenstock at gmail.com Thu Apr 16 12:20:58 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 16 Apr 2009 15:20:58 -0400
Subject: ***SPAM*** Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
In-Reply-To:
References:
Message-ID:

On Thu, Apr 16, 2009 at 3:18 PM, Yicheng Jia wrote:
>
> They both are POLLING before "reset".

Then they _should_ come back to INIT.

What does the local LDDS value say after reset ? Any way to get the
switch port LDDS value ?

-- Hal

From hal.rosenstock at gmail.com Thu Apr 16 12:29:33 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 16 Apr 2009 15:29:33 -0400
Subject: ***SPAM*** Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
In-Reply-To:
References:
Message-ID:

On Thu, Apr 16, 2009 at 3:20 PM, Hal Rosenstock wrote:
> On Thu, Apr 16, 2009 at 3:18 PM, Yicheng Jia wrote:
>>
>> They both are POLLING before "reset".
>
> Then they _should_ come back to INIT.
>
> What does the local LDDS value say after reset ? Any way to get the
> switch port LDDS value ?

Are you resetting the switch from the peer HCA port or some other port
? That's what Nicolas asked but I might have missed the answer.

Also, try disable (wait) and then enable and see if that works. If I
recall correctly, you had those links which are taking a long time to
initialize. If the link stays down forever after disable, this won't
work but I want to be sure.

-- Hal
>> For more information please visit http://www.ers.ibm.com
>> _____________________________________________________________________________
>>
>

From YJia at tmriusa.com Thu Apr 16 12:35:28 2009
From: YJia at tmriusa.com (Yicheng Jia)
Date: Thu, 16 Apr 2009 14:35:28 -0500
Subject: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
In-Reply-To:
Message-ID:

> Then they _should_ come back to INIT.

No, they don't.

> What does the local LDDS value say after reset ? Any way to get the
> switch port LDDS value ?

How can I check the LDDS value?

Yicheng Jia

Hal Rosenstock
04/16/2009 02:21 PM

To
Yicheng Jia
cc
Nicolas Morey-Chaisemartin , general at lists.openfabrics.org
Subject
Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch

On Thu, Apr 16, 2009 at 3:18 PM, Yicheng Jia wrote:
>
> They both are POLLING before "reset".

Then they _should_ come back to INIT.

What does the local LDDS value say after reset ? Any way to get the
switch port LDDS value ?

-- Hal

> Thanks!
> Yicheng Jia
>
> Hal Rosenstock
>
> 04/16/2009 01:53 PM
>
> To
> Yicheng Jia
> cc
> Nicolas Morey-Chaisemartin ,
> general at lists.openfabrics.org
> Subject
> Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
>
> On Thu, Apr 16, 2009 at 11:12 AM, Yicheng Jia wrote:
>>
>> Hi Nicolas,
>>
>> After this "reset" command, both ports are DOWN forever, I can only get
>> portinfo from local port.
>>
>> I am sure that the port that has been reset is not the local port,
>> otherwise
>> it will prompt "node type not switch" error.
>>
>> I tried to enable this switch port from another port and brought it to
>> POLLING state, but as long as I use "reset", both ports are DOWN.
>
> What are the peer port's LinkDownDefaultStates ? Sounds like one or
> more must be Sleeping rather than Polling for some reason.
>
> -- Hal
>
>> Thanks!
From hal.rosenstock at gmail.com Thu Apr 16 12:37:42 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 16 Apr 2009 15:37:42 -0400
Subject: ***SPAM*** Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
In-Reply-To:
References:
Message-ID:

On Thu, Apr 16, 2009 at 3:35 PM, Yicheng Jia wrote:
>
>> Then they _should_ come back to INIT.
>
> No, they don't.
>
>> What does the local LDDS value say after reset ? Any way to get the
>> switch port LDDS value ?
>
> How can I check the LDDS value?
                     ^^^^^^
              LinkDownDefaultState

Thought you said they were both POLLING prior to reset.

-- Hal

> Yicheng Jia
>
> Hal Rosenstock
>
> 04/16/2009 02:21 PM
>
> To
> Yicheng Jia
> cc
> Nicolas Morey-Chaisemartin ,
> general at lists.openfabrics.org
> Subject
> Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
>
> On Thu, Apr 16, 2009 at 3:18 PM, Yicheng Jia wrote:
>>
>> They both are POLLING before "reset".
>
> Then they _should_ come back to INIT.
>
> What does the local LDDS value say after reset ? Any way to get the
> switch port LDDS value ?
>
> -- Hal
>
>> Thanks!
>> Yicheng Jia
>>
>> Hal Rosenstock
>>
>> 04/16/2009 01:53 PM
>>
>> To
>> Yicheng Jia
>> cc
>> Nicolas Morey-Chaisemartin ,
>> general at lists.openfabrics.org
>> Subject
>> Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
>>
>> On Thu, Apr 16, 2009 at 11:12 AM, Yicheng Jia wrote:
>>>
>>> Hi Nicolas,
>>>
>>> After this "reset" command, both ports are DOWN forever, I can only get
>>> portinfo from local port.
>>>
>>> I am sure that the port that has been reset is not the local port,
>>> otherwise
>>> it will prompt "node type not switch" error.
>>>
>>> I tried to enable this switch port from another port and brought it to
>>> POLLING state, but as long as I use "reset", both ports are DOWN.
>>
>> What are the peer port's LinkDownDefaultStates ?
From YJia at tmriusa.com Thu Apr 16 12:47:33 2009
From: YJia at tmriusa.com (Yicheng Jia)
Date: Thu, 16 Apr 2009 14:47:33 -0500
Subject: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
In-Reply-To:
Message-ID:

> Are you resetting the switch from the peer HCA port or some other port
> ? That's what Nicolas asked but I might have missed the answer.

Yes, I am trying to reset from the peer HCA port. Is anything wrong with this?

> Also, try disable (wait) and then enable and see if that works.

It remains the same: the switch port is DOWN forever. No SMP message can get to the switch port.

> If I recall correctly, you had those links which are taking a long time to
> initialize. If the link stays down forever after disable, this won't
> work but I want to be sure.

This is a separate issue. The "reset" command is tested on a single-port HCA directly connected to the Qlogic switch. The HCA is plugged into a Linux machine. It is the simplest test environment.

Thanks!
Yicheng Jia

Hal Rosenstock
04/16/2009 02:29 PM

To
Yicheng Jia
cc
Nicolas Morey-Chaisemartin , general at lists.openfabrics.org
Subject
Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch

On Thu, Apr 16, 2009 at 3:20 PM, Hal Rosenstock wrote:
> On Thu, Apr 16, 2009 at 3:18 PM, Yicheng Jia wrote:
>>
>> They both are POLLING before "reset".
>
> Then they _should_ come back to INIT.
>
> What does the local LDDS value say after reset ? Any way to get the
> switch port LDDS value ?

Are you resetting the switch from the peer HCA port or some other port ?
That's what Nicolas asked but I might have missed the answer.

Also, try disable (wait) and then enable and see if that works. If I
recall correctly, you had those links which are taking a long time to
initialize. If the link stays down forever after disable, this won't
work but I want to be sure.

-- Hal

> -- Hal
>
>> Thanks!
From YJia at tmriusa.com Thu Apr 16 12:49:39 2009
From: YJia at tmriusa.com (Yicheng Jia)
Date: Thu, 16 Apr 2009 14:49:39 -0500
Subject: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
In-Reply-To:
Message-ID:

> Thought you said they were both POLLING prior to reset.

Yes, they are. I don't know what "LDDS" stands for in the first place:)

Thanks!
Yicheng Jia

Hal Rosenstock
04/16/2009 02:37 PM

To
Yicheng Jia
cc
Nicolas Morey-Chaisemartin , general at lists.openfabrics.org
Subject
Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch

On Thu, Apr 16, 2009 at 3:35 PM, Yicheng Jia wrote:
>
>> Then they _should_ come back to INIT.
>
> No, they don't.
>
>> What does the local LDDS value say after reset ? Any way to get the
>> switch port LDDS value ?
>
> How can I check the LDDS value?
                     ^^^^^^
              LinkDownDefaultState

Thought you said they were both POLLING prior to reset.
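[Editor's note] The exchange above hinges on two different PortInfo fields that can both read "Polling": PortPhysicalState (what the physical link is doing now) and LinkDownDefaultState, abbreviated LDDS (the state the port falls back to once the link goes down). The sketch below only decodes the numeric values of those fields so the distinction is visible. The encodings are the ones I understand the InfiniBand PortInfo attribute to use, and the function names are hypothetical, so please check both against the specification. I believe "smpquery portinfo <lid> <port>" from infiniband-diags prints these fields, but treat that as an assumption as well.

```python
"""Minimal sketch: decode the PortInfo state fields discussed in this thread.

Assumption: the numeric encodings follow the InfiniBand PortInfo attribute
as I understand it. Verify against the specification before relying on them.
"""

# PortState: the logical link state (Down / Initialize / Armed / Active).
PORT_STATE = {
    0: "NoStateChange",
    1: "Down",
    2: "Initialize",
    3: "Armed",
    4: "Active",
}

# PortPhysicalState: what the physical link is doing right now.
PORT_PHYSICAL_STATE = {
    0: "NoStateChange",
    1: "Sleep",
    2: "Polling",
    3: "Disabled",
    4: "PortConfigurationTraining",
    5: "LinkUp",
    6: "LinkErrorRecovery",
    7: "PhyTest",
}

# LinkDownDefaultState (LDDS): where the physical link goes after it drops.
LINK_DOWN_DEFAULT_STATE = {
    0: "NoStateChange",
    1: "Sleep",
    2: "Polling",
}


def decode_port_info(port_state: int, phys_state: int, ldds: int) -> dict:
    """Map the three raw field values to readable names (hypothetical helper)."""
    return {
        "PortState": PORT_STATE.get(port_state, "Reserved"),
        "PortPhysicalState": PORT_PHYSICAL_STATE.get(phys_state, "Reserved"),
        "LinkDownDefaultState": LINK_DOWN_DEFAULT_STATE.get(ldds, "Reserved"),
    }


def falls_back_to_sleep(ldds: int) -> bool:
    """True when the port is configured to drop to Sleep instead of Polling."""
    return LINK_DOWN_DEFAULT_STATE.get(ldds) == "Sleep"


if __name__ == "__main__":
    # A port can be physically Polling now while its LDDS is Sleep.
    print(decode_port_info(port_state=1, phys_state=2, ldds=1))
    print(falls_back_to_sleep(1))
```

The point of keeping the three tables separate is that a port reported as "POLLING" in PortPhysicalState says nothing about its LDDS; that second field is the one being asked about in the reply that follows.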
From hal.rosenstock at gmail.com Thu Apr 16 13:22:14 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 16 Apr 2009 16:22:14 -0400
Subject: ***SPAM*** Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
In-Reply-To:
References:
Message-ID:

On Thu, Apr 16, 2009 at 3:49 PM, Yicheng Jia wrote:
>
>> Thought you said they were both POLLING prior to reset.
>
> Yes, they are. I don't know what "LDDS" stands for in the first place:)

Then what is in POLLING? Are you referring to PortPhysicalState and
not LinkDownDefaultState (LDDS)? I use LDDS because I tire of typing the
whole thing. It is a component (field) in PortInfo.

-- Hal

> Thanks!
> Yicheng Jia
>
> Hal Rosenstock
> 04/16/2009 02:37 PM
>
> To
> Yicheng Jia
> cc
> Nicolas Morey-Chaisemartin,
> general at lists.openfabrics.org
> Subject
> Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
>
> On Thu, Apr 16, 2009 at 3:35 PM, Yicheng Jia wrote:
>>
>>> Then they _should_ come back to INIT.
>>
>> No, they don't.
>>
>>> What does the local LDDS value say after reset ? Any way to get the
>>> switch port LDDS value ?
>>
>> How can I check the LDDS value?
>                      ^^^^^^
>               LinkDownDefaultState
>
> Thought you said they were both POLLING prior to reset.
>
> -- Hal
>
>> Yicheng Jia
>>
>> Hal Rosenstock
>> 04/16/2009 02:21 PM
>>
>> To
>> Yicheng Jia
>> cc
>> Nicolas Morey-Chaisemartin,
>> general at lists.openfabrics.org
>> Subject
>> Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
>>
>> On Thu, Apr 16, 2009 at 3:18 PM, Yicheng Jia wrote:
>>>
>>> They both are POLLING before "reset".
>>
>> Then they _should_ come back to INIT.
>>
>> What does the local LDDS value say after reset ? Any way to get the
>> switch port LDDS value ?
>>
>> -- Hal
>>
>>> Thanks!
>>> Yicheng Jia
>>>
>>> Hal Rosenstock
>>> 04/16/2009 01:53 PM
>>>
>>> To
>>> Yicheng Jia
>>> cc
>>> Nicolas Morey-Chaisemartin,
>>> general at lists.openfabrics.org
>>> Subject
>>> Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
>>>
>>> On Thu, Apr 16, 2009 at 11:12 AM, Yicheng Jia wrote:
>>>>
>>>> Hi Nicolas,
>>>>
>>>> After this "reset" command, both ports are DOWN forever, I can only get
>>>> portinfo from local port.
>>>>
>>>> I am sure that the port that has been reset is not the local port,
>>>> otherwise it will prompt "node type not switch" error.
>>>>
>>>> I tried to enable this switch port from another port and brought it to
>>>> POLLING state, but as long as I use "reset", both ports are DOWN.
>>>
>>> What are the peer port's LinkDownDefaultStates ? Sounds like one or
>>> more must be Sleeping rather than Polling for some reason.
>>>
>>> -- Hal
>>>
>>>> Thanks!
>>>>
>>>> Yicheng Jia
>>>>
>>>> Nicolas Morey-Chaisemartin
>>>> 04/16/2009 12:43 AM
>>>>
>>>> To
>>>> Yicheng Jia
>>>> cc
>>>> general at lists.openfabrics.org
>>>> Subject
>>>> Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
>>>>
>>>> By any chances have you not reset the port you're on?
>>>> Have you tried using another node to enable the port again?
>>>>
>>>> Nicolas
>>>>
>>>> On 16/04/2009 00:45, Yicheng Jia wrote:
>>>>>
>>>>> Hello Randy,
>>>>>
>>>>> I am trying to run "ibportstate reset" to reset the switch port on the
>>>>> other side in order to get 4x link. However I get the following error:
>>>>> ibwarn: [19660] mad_rpc: _do_madrpc failed; dport (Lid 7)
>>>>> ibportstate: iberror: failed: smp set portinfo failed
>>>>>
>>>>> And the port status change to DOWN after this. Have you ever tried to
>>>>> run "ibportstate" to reset the switch port?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Yicheng Jia
>>>>>
>>>>> ------------------------------
>>>>>
>>>>> Message: 2
>>>>> Date: Wed, 4 Mar 2009 18:39:54 -0600
>>>>> From: Randy Halverson
>>>>> Subject: [ofa-general] link width problem of Qlogic 9024 unmanaged
>>>>> switch
>>>>> To: "'general at lists.openfabrics.org'"
>>>>> Message-ID:
>>>>> <88EC963376E93B4DB0F2A69D932F786903D89CAF at MNEXMB2.qlogic.org>
>>>>> Content-Type: text/plain; charset="us-ascii"
>>>>>
>>>>> Hello Yicheng,
>>>>>
>>>>> After checking internally, this appears to be a known problem with
>>>>> older firmware for the 9024FC switches.
>>>>>
>>>>> It appears that you or another person at 'tmriusa.com' has recently
>>>>> opened a case with QLogic Tech Support for this issue. Please continue
>>>>> to work with QLogic Tech Support on firmware upgrade resolution since
>>>>> you probably don't have our FastFabric Tools to manage the 9024FC
>>>>> switches.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Randy
>>>>> Technical Support
>>>>> QLogic Corporation
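To make the PortPhysicalState versus LinkDownDefaultState distinction from this exchange concrete, here is a minimal Python sketch that pulls the three easily confused values out of a PortInfo-style text dump. The field names (LinkState, PhysLinkState, LinkDownDefState), the dotted "Name:....value" layout, and the sample values are assumptions for illustration only. They are not output captured from the switch or HCA in this thread, and real tool output may differ.

```python
import re

# Hypothetical sample of a PortInfo dump. Field names, layout, and values
# are assumptions for illustration, not output captured in this thread.
SAMPLE_PORTINFO = """\
LinkState:.......................Down
PhysLinkState:...................Polling
LinkDownDefState:................Polling
"""


def parse_portinfo(text):
    """Parse 'Name:....value' lines into a dict (layout assumed, see above)."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([A-Za-z0-9_]+):\.*\s*(.*?)\s*$", line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def describe_port(fields):
    """Return the three values that are easy to mix up in this discussion."""
    return {
        "logical_state": fields.get("LinkState"),
        "physical_state": fields.get("PhysLinkState"),
        "link_down_default": fields.get("LinkDownDefState"),
    }


if __name__ == "__main__":
    print(describe_port(parse_portinfo(SAMPLE_PORTINFO)))
```

Keeping the three values under separate keys makes it harder to report "POLLING" without saying which PortInfo component is meant.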
From hal.rosenstock at gmail.com Thu Apr 16 13:26:54 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 16 Apr 2009 16:26:54 -0400
Subject: ***SPAM*** Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
In-Reply-To:
References:
Message-ID:

On Thu, Apr 16, 2009 at 3:47 PM, Yicheng Jia wrote:
>
>> Are you resetting the switch from the peer HCA port or some other port
>> ? That's what Nicolas asked but I might have missed the answer.
>
> Yes, I am trying to reset from the peer HCA port. Is anything wrong with
> this?

There's a race condition here that I was asking about. If the link
initialization takes too long and doesn't complete (get to INIT)
prior to the enable being sent to the switch, then you could
see these results; but since it's DOWN until reboot, it's something
different.

>> Also, try disable (wait) and then enable and see if that works.
>
> It remains the same, the switch port is DOWN forever. No SMP message could
> get to the switch port.

Right; in DOWN, the SMP can't be sent.

>> If I recall correctly, you had those links which are taking a long time to
>> initialize. If the link stays down forever after disable, this won't
>> work but I want to be sure.
>
> This is a separate issue.

Since the link stays down, yes. If the disable/wait/enable had worked, that
would've been another story.

-- Hal

> The "reset" command is tested on a single port HCA
> directly connected with Qlogic switch. The HCA is plugged into a Linux
> machine. It is the simplest test environment.
>
> Thanks!
> Yicheng Jia
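The disable, wait, enable sequence suggested above can be sketched in Python as follows. This is only an illustration under stated assumptions: it assumes the infiniband-diags ibportstate tool is installed and invoked as "ibportstate <lid> <port> <op>", the helper names run_ibportstate and bounce_port are hypothetical, and the wait length is a guess because the thread gives no number. As noted in the discussion, once the port is DOWN an SMP cannot reach it over that link, so the enable step only makes sense when issued over a different path than the link being bounced.

```python
import subprocess
import time


def run_ibportstate(lid, port, op):
    """Run `ibportstate <lid> <port> <op>` and return its stdout.

    Assumes the infiniband-diags tool is installed and the caller is
    allowed to send SMPs; neither is checked here.
    """
    result = subprocess.run(
        ["ibportstate", str(lid), str(port), op],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def bounce_port(lid, port, run=run_ibportstate, wait_s=5.0, sleep=time.sleep):
    """Disable the port, wait, then enable it; return the ops issued in order.

    `run` and `sleep` are injectable so the sequence can be exercised
    without hardware. The default wait is a guess, not a recommendation.
    """
    issued = []
    for op in ("disable", "enable"):
        run(lid, port, op)
        issued.append(op)
        if op == "disable":
            sleep(wait_s)  # give the link time to settle before enabling
    return issued
```

Passing a stand-in for run (for example one that only records its arguments) lets the ordering be checked on a machine with no InfiniBand hardware.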
From YJia at tmriusa.com Thu Apr 16 14:56:05 2009
From: YJia at tmriusa.com (Yicheng Jia)
Date: Thu, 16 Apr 2009 16:56:05 -0500
Subject: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
In-Reply-To:
Message-ID:

> Then what is in POLLING ?
> Are you referring to PortPhysicalState and
> not LinkDownDefaultState (LDDS).

The LinkDownDefaultState is POLLING on both sides of the link before the
"reset". After the "reset" is performed, the PhysLinkState on the local port
is POLLING and the port state is DOWN.

Thanks!
Yicheng Jia
From YJia at tmriusa.com Thu Apr 16 15:06:00 2009
From: YJia at tmriusa.com (Yicheng Jia)
Date: Thu, 16 Apr 2009 17:06:00 -0500
Subject: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
In-Reply-To:
Message-ID:

> There's a race condition here that I was asking about. If the link
> initialization takes too long and doesn't complete (get to INIT)
> prior to the enable being sent to the switch, then you could
> see these results; but since it's DOWN until reboot, it's something
> different.

I did the "reset" when the ports on both sides of the link were in INIT state
and LinkUp phys state.

> If the disable/wait/enable had worked, that would've been another story.

It fails too. Both ports go to DOWN after disable is issued and never come
back. How long am I supposed to wait?

Thanks!
Yicheng Jia
>>>>>
>>>>> Yicheng Jia
>>>>>
>>>>> ------------------------------
>>>>>
>>>>> Message: 2
>>>>> Date: Wed, 4 Mar 2009 18:39:54 -0600
>>>>> From: Randy Halverson
>>>>> Subject: [ofa-general] link width problem of Qlogic 9024 unmanaged
>>>>> switch
>>>>> To: "'general at lists.openfabrics.org'"
>>>>> Message-ID:
>>>>> <88EC963376E93B4DB0F2A69D932F786903D89CAF at MNEXMB2.qlogic.org>
>>>>> Content-Type: text/plain; charset="us-ascii"
>>>>>
>>>>> Hello Yicheng,
>>>>>
>>>>> After checking internally, this appears to be a known problem with
>>>>> older firmware for the 9024FC switches.
>>>>>
>>>>> It appears that you or another person at 'tmriusa.com' has recently
>>>>> opened a case with QLogic Tech Support for this issue. Please continue
>>>>> to work with QLogic Tech Support on firmware upgrade resolution since
>>>>> you probably don't have our FastFabric Tools to manage the 9024FC
>>>>> switches.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Randy
>>>>> Technical Support
>>>>> QLogic Corporation
>>>>> -------------- next part --------------
>>>>>
>>>>> _____________________________________________________________________________
>>>>> Scanned by IBM Email Security Management Services powered by
>>>>> MessageLabs. For more information please visit http://www.ers.ibm.com
>>>>> _____________________________________________________________________________
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>> _______________________________________________
>>>>> general mailing list
>>>>> general at lists.openfabrics.org
>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>>>
>>>>> To unsubscribe, please visit
>>>>> http://openib.org/mailman/listinfo/openib-general

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From hal.rosenstock at gmail.com Thu Apr 16 16:12:00 2009
From: hal.rosenstock at gmail.com (Hal Rosenstock)
Date: Thu, 16 Apr 2009 19:12:00 -0400
Subject: ***SPAM*** Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
In-Reply-To:
References:
Message-ID:

On Thu, Apr 16, 2009 at 6:06 PM, Yicheng Jia wrote:
>
>> There's a race condition here that I was asking about. If the link
>> initialization takes too long and doesn't complete (gets to init)
>> prior to the enable trying to be sent to the switch, then you could
>> see these results but since it's DOWN until reboot it's something
>> different.
>
> I did the "reset" when ports on both sides of the link are in INIT state and
> LinkUp phys state.
>
>> If the disable/wait/enable worked that would've been another story.
>
> It fails too. Both ports go to DOWN after disable is issued and never come
> back. How long am I supposed to wait?

Ideally you would see init before doing the enable but sounds like
that's not occurring. Either you need low level debug to see why the
link does not initialize at that point or get support from your
CA/switch vendor(s). What's your CA device ?

-- Hal

> Thanks!
> Yicheng Jia
>
> _____________________________________________________________________________
> Scanned by IBM Email Security Management Services powered by MessageLabs.
> For more information please visit http://www.ers.ibm.com
> _____________________________________________________________________________
>

From weiny2 at llnl.gov Thu Apr 16 17:53:03 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 16 Apr 2009 17:53:03 -0700
Subject: ***SPAM*** Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
In-Reply-To:
References:
Message-ID: <20090416175303.2942bf40.weiny2@llnl.gov>

Yicheng,

I am hoping your system is small when I ask; could you send the output from:

"iblinkinfo.pl -R"

When everything is up and running. Also ibstat from the node you are
attempting the reset on. As well as the reset command you are using?

Thanks,
Ira

On Thu, 16 Apr 2009 19:12:00 -0400
Hal Rosenstock wrote:

> On Thu, Apr 16, 2009 at 6:06 PM, Yicheng Jia wrote:
> >
> >> There's a race condition here that I was asking about. If the link
> >> initialization takes too long and doesn't complete (gets to init)
> >> prior to the enable trying to be sent to the switch, then you could
> >> see these results but since it's DOWN until reboot it's something
> >> different.
> >
> > I did the "reset" when ports on both sides of the link are in INIT state and
> > LinkUp phys state.
> >
> >> If the disable/wait/enable worked that would've been another story.
> >
> > It fails too. Both ports go to DOWN after disable is issued and never come
> > back. How long am I supposed to wait?
>
> Ideally you would see init before doing the enable but sounds like
> that's not occurring. Either you need low level debug to see why the
> link does not initialize at that point or get support from your
> CA/switch vendor(s). What's your CA device ?
>
> -- Hal

--
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
weiny2 at llnl.gov

From YJia at tmriusa.com Thu Apr 16 21:33:05 2009
From: YJia at tmriusa.com (Yicheng Jia)
Date: Thu, 16 Apr 2009 23:33:05 -0500
Subject: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
In-Reply-To:
Message-ID:

The HCA is Mellanox MHES18-XSC.

Thanks!
Yicheng Jia Hal Rosenstock 04/16/2009 06:12 PM To Yicheng Jia cc Nicolas Morey-Chaisemartin , general at lists.openfabrics.org Subject Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch On Thu, Apr 16, 2009 at 6:06 PM, Yicheng Jia wrote: > >> There's a race condition here that I was asking about. If the link >> initialization takes too long and doesn't complete (gets to init) >> prior to the enable trying to be sent to the switch, then you could >> see these results but since it's DOWN until reboot it's something >> different. > > I did the "reset" when ports on both side of the link are in INIT state and > LinkUp phys state. > >> If the disable/wait/enable worked that would've been another story. > > It fails too. Both ports go to DOWN after disable is issued and never come > back. How long am I supposed to wait? Ideally you would see init before doing the enable but sounds like that's not occuring. Either you need low level debug to see why the link does not initialize at that point or get support from your CA/switch vendor(s). What's your CA device ? -- Hal > Thanks! > Yicheng Jia > > > > > Hal Rosenstock > > 04/16/2009 03:26 PM > > To > Yicheng Jia > cc > Nicolas Morey-Chaisemartin , > general at lists.openfabrics.org > Subject > Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch > > > > > On Thu, Apr 16, 2009 at 3:47 PM, Yicheng Jia wrote: >> >>> Are you resetting the switch from the peer HCA port or some other port >>> ? That's what Nicolas asked but I might have missed the answer. >> >> Yes, I am trying to reset from the peer HCA port. Is anything wrong with >> this? > > There's a race condition here that I was asking about. If the link > initialization takes too long and doesn't complete (gets to init) > prior to the enable trying to be sent to the switch, then you could > see these results but since it's DOWN until reboot it's something > different. > >>> Also, try disable (wait) and then enable and see if that works. 
>> >> It remains the same, the switch port is DOWN forever. No SMP massage could >> get to the switch port. > > Right; in down, the SMP can't be sent. > >>> If I recall correctly, you had those links which are taking a long time >>> to >>> initialize. If the link stays down forever after disable, this won't >>> work but I want to be sure. >> >> This is seperate issue. > > Since the link stays down yes. If the disable/wait/enable worked that > would've been another story. > > -- Hal > >> The "reset" command is tested on a single port HCA >> directly connected with Qlogic siwth. The HCA is plugged into a Linux >> machine. It is the simplest test environment. >> >> Thanks! >> Yicheng Jia >> >> >> >> >> Hal Rosenstock >> >> 04/16/2009 02:29 PM >> >> To >> Yicheng Jia >> cc >> Nicolas Morey-Chaisemartin , >> general at lists.openfabrics.org >> Subject >> Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch >> >> >> >> >> On Thu, Apr 16, 2009 at 3:20 PM, Hal Rosenstock >> wrote: >>> On Thu, Apr 16, 2009 at 3:18 PM, Yicheng Jia wrote: >>>> >>>> They both are POLLING before "reset". >>> >>> Then they _should_ come back to INIT. >>> >>> What does the local LDDS value say after reset ? Any way to get the >>> switch port LDDS value ? >> >> Are you resetting the switch from the peer HCA port or some other port >> ? That's what Nicolas asked but I might have missed the answer. >> >> Also, try disable (wait) and then enable and see if that works. If I >> recall correctly, you had those links which are taking a long time to >> initialize. If the link stays down forever after disable, this won't >> work but I want to be sure. >> >> -- Hal >> >>> -- Hal >>> >>>> Thanks! 
>>>> Yicheng Jia
>>>>
>>>>
>>>> Hal Rosenstock
>>>>
>>>> 04/16/2009 01:53 PM
>>>>
>>>> To
>>>> Yicheng Jia
>>>> cc
>>>> Nicolas Morey-Chaisemartin,
>>>> general at lists.openfabrics.org
>>>> Subject
>>>> Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
>>>>
>>>>
>>>> On Thu, Apr 16, 2009 at 11:12 AM, Yicheng Jia wrote:
>>>>>
>>>>> Hi Nicolas,
>>>>>
>>>>> After this "reset" command, both ports are DOWN forever, I can only get
>>>>> portinfo from local port.
>>>>>
>>>>> I am sure that the port that has been reset is not the local port,
>>>>> otherwise it will prompt "node type not switch" error.
>>>>>
>>>>> I tried to enable this switch port from another port and brought it to
>>>>> POLLING state, but as long as I use "reset", both ports are DOWN.
>>>>
>>>> What are the peer port's LinkDownDefaultStates ? Sounds like one or
>>>> more must be Sleeping rather than Polling for some reason.
>>>>
>>>> -- Hal
>>>>
>>>>> Thanks!
>>>>>
>>>>> Yicheng Jia
>>>>>
>>>>>
>>>>> Nicolas Morey-Chaisemartin
>>>>>
>>>>> 04/16/2009 12:43 AM
>>>>>
>>>>> To
>>>>> Yicheng Jia
>>>>> cc
>>>>> general at lists.openfabrics.org
>>>>> Subject
>>>>> Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
>>>>>
>>>>>
>>>>> By any chance, did you reset the port you're on?
>>>>> Have you tried using another node to enable the port again?
>>>>>
>>>>> Nicolas
>>>>>
>>>>> On 16/04/2009 00:45, Yicheng Jia wrote:
>>>>>>
>>>>>> Hello Randy,
>>>>>>
>>>>>> I am trying to run "ibportstate reset" to reset the switch port on the
>>>>>> other side in order to get 4x link. However I get the following error:
>>>>>> ibwarn: [19660] mad_rpc: _do_madrpc failed; dport (Lid 7)
>>>>>> ibportstate: iberror: failed: smp set portinfo failed
>>>>>>
>>>>>> And the port status changes to DOWN after this. Have you ever tried to
>>>>>> run "ibportstate" to reset the switch port?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Yicheng Jia
>>>>>>
>>>>>>
>>>>>> ------------------------------
>>>>>>
>>>>>> Message: 2
>>>>>> Date: Wed, 4 Mar 2009 18:39:54 -0600
>>>>>> From: Randy Halverson
>>>>>> Subject: [ofa-general] link width problem of Qlogic 9024 unmanaged
>>>>>> switch
>>>>>> To: "'general at lists.openfabrics.org'"
>>>>>> Message-ID:
>>>>>> <88EC963376E93B4DB0F2A69D932F786903D89CAF at MNEXMB2.qlogic.org>
>>>>>> Content-Type: text/plain; charset="us-ascii"
>>>>>>
>>>>>> Hello Yicheng,
>>>>>>
>>>>>> After checking internally, this appears to be a known problem with
>>>>>> older firmware for the 9024FC switches.
>>>>>>
>>>>>> It appears that you or another person at 'tmriusa.com' has recently
>>>>>> opened a case with QLogic Tech Support for this issue. Please continue
>>>>>> to work with QLogic Tech Support on firmware upgrade resolution since
>>>>>> you probably don't have our FastFabric Tools to manage the 9024FC
>>>>>> switches.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Randy
>>>>>> Technical Support
>>>>>> QLogic Corporation
>>>>>> -------------- next part --------------
>>>>>>
>>>>>>
>>>>>> _____________________________________________________________________________
>>>>>> Scanned by IBM Email Security Management Services powered by
>>>>>> MessageLabs. For more information please visit http://www.ers.ibm.com
>>>>>>
>>>>>>
>>>>>> _____________________________________________________________________________
>>>>>>
>>>>>>
>>>>>> _____________________________________________________________________________
>>>>>> Scanned by IBM Email Security Management Services powered by
>>>>>> MessageLabs.
For more information please visit http://www.ers.ibm.com
_____________________________________________________________________________
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From devel-ofed at morey-chaisemartin.com Fri Apr 17 02:26:32 2009
From: devel-ofed at morey-chaisemartin.com (Nicolas Morey-Chaisemartin)
Date: Fri, 17 Apr 2009 11:26:32 +0200
Subject: [ofa-general] Probable bug in mlx4 driver
Message-ID: <49E84B48.9050702@morey-chaisemartin.com>

Hi,

Over the last few days I've been trying to get ofa_kernel 1.4.1 rc3
working with a 2.6.29 kernel (FC11-beta). I picked some patches from
Linus' tree so that it works.
While getting some updates on mlx4_enable_msi_x (which seems too big for
the regular 2.6.29 kernel stack), I think I found a couple of errors.
I'm quite new to kernel programming so I may be wrong; however, I'm
pretty sure entries should be freed.
I'm not sure whether the second correction is necessary, but it seems
more coherent with the rest of the function and the previous
implementation.

This patch applies to Linus' tree (or Roland's InfiniBand tree on
kernel.org).

Moreover, I was wondering what the differences are between InfiniBand in
Linus' tree and in OFED's tree? I'm a bit lost among all the backport /
kernel patches plus git patches which are on one side but not the
other...

Regards

Nicolas
-------------- next part --------------
A non-text attachment was scrubbed...
Name: msi_x.patch
Type: application/mbox
Size: 517 bytes
Desc: not available
URL:

From vlad at lists.openfabrics.org Fri Apr 17 03:22:25 2009
From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox)
Date: Fri, 17 Apr 2009 03:22:25 -0700 (PDT)
Subject: [ofa-general] ofa_1_4_kernel 20090417-0200 daily build status
Message-ID: <20090417102225.70838E6119D@openfabrics.org>

This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git
git_branch: ofed_kernel

Common build parameters:

Passed:
Passed on i686 with linux-2.6.16
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.17
Passed on i686 with linux-2.6.22
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on x86_64 with linux-2.6.16
Passed on x86_64 with linux-2.6.16.43-0.3-smp
Passed on x86_64 with linux-2.6.16.21-0.8-smp
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.17
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-1.2798.fc6
Passed on x86_64 with linux-2.6.18-53.el5
Passed on x86_64 with linux-2.6.18-128.el5
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.18-8.el5
Passed on x86_64 with linux-2.6.18-93.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:

From or.gerlitz at gmail.com Fri Apr 17 04:05:25 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Fri, 17 Apr 2009 14:05:25 +0300
Subject: [ofa-general] Probable bug in mlx4 driver
In-Reply-To: <49E84B48.9050702@morey-chaisemartin.com>
References: <49E84B48.9050702@morey-chaisemartin.com>
Message-ID: <15ddcffd0904170405i2435c9c1wcfe90bbb9d251771@mail.gmail.com>

Nicolas Morey-Chaisemartin wrote:
> This patch goes on Linus' tree (or Roland's Infiniband tree on kernel.org)
> Moreover, I was wondering what are the differences between Infiniband in Linus' and in
> OFED's tree? I'm a bit lost between all the backport / kernel patches plus git patches
> which are on one side but not the other...

Many of the patches applied by ofed were never reviewed nor submitted
for acceptance, and I think this speaks for itself. You have mentioned
that you use 2.6.29 - may I ask what makes you work with ofed and not
with the upstream IB bits? Please note that the mainline IB drivers are
compatible with the user space libraries installed through the distro
and/or ofed.

Or.
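
[Editorial sketch for the "Probable bug in mlx4 driver" thread. The
msi_x.patch attachment was scrubbed from this archive, so the actual
change is not visible here. Purely as an illustration of the kind of fix
Nicolas describes (a temporary "entries" table that has to be freed on
every exit path once it no longer lives on the stack), the user-space C
below may help. All names in it (vec_entry, fake_enable_vectors,
enable_vectors) are hypothetical, and calloc/free stand in for the
kernel allocator; this is not the mlx4 code.]

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-vector table entry; not the kernel type. */
struct vec_entry {
	int index;
	int vector;
};

/*
 * Hypothetical stand-in for the call that requests n interrupt vectors.
 * Returns 0 on success, nonzero on failure (forced by should_fail).
 */
static int fake_enable_vectors(struct vec_entry *entries, int n, int should_fail)
{
	int i;

	if (should_fail)
		return -1;
	for (i = 0; i < n; i++)
		entries[i].vector = 100 + entries[i].index;
	return 0;
}

/*
 * Keeps the temporary table on the heap and releases it on every path.
 * Returns the number of vectors written to vectors_out, or 0 when the
 * caller should fall back to a single legacy interrupt.
 */
static int enable_vectors(int n, int should_fail, int *vectors_out)
{
	struct vec_entry *entries;
	int i, ret = 0;

	assert(n > 0);

	entries = calloc((size_t)n, sizeof(*entries));
	if (!entries)
		return 0;	/* nothing allocated, nothing to free */

	for (i = 0; i < n; i++)
		entries[i].index = i;

	if (fake_enable_vectors(entries, n, should_fail))
		goto out;	/* the failure path still reaches free() */

	for (i = 0; i < n; i++)
		vectors_out[i] = entries[i].vector;
	ret = n;

out:
	free(entries);		/* single release point for both outcomes */
	return ret;
}
```

[A single "out:" label is one common way to make sure the table is
released whether or not the enable call succeeds; whether the scrubbed
patch took this shape is not known from the archive.]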

From nicolas.morey-chaisemartin at ext.bull.net Fri Apr 17 04:09:45 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey-Chaisemartin)
Date: Fri, 17 Apr 2009 13:09:45 +0200
Subject: [ofa-general] Probable bug in mlx4 driver
In-Reply-To: <15ddcffd0904170405i2435c9c1wcfe90bbb9d251771@mail.gmail.com>
References: <49E84B48.9050702@morey-chaisemartin.com> <15ddcffd0904170405i2435c9c1wcfe90bbb9d251771@mail.gmail.com>
Message-ID: <49E86379.1000604@ext.bull.net>

On 17/04/2009 13:05, Or Gerlitz wrote:
> Nicolas Morey-Chaisemartin wrote:
>
>> This patch goes on Linus' tree (or Roland's Infiniband tree on kernel.org)
>> Moreover, I was wondering what are the differences between Infiniband in Linus' and in
>> OFED's tree? I'm a bit lost between all the backport / kernel patches plus git patches
>> which are on one side but not the other...
>
> Many of the patches applied by ofed were never reviewed nor
> submitted for acceptance and I think this speaks for itself. You have
> mentioned that you use 2.6.29 - may I ask what makes you work with
> ofed and not with the upstream IB bits? please note that the mainline
> IB drivers are compatible with the user space libraries installed
> through the distro and/or ofed.
>
> Or.
>
>
Two main reasons:
- The main one is that we also work with RHEL (4 and 5), and as we
  repackage ofa_kernel we would rather use only one version across all
  distributions.
- SDP is missing from Linux mainstream

Nicolas

From or.gerlitz at gmail.com Fri Apr 17 04:38:26 2009
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Fri, 17 Apr 2009 14:38:26 +0300
Subject: ***SPAM*** Re: [ofa-general] Probable bug in mlx4 driver
In-Reply-To: <49E86379.1000604@ext.bull.net>
References: <49E84B48.9050702@morey-chaisemartin.com> <15ddcffd0904170405i2435c9c1wcfe90bbb9d251771@mail.gmail.com> <49E86379.1000604@ext.bull.net>
Message-ID: <15ddcffd0904170438w9228613te36683c796958130@mail.gmail.com>

Nicolas Morey-Chaisemartin wrote:
> SDP is missing from Linux mainstream

I wonder what SDP buys you vs IPoIB connected mode?

Or.

From sashak at voltaire.com Fri Apr 17 05:03:04 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 17 Apr 2009 15:03:04 +0300
Subject: [ofa-general] ***SPAM*** Re: [PATCH] change missed LID conversion functions from hex to uint (WAS: Re: [PATCH] Update mad formatting functions.)
In-Reply-To: <20090416092329.53b2628a.weiny2@llnl.gov>
References: <20090311144404.bf15ba8b.weiny2@llnl.gov> <20090415153003.GA20857@sk> <20090415140341.dd26d8dc.weiny2@llnl.gov> <20090416092329.53b2628a.weiny2@llnl.gov>
Message-ID: <20090417120304.GC17631@sk>

On 09:23 Thu 16 Apr , Ira Weiny wrote:
>
> From: Ira Weiny
> Date: Thu, 16 Apr 2009 09:15:20 -0700
> Subject: [PATCH] change missed LID conversion functions from hex to uint
>
>
> Signed-off-by: Ira Weiny

Applied. Thanks.

Sasha

From sashak at voltaire.com Fri Apr 17 05:07:10 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 17 Apr 2009 15:07:10 +0300
Subject: [ofa-general] ***SPAM*** Re: [PATCH] infiniband-diags/man/vendstat.8: Fix PortXmit/RcvDataSL examples
In-Reply-To: <20090416130145.GA31864@comcast.net>
References: <20090416130145.GA31864@comcast.net>
Message-ID: <20090417120710.GD17631@sk>

On 09:01 Thu 16 Apr , Hal Rosenstock wrote:
>
> Signed-off-by: Hal Rosenstock

Applied. Thanks.

Sasha

From nicolas.morey-chaisemartin at ext.bull.net Fri Apr 17 07:17:31 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey Chaisemartin)
Date: Fri, 17 Apr 2009 16:17:31 +0200
Subject: [ofa-general] [PATCH] Fixed capability mask problem in ibsim introduec by commit 722b6c6428c9e4921a81f4a6db2838bcee660bb7
Message-ID: <49E88F7B.5010900@ext.bull.net>

Signed-off-by: Nicolas Morey-Chaisemartin
---
I don't know if compilation on WinOF still works with this patch, as I
have no way to test it, but it fixes the problem for Linux.
If it doesn't work anymore, the ntohll result should be shifted right by
32 bits (>>32) before being cast to unsigned.

 infiniband-diags/src/ibstat.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/infiniband-diags/src/ibstat.c b/infiniband-diags/src/ibstat.c
index 7985be1..99af9a8 100644
--- a/infiniband-diags/src/ibstat.c
+++ b/infiniband-diags/src/ibstat.c
@@ -111,7 +111,7 @@ port_dump(umad_port_t *port, int alone)
 	printf("%sBase lid: %d\n", pre, port->base_lid);
 	printf("%sLMC: %d\n", pre, port->lmc);
 	printf("%sSM lid: %d\n", pre, port->sm_lid);
-	printf("%sCapability mask: 0x%08x\n", pre, (unsigned)ntohll(port->capmask));
+	printf("%sCapability mask: 0x%08x\n", pre, (unsigned)(ntohl((uint32_t)(port->capmask))));
 	printf("%sPort GUID: 0x%016llx\n", pre, (long long unsigned)ntohll(port->port_guid));
 	return 0;
 }
-- 
1.6.2-rc2.GIT

From sashak at voltaire.com Fri Apr 17 07:57:17 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 17 Apr 2009 17:57:17 +0300
Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Re: [PATCHv2] opensm/PerfMgr: Better redirection support
In-Reply-To:
References: <20090312202134.GC25024@comcast.net> <20090415125925.GF7353@sk> <20090416005422.GC10146@sk>
Message-ID: <20090417145717.GG17631@sk>

On 09:34 Thu 16 Apr , Hal Rosenstock wrote:
> >
> > Yes, and you can use lid value as such flag - just simpler.
>
> When GID redirection is specified by client, LID must be 0 so I don't see this.

1. GID redirection is not implemented in this patch.
2. In any case you will need to resolve the LID value (using the GID) in
   order to send a MAD.

So LID = 0 can be used as an invalid redirection data flag. But this was
a minor comment.

> > My point was different - to separate redirection related data from main
> > flow.
>
> I'm still not sure what you mean by this. Encapsulate the redirection
> data better so it is obtained by some potentially common routine ?

Yes. And also to not use "fake" redirection fields (specifically
pkey_ix) in the non-redirected flow - this is why I think you need a
'port' structure.

> >> > PerfMgr is always running over discovered fabric so maybe local port
> >> > number should be detected later at start of PerfMgr process cycle just
> >> > using OpenSM DB.
> >>
> >> Why is that better than doing this at bind time of PerfMgr ?
> >
> > At least two reasons: faster and less code.
>
> Are you sure the OpenSM DB accesses will be faster than the vendor calls here ?

Yes, it is a direct memory read versus opening and parsing many files
(+ memory allocations, etc.).

> Is bind performance sensitive anyhow ?

Not at all, but all you need here is just the local port number - and
40 (or so) lines of code (80% of which is duplicated with pkey
validation) for doing this looks like overkill to me (not in the sense
of performance).

> The performance comment is
> clearly relevant to the main flow though.

Sure, but there you just need to read a value.

> >> > Also what about letting "chance" for port to refresh redirection info?
> >>
> >> What do you mean ?
> >
> > When port has invalid redirection data, should you care about attempting
> > to refresh this?
>
> If the PMA gives bad redirection data (which BTW is noncompliant), it
> seems likely to do this again so I'm not sure about the value of this.
> Do you think that's a better thing to do ?

I don't have a clear opinion (and so asked).
Actually, if I understood your code correctly, this means that if some
port once gets bad redirection data it will be dropped from the PerfMgr
cycle forever, right?

> >> Redirection does not occur frequently.
> >
> > How could we know:)
>
> It's the current use case for PerfMgt.

Let's suppose it happens just three times per one PerfMgr cycle - 3 > 1
anyway.

Another important advantage is that the pkey tables are prepared
*before* the actual PerfMgr cycle and will not slow down the querying
itself.

Another thought - could p_physp->pkeys be used for index
detection/validation?

> > When OpenSM is in master mode it cannot change (PerfMgr is synchronized
> > with heavy sweep).
> >
> > It is possible with standby OpenSM, so what - this single request will
> > fail once.
>
> Some recovery for such failure would be needed.

Not really - the next PerfMgr cycle will fetch valid data.

> Also, what about not active ?

Same as standby (let's call them "non-master" modes).

> >> > All above are not OpenSM errors, but wrong external data. I think it
> >> > should be logged as VERBOSE messages.
> >>
> >> I agree it's wrong external data but it seems serious enough to me to
> >> treat as an error.
> >
> > And some stupid port will be able to put OpenSM in endless error
> > printing. I don't think it is a good idea.
>
> It would be a non compliant PMA which I would think we'd want to know
> about sooner rather than later.

If an admin wants to care about this (and also about other things of
this sort), he/she will turn verbosity "on".

> >> Seems like some sort of configuration error to me if this is disabled
> >> at the manager but the PMA wants to use it.
> >
> > PMA shouldn't dictate here.
>
> PMA does dictate redirection. Manager has no way to shut it off.

But it should be able to ignore this (including "noisy" logging).

> If manager turns off its handling of redirection, then it just doesn't
> work (that port is inaccessible by the manager). This argues for the
> default to be enabled.
> The current default is disabled since this code
> was deemed experimental.

Right, and it should be consistent with this (now default) setting.

> >> > BTW, why to bother with verifying redirection info when redirection
> >> > support is disabled anyway?
> >>
> >> I thought it was useful to know the redirection info was invalid
> >> rather than getting the disabled notification and then enabling and
> >> finding out.
> >
> > For PMAs debug purposes redirection support should be switched "on"
> > obviously.
>
> Why do you say debug purposes ? Isn't it any purpose ?

I meant PMA support + PMA debug.

Sasha

From hnrose at comcast.net Fri Apr 17 08:17:05 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 17 Apr 2009 11:17:05 -0400
Subject: [ofa-general] [PATCH] opensm: Some cosmetic formatting changes
Message-ID: <20090417151704.GB7875@comcast.net>


Signed-off-by: Hal Rosenstock
---
Resending with list added

diff --git a/opensm/opensm/osm_pkey_rcv.c b/opensm/opensm/osm_pkey_rcv.c
index e8e030e..cf92a3c 100644
--- a/opensm/opensm/osm_pkey_rcv.c
+++ b/opensm/opensm/osm_pkey_rcv.c
@@ -131,9 +131,8 @@ void osm_pkey_rcv_process(IN void *context, IN void *data)
 		goto Exit;
 	}
 
-	osm_dump_pkey_block(sm->p_log,
-			    port_guid, block_num,
-			    port_num, p_pkey_tbl, OSM_LOG_DEBUG);
+	osm_dump_pkey_block(sm->p_log, port_guid, block_num, port_num,
+			    p_pkey_tbl, OSM_LOG_DEBUG);
 
 	osm_physp_set_pkey_tbl(sm->p_log, sm->p_subn, p_physp,
 			       p_pkey_tbl, block_num);
diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c
index 06dd10d..7b6fb1a 100644
--- a/opensm/opensm/osm_port_info_rcv.c
+++ b/opensm/opensm/osm_port_info_rcv.c
@@ -452,8 +452,7 @@ static void pi_rcv_process_set(IN osm_sm_t * sm, IN osm_node_t * p_node,
 		OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 0F10: "
 			"Received error status for SetResp()\n");
 	}
-	osm_dump_port_info(sm->p_log,
-			   osm_node_get_node_guid(p_node),
+	osm_dump_port_info(sm->p_log, osm_node_get_node_guid(p_node),
 			   port_guid, port_num, p_pi, level);
 }
 
@@ -503,8 +502,8 @@ void osm_pi_rcv_process(IN void *context, IN void *data)
 	port_guid = p_context->port_guid;
 	node_guid = p_context->node_guid;
 
-	osm_dump_port_info(sm->p_log,
-			   node_guid, port_guid, port_num, p_pi, OSM_LOG_DEBUG);
+	osm_dump_port_info(sm->p_log, node_guid, port_guid, port_num, p_pi,
+			   OSM_LOG_DEBUG);
 
 	/* On receipt of client reregister, clear the reregister bit so
 	   reregistering won't be sent again and again */

From hnrose at comcast.net Fri Apr 17 08:26:17 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 17 Apr 2009 11:26:17 -0400
Subject: [ofa-general] [PATCH] opensm: Eliminate duplicated calls to osm_log_is_active in SA modules
Message-ID: <20090417152617.GA8275@comcast.net>


Helper routines call this routine

Signed-off-by: Hal Rosenstock
---
diff --git a/opensm/opensm/osm_sa.c b/opensm/opensm/osm_sa.c
index 3521132..202a38e 100644
--- a/opensm/opensm/osm_sa.c
+++ b/opensm/opensm/osm_sa.c
@@ -388,9 +388,7 @@ osm_sa_send_error(IN osm_sa_t * sa,
 	if (p_resp_sa_mad->attr_id == IB_MAD_ATTR_MULTIPATH_RECORD)
 		p_resp_sa_mad->attr_id = IB_MAD_ATTR_PATH_RECORD;
 
-	if (osm_log_is_active(sa->p_log, OSM_LOG_FRAMES))
-		osm_dump_sa_mad(sa->p_log, p_resp_sa_mad, OSM_LOG_FRAMES);
-
+	osm_dump_sa_mad(sa->p_log, p_resp_sa_mad, OSM_LOG_FRAMES);
 	osm_sa_send(sa, p_resp_madw, FALSE);
 
 Exit:
diff --git a/opensm/opensm/osm_sa_class_port_info.c b/opensm/opensm/osm_sa_class_port_info.c
index d2ab96a..7ad08ef 100644
--- a/opensm/opensm/osm_sa_class_port_info.c
+++ b/opensm/opensm/osm_sa_class_port_info.c
@@ -165,8 +165,7 @@ static void cpi_rcv_respond(IN osm_sa_t * sa, IN const osm_madw_t * p_madw)
 	p_resp_cpi->cap_mask |= OSM_CAP_IS_UD_MCAST_SUP;
 	p_resp_cpi->cap_mask = cl_hton16(p_resp_cpi->cap_mask);
 
-	if (osm_log_is_active(sa->p_log, OSM_LOG_FRAMES))
-		osm_dump_sa_mad(sa->p_log, p_resp_sa_mad, OSM_LOG_FRAMES);
+	osm_dump_sa_mad(sa->p_log, p_resp_sa_mad, OSM_LOG_FRAMES);
 
 	osm_sa_send(sa, p_resp_madw, FALSE);
 
diff --git a/opensm/opensm/osm_sa_link_record.c b/opensm/opensm/osm_sa_link_record.c
index bf0b5ee..20b94bd 100644
--- a/opensm/opensm/osm_sa_link_record.c
+++ b/opensm/opensm/osm_sa_link_record.c
@@ -465,8 +465,7 @@ void osm_lr_rcv_process(IN void *context, IN void *data)
 		goto Exit;
 	}
 
-	if (osm_log_is_active(sa->p_log, OSM_LOG_DEBUG))
-		osm_dump_link_record(sa->p_log, p_lr, OSM_LOG_DEBUG);
+	osm_dump_link_record(sa->p_log, p_lr, OSM_LOG_DEBUG);
 
 	cl_qlist_init(&lr_list);
 
diff --git a/opensm/opensm/osm_sa_mad_ctrl.c b/opensm/opensm/osm_sa_mad_ctrl.c
index eeec51c..a791402 100644
--- a/opensm/opensm/osm_sa_mad_ctrl.c
+++ b/opensm/opensm/osm_sa_mad_ctrl.c
@@ -315,8 +315,7 @@ static void sa_mad_ctrl_rcv_callback(IN osm_madw_t * p_madw, IN void *context,
 
 	p_sa_mad = osm_madw_get_sa_mad_ptr(p_madw);
 
-	if (osm_log_is_active(p_ctrl->p_log, OSM_LOG_FRAMES))
-		osm_dump_sa_mad(p_ctrl->p_log, p_sa_mad, OSM_LOG_FRAMES);
+	osm_dump_sa_mad(p_ctrl->p_log, p_sa_mad, OSM_LOG_FRAMES);
 
 	/*
 	 * C15-0.1.5 - Table 185: SA Header - p884
diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c
index 5543221..2cde504 100644
--- a/opensm/opensm/osm_sa_mcmember_record.c
+++ b/opensm/opensm/osm_sa_mcmember_record.c
@@ -1331,8 +1331,7 @@ static void mcmr_rcv_join_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw)
 	}
 	/* failed to route */
 
-	if (osm_log_is_active(sa->p_log, OSM_LOG_DEBUG))
-		osm_dump_mc_record(sa->p_log, &mcmember_rec, OSM_LOG_DEBUG);
+	osm_dump_mc_record(sa->p_log, &mcmember_rec, OSM_LOG_DEBUG);
 
 	mcmr_rcv_respond(sa, p_madw, &mcmember_rec);
 
diff --git a/opensm/opensm/osm_sa_multipath_record.c b/opensm/opensm/osm_sa_multipath_record.c
index 737d892..59bed2b 100644
--- a/opensm/opensm/osm_sa_multipath_record.c
+++ b/opensm/opensm/osm_sa_multipath_record.c
@@ -1454,8 +1454,7 @@ void osm_mpr_rcv_process(IN void *context, IN void *data)
 		goto Exit;
 	}
 
-	if (osm_log_is_active(sa->p_log, OSM_LOG_DEBUG))
-		osm_dump_multipath_record(sa->p_log, p_mpr, OSM_LOG_DEBUG);
+	osm_dump_multipath_record(sa->p_log, p_mpr, OSM_LOG_DEBUG);
 
 	cl_qlist_init(&pr_list);
 
diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c
index 6e7d5f6..f3146ed 100644
--- a/opensm/opensm/osm_sa_path_record.c
+++ b/opensm/opensm/osm_sa_path_record.c
@@ -1628,8 +1628,7 @@ void osm_pr_rcv_process(IN void *context, IN void *data)
 		goto Exit;
 	}
 
-	if (osm_log_is_active(sa->p_log, OSM_LOG_DEBUG))
-		osm_dump_path_record(sa->p_log, p_pr, OSM_LOG_DEBUG);
+	osm_dump_path_record(sa->p_log, p_pr, OSM_LOG_DEBUG);
 
 	cl_qlist_init(&pr_list);
 
diff --git a/opensm/opensm/osm_sa_service_record.c b/opensm/opensm/osm_sa_service_record.c
index b3c39b0..02496c1 100644
--- a/opensm/opensm/osm_sa_service_record.c
+++ b/opensm/opensm/osm_sa_service_record.c
@@ -475,9 +475,7 @@ static void sr_rcv_process_get_method(osm_sa_t * sa, IN osm_madw_t * p_madw)
 	p_recvd_service_rec =
 	    (ib_service_record_t *) ib_sa_mad_get_payload_ptr(p_sa_mad);
 
-	if (osm_log_is_active(sa->p_log, OSM_LOG_DEBUG))
-		osm_dump_service_record(sa->p_log, p_recvd_service_rec,
-					OSM_LOG_DEBUG);
+	osm_dump_service_record(sa->p_log, p_recvd_service_rec, OSM_LOG_DEBUG);
 
 	cl_qlist_init(&sr_match_item.sr_list);
 	sr_match_item.p_service_rec = p_recvd_service_rec;
@@ -530,9 +528,7 @@ static void sr_rcv_process_set_method(osm_sa_t * sa, IN osm_madw_t * p_madw)
 
 	comp_mask = p_sa_mad->comp_mask;
 
-	if (osm_log_is_active(sa->p_log, OSM_LOG_DEBUG))
-		osm_dump_service_record(sa->p_log, p_recvd_service_rec,
-					OSM_LOG_DEBUG);
+	osm_dump_service_record(sa->p_log, p_recvd_service_rec, OSM_LOG_DEBUG);
 
 	if ((comp_mask & (IB_SR_COMPMASK_SID | IB_SR_COMPMASK_SGID)) !=
 	    (IB_SR_COMPMASK_SID | IB_SR_COMPMASK_SGID)) {
@@ -634,9 +630,7 @@ static void sr_rcv_process_delete_method(osm_sa_t * sa, IN osm_madw_t * p_madw)
 
 	comp_mask = p_sa_mad->comp_mask;
 
-	if (osm_log_is_active(sa->p_log, OSM_LOG_DEBUG))
-		osm_dump_service_record(sa->p_log, p_recvd_service_rec,
-					OSM_LOG_DEBUG);
+	osm_dump_service_record(sa->p_log, p_recvd_service_rec, OSM_LOG_DEBUG);
 
 	/* Grab the lock */
 	cl_plock_excl_acquire(sa->p_lock);
diff --git a/opensm/opensm/osm_sa_sminfo_record.c b/opensm/opensm/osm_sa_sminfo_record.c
index 4d454af..9f11c91 100644
--- a/opensm/opensm/osm_sa_sminfo_record.c
+++ b/opensm/opensm/osm_sa_sminfo_record.c
@@ -216,8 +216,7 @@ void osm_smir_rcv_process(IN void *ctx, IN void *data)
 		goto Exit;
 	}
 
-	if (osm_log_is_active(sa->p_log, OSM_LOG_DEBUG))
-		osm_dump_sm_info_record(sa->p_log, p_rcvd_rec, OSM_LOG_DEBUG);
+	osm_dump_sm_info_record(sa->p_log, p_rcvd_rec, OSM_LOG_DEBUG);
 
 	p_smi = &p_rcvd_rec->sm_info;
 
diff --git a/opensm/opensm/osm_sa_sw_info_record.c b/opensm/opensm/osm_sa_sw_info_record.c
index 2ea8baf..e6ac7fe 100644
--- a/opensm/opensm/osm_sa_sw_info_record.c
+++ b/opensm/opensm/osm_sa_sw_info_record.c
@@ -239,9 +239,7 @@ void osm_sir_rcv_process(IN void *ctx, IN void *data)
 		goto Exit;
 	}
 
-	if (osm_log_is_active(sa->p_log, OSM_LOG_DEBUG))
-		osm_dump_switch_info_record(sa->p_log, p_rcvd_rec,
-					    OSM_LOG_DEBUG);
+	osm_dump_switch_info_record(sa->p_log, p_rcvd_rec, OSM_LOG_DEBUG);
 
 	cl_qlist_init(&rec_list);
 

From hnrose at comcast.net Fri Apr 17 08:11:39 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Fri, 17 Apr 2009 11:11:39 -0400
Subject: [ofa-general] ***SPAM*** [PATCH] opensm/osm_helper.c: Convert remaining helper routines for GID printing format
Message-ID: <20090417151139.GA7537@comcast.net>


Signed-off-by: Hal Rosenstock
---
diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c
index 0dc8055..ae5a703 100644
--- a/opensm/opensm/osm_helper.c
+++ b/opensm/opensm/osm_helper.c
@@ -1085,13 +1085,13 @@ void osm_dump_path_record(IN osm_log_t * p_log, IN const ib_path_rec_t * p_pr,
 			  IN const osm_log_level_t log_level)
 {
 	if (osm_log_is_active(p_log, log_level)) {
+		char gid_str[INET6_ADDRSTRLEN];
+		char gid_str2[INET6_ADDRSTRLEN];
 		osm_log(p_log, log_level,
 			"PathRecord dump:\n"
 			"\t\t\t\tservice_id..............0x%016" PRIx64 "\n"
-			"\t\t\t\tdgid....................0x%016" PRIx64 " : "
-			"0x%016" PRIx64 "\n"
-			"\t\t\t\tsgid....................0x%016" PRIx64 " : "
-			"0x%016" PRIx64 "\n"
+			"\t\t\t\tdgid....................%s\n"
+			"\t\t\t\tsgid....................%s\n"
 			"\t\t\t\tdlid....................%u\n"
 			"\t\t\t\tslid....................%u\n"
 			"\t\t\t\thop_flow_raw............0x%X\n"
@@ -1107,10 +1107,10 @@ void osm_dump_path_record(IN osm_log_t * p_log, IN const ib_path_rec_t * p_pr,
 			"\t\t\t\tresv2...................0x%X\n"
 			"\t\t\t\tresv3...................0x%X\n",
 			cl_ntoh64(p_pr->service_id),
-			cl_ntoh64(p_pr->dgid.unicast.prefix),
-			cl_ntoh64(p_pr->dgid.unicast.interface_id),
-			cl_ntoh64(p_pr->sgid.unicast.prefix),
-			cl_ntoh64(p_pr->sgid.unicast.interface_id),
+			inet_ntop(AF_INET6, p_pr->dgid.raw, gid_str,
+				sizeof gid_str),
+			inet_ntop(AF_INET6, p_pr->sgid.raw, gid_str2,
+				sizeof gid_str2),
 			cl_ntoh16(p_pr->dlid),
 			cl_ntoh16(p_pr->slid),
 			cl_ntoh32(p_pr->hop_flow_raw),
@@ -1135,6 +1135,7 @@ void osm_dump_multipath_record(IN osm_log_t * p_log,
 			       IN const osm_log_level_t log_level)
 {
 	if (osm_log_is_active(p_log, log_level)) {
+		char gid_str[INET6_ADDRSTRLEN];
 		char buf_line[1024];
 		ib_gid_t const *p_gid = p_mpr->gids;
 		int i, n = 0;
@@ -1143,11 +1144,9 @@ void osm_dump_multipath_record(IN osm_log_t * p_log,
 			for (i = 0; i < p_mpr->sgid_count; i++) {
 				n += sprintf(buf_line + n,
 					     "\t\t\t\tsgid%02d.................."
-					     "0x%016" PRIx64 " : 0x%016" PRIx64
-					     "\n", i + 1,
-					     cl_ntoh64(p_gid->unicast.prefix),
-					     cl_ntoh64(p_gid->unicast.
-						       interface_id));
+					     "%s\n", i + 1,
+					     inet_ntop(AF_INET6, p_gid->raw,
+						gid_str, sizeof gid_str));
 				p_gid++;
 			}
 		}
@@ -1155,11 +1154,9 @@ void osm_dump_multipath_record(IN osm_log_t * p_log,
 			for (i = 0; i < p_mpr->dgid_count; i++) {
 				n += sprintf(buf_line + n,
 					     "\t\t\t\tdgid%02d.................."
-					     "0x%016" PRIx64 " : 0x%016" PRIx64
-					     "\n", i + 1,
-					     cl_ntoh64(p_gid->unicast.prefix),
-					     cl_ntoh64(p_gid->unicast.
- interface_id)); + "%s\n", i + 1, + inet_ntop(AF_INET6, p_gid->raw, + gid_str, sizeof gid_str)); p_gid++; } } @@ -1343,15 +1340,14 @@ void osm_dump_inform_info(IN osm_log_t * p_log, if (osm_log_is_active(p_log, log_level)) { uint32_t qpn; uint8_t resp_time_val; - + char gid_str[INET6_ADDRSTRLEN]; ib_inform_info_get_qpn_resp_time(p_ii->g_or_v.generic. qpn_resp_time_val, &qpn, &resp_time_val); if (p_ii->is_generic) { osm_log(p_log, log_level, "InformInfo dump:\n" - "\t\t\t\tgid.....................0x%016" PRIx64 - " : 0x%016" PRIx64 "\n" + "\t\t\t\tgid.....................%s\n" "\t\t\t\tlid_range_begin.........%u\n" "\t\t\t\tlid_range_end...........%u\n" "\t\t\t\tis_generic..............0x%X\n" @@ -1361,8 +1357,8 @@ void osm_dump_inform_info(IN osm_log_t * p_log, "\t\t\t\tqpn.....................0x%06X\n" "\t\t\t\tresp_time_val...........0x%X\n" "\t\t\t\tnode_type...............0x%06X\n" "", - cl_ntoh64(p_ii->gid.unicast.prefix), - cl_ntoh64(p_ii->gid.unicast.interface_id), + inet_ntop(AF_INET6, p_ii->gid.raw, gid_str, + sizeof gid_str), cl_ntoh16(p_ii->lid_range_begin), cl_ntoh16(p_ii->lid_range_end), p_ii->is_generic, p_ii->subscribe, @@ -1373,8 +1369,7 @@ void osm_dump_inform_info(IN osm_log_t * p_log, } else { osm_log(p_log, log_level, "InformInfo dump:\n" - "\t\t\t\tgid.....................0x%016" PRIx64 - " : 0x%016" PRIx64 "\n" + "\t\t\t\tgid.....................%s\n" "\t\t\t\tlid_range_begin.........%u\n" "\t\t\t\tlid_range_end...........%u\n" "\t\t\t\tis_generic..............0x%X\n" @@ -1384,8 +1379,8 @@ void osm_dump_inform_info(IN osm_log_t * p_log, "\t\t\t\tqpn.....................0x%06X\n" "\t\t\t\tresp_time_val...........0x%X\n" "\t\t\t\tvendor_id...............0x%06X\n" "", - cl_ntoh64(p_ii->gid.unicast.prefix), - cl_ntoh64(p_ii->gid.unicast.interface_id), + inet_ntop(AF_INET6, p_ii->gid.raw, gid_str, + sizeof gid_str), cl_ntoh16(p_ii->lid_range_begin), cl_ntoh16(p_ii->lid_range_end), p_ii->is_generic, p_ii->subscribe, @@ -1706,6 +1701,8 @@ void 
osm_dump_notice(IN osm_log_t * p_log, return; if (ib_notice_is_generic(p_ntci)) { + char gid_str[INET6_ADDRSTRLEN]; + char gid_str2[INET6_ADDRSTRLEN]; char buff[1024]; int n; buff[0] = '\0'; @@ -1717,12 +1714,10 @@ void osm_dump_notice(IN osm_log_t * p_log, case 66: case 67: sprintf(buff, - "\t\t\t\tsrc_gid..................0x%016" - PRIx64 ":0x%016" PRIx64 "\n", - cl_ntoh64(p_ntci->data_details. - ntc_64_67.gid.unicast.prefix), - cl_ntoh64(p_ntci->data_details. - ntc_64_67.gid.unicast.interface_id)); + "\t\t\t\tsrc_gid..................%s\n", + inet_ntop(AF_INET6, p_ntci->data_details. + ntc_64_67.gid.raw, gid_str, + sizeof gid_str)); break; case 128: sprintf(buff, @@ -1815,10 +1810,8 @@ void osm_dump_notice(IN osm_log_t * p_log, "\t\t\t\tsl.......................%d\n" "\t\t\t\tqp1......................0x%x\n" "\t\t\t\tqp2......................0x%x\n" - "\t\t\t\tgid1.....................0x%016" PRIx64 - " : " "0x%016" PRIx64 "\n" - "\t\t\t\tgid2.....................0x%016" PRIx64 - " : " "0x%016" PRIx64 "\n", + "\t\t\t\tgid1.....................%s\n" + "\t\t\t\tgid2.....................%s\n", cl_ntoh16(p_ntci->data_details.ntc_257_258. lid1), cl_ntoh16(p_ntci->data_details.ntc_257_258. @@ -1829,14 +1822,12 @@ void osm_dump_notice(IN osm_log_t * p_log, cl_ntoh32(p_ntci->data_details.ntc_257_258. qp1) & 0xffffff, cl_ntoh32(p_ntci->data_details.ntc_257_258.qp2), - cl_ntoh64(p_ntci->data_details.ntc_257_258.gid1. - unicast.prefix), - cl_ntoh64(p_ntci->data_details.ntc_257_258.gid1. - unicast.interface_id), - cl_ntoh64(p_ntci->data_details.ntc_257_258.gid2. - unicast.prefix), - cl_ntoh64(p_ntci->data_details.ntc_257_258.gid2. - unicast.interface_id)); + inet_ntop(AF_INET6, p_ntci->data_details. + ntc_257_258.gid1.raw, gid_str, + sizeof gid_str), + inet_ntop(AF_INET6, p_ntci->data_details. 
+ ntc_257_258.gid2.raw, gid_str2, + sizeof gid_str2)); break; case 259: sprintf(buff, @@ -1847,10 +1838,8 @@ void osm_dump_notice(IN osm_log_t * p_log, "\t\t\t\tsl.......................%d\n" "\t\t\t\tqp1......................0x%x\n" "\t\t\t\tqp2......................0x%x\n" - "\t\t\t\tgid1.....................0x%016" PRIx64 - " : " "0x%016" PRIx64 "\n" - "\t\t\t\tgid2.....................0x%016" PRIx64 - " : " "0x%016" PRIx64 "\n" + "\t\t\t\tgid1.....................%s\n" + "\t\t\t\tgid2.....................%s\n" "\t\t\t\tsw_lid...................%u\n" "\t\t\t\tport_no..................%u\n", cl_ntoh16(p_ntci->data_details.ntc_259. @@ -1863,14 +1852,12 @@ void osm_dump_notice(IN osm_log_t * p_log, cl_ntoh32(p_ntci->data_details.ntc_259. sl_qp1) & 0xffffff, cl_ntoh32(p_ntci->data_details.ntc_259.qp2), - cl_ntoh64(p_ntci->data_details.ntc_259.gid1. - unicast.prefix), - cl_ntoh64(p_ntci->data_details.ntc_259.gid1. - unicast.interface_id), - cl_ntoh64(p_ntci->data_details.ntc_259.gid2. - unicast.prefix), - cl_ntoh64(p_ntci->data_details.ntc_259.gid2. - unicast.interface_id), + inet_ntop(AF_INET6, p_ntci->data_details. + ntc_259.gid1.raw, gid_str, + sizeof gid_str), + inet_ntop(AF_INET6, p_ntci->data_details. + ntc_259.gid2.raw, gid_str2, + sizeof gid_str2), cl_ntoh16(p_ntci->data_details.ntc_259.sw_lid), p_ntci->data_details.ntc_259.port_no); break; From sashak at voltaire.com Fri Apr 17 08:31:10 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 17 Apr 2009 18:31:10 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] Fixed capability mask problem in ibsim introduec by commit 722b6c6428c9e4921a81f4a6db2838bcee660bb7 In-Reply-To: <49E88F7B.5010900@ext.bull.net> References: <49E88F7B.5010900@ext.bull.net> Message-ID: <20090417153110.GH17631@sk> Nicolas, Did you mean 'ibstat' (ibstead if 'ibsim') in Subject? 
Sasha On 16:17 Fri 17 Apr , Nicolas Morey Chaisemartin wrote: > > Signed-off-by: Nicolas Morey-Chaisemartin > --- > I don't know if compilation on WinOF is still working with this patch as I have no way to test it but it fixes the problem for Linux. > If it doesn't work anymore, ntohll result should be shift of 32 bits right (>>32) before being cast to unsigned. > > infiniband-diags/src/ibstat.c | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/infiniband-diags/src/ibstat.c b/infiniband-diags/src/ibstat.c > index 7985be1..99af9a8 100644 > --- a/infiniband-diags/src/ibstat.c > +++ b/infiniband-diags/src/ibstat.c > @@ -111,7 +111,7 @@ port_dump(umad_port_t *port, int alone) > printf("%sBase lid: %d\n", pre, port->base_lid); > printf("%sLMC: %d\n", pre, port->lmc); > printf("%sSM lid: %d\n", pre, port->sm_lid); > - printf("%sCapability mask: 0x%08x\n", pre, (unsigned)ntohll(port->capmask)); > + printf("%sCapability mask: 0x%08x\n", pre, (unsigned)(ntohl((uint32_t)(port->capmask)))); > printf("%sPort GUID: 0x%016llx\n", pre, (long long unsigned)ntohll(port->port_guid)); > return 0; > } > -- > 1.6.2-rc2.GIT > From hnrose at comcast.net Fri Apr 17 08:33:00 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Fri, 17 Apr 2009 11:33:00 -0400 Subject: [ofa-general] ***SPAM*** [PATCH] opensm/osm_sa_informinfo.c: Eliminate duplicated calls to osm_log_is_active Message-ID: <20090417153300.GA8748@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_sa_informinfo.c b/opensm/opensm/osm_sa_informinfo.c index a41e4ed..c31d3d4 100644 --- a/opensm/opensm/osm_sa_informinfo.c +++ b/opensm/opensm/osm_sa_informinfo.c @@ -345,9 +345,7 @@ static void infr_rcv_process_get_method(osm_sa_t * sa, IN osm_madw_t * p_madw) goto Exit; } - if (osm_log_is_active(sa->p_log, OSM_LOG_DEBUG)) - osm_dump_inform_info_record(sa->p_log, p_rcvd_rec, - OSM_LOG_DEBUG); + osm_dump_inform_info_record(sa->p_log, p_rcvd_rec, OSM_LOG_DEBUG); 
 	cl_qlist_init(&rec_list);
 
@@ -410,9 +408,7 @@ static void infr_rcv_process_set_method(osm_sa_t * sa, IN osm_madw_t * p_madw)
 	    (ib_inform_info_t *) ib_sa_mad_get_payload_ptr(p_sa_mad);
 
 #if 0
-	if (osm_log_is_active(sa->p_log, OSM_LOG_DEBUG))
-		osm_dump_inform_info(sa->p_log, p_recvd_inform_info,
-				     OSM_LOG_DEBUG);
+	osm_dump_inform_info(sa->p_log, p_recvd_inform_info, OSM_LOG_DEBUG);
 #endif
 
 	/* Grab the lock */

From devel-ofed at morey-chaisemartin.com Fri Apr 17 08:51:30 2009
From: devel-ofed at morey-chaisemartin.com (Nicolas Morey-Chaisemartin)
Date: Fri, 17 Apr 2009 17:51:30 +0200
Subject: [ofa-general] ***SPAM*** Re: [PATCH] Fixed capability mask problem in ibsim introduec by commit 722b6c6428c9e4921a81f4a6db2838bcee660bb7
In-Reply-To: <20090417153110.GH17631@sk>
References: <49E88F7B.5010900@ext.bull.net> <20090417153110.GH17631@sk>
Message-ID: <49E8A582.1050707@morey-chaisemartin.com>

On 17/04/2009 17:31, Sasha Khapyorsky wrote:
> Nicolas,
>
> Did you mean 'ibstat' (instead of 'ibsim') in Subject?
>
> Sasha
>
Yes, sorry for that. I had to leave quickly and wanted to submit before the weekend.
Nicolas From YJia at tmriusa.com Fri Apr 17 10:06:43 2009 From: YJia at tmriusa.com (Yicheng Jia) Date: Fri, 17 Apr 2009 12:06:43 -0500 Subject: ***SPAM*** Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch In-Reply-To: <20090416175303.2942bf40.weiny2@llnl.gov> Message-ID: Hi Ira, Here is the output of "iblinkinfo.pl -R": ++++++++++++++++++++++++++++++++++++++++++++++++ [root at ib_manager ~]# iblinkinfo.pl -R Switch 0x00066a00d90009c1 InfiniCon System InfinIO 9024 Lite: 7 1[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 6 1[ ] "MT2520 4 InfiniHostLx Mellanox Technologies" ( ) 2[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) 3[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) 4[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) 5[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) 6[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) 7[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) 8[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) 9[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) 10[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) 11[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) 12[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) 13[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) 14[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) 15[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) 16[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) 17[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) 18[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) 19[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) 20[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) 21[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) 22[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) 23[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) 24[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) ++++++++++++++++++++++++++++++++++++++++++++++++++++ And the "ibstat" output: +++++++++++++++++++++++++++++++++++ [root at ib_manager ~]# ibstat CA 'mthca0' CA type: MT25204 Number of ports: 1 Firmware version: 1.2.0 
Hardware version: a0 Node GUID: 0x0002c90200230784 System image GUID: 0x0002c90200230787 Port 1: State: Active Physical state: LinkUp Rate: 10 Base lid: 6 LMC: 0 SM lid: 6 Capability mask: 0x02500a6a Port GUID: 0x0002c90200230785 ++++++++++++++++++++++++++++++++++++++++++++++ The reset command I am using is "ibportstate 7 1 reset", I also tried "ibportstate -D 0,1 1 reset", and it fails with the same result. Thanks! Yicheng Jia Ira Weiny 04/16/2009 07:53 PM To Hal Rosenstock cc Yicheng Jia , general at lists.openfabrics.org Subject Re: ***SPAM*** Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch Yicheng, I am hoping your system is small when I ask; could you send the output from: "iblinkinfo.pl -R" When everything is up and running. Also ibstat from the node you are attempting the reset on. As well as the reset command you are using? Thanks, Ira On Thu, 16 Apr 2009 19:12:00 -0400 Hal Rosenstock wrote: > On Thu, Apr 16, 2009 at 6:06 PM, Yicheng Jia wrote: > > > >> There's a race condition here that I was asking about. If the link > >> initialization takes too long and doesn't complete (gets to init) > >> prior to the enable trying to be sent to the switch, then you could > >> see these results but since it's DOWN until reboot it's something > >> different. > > > > I did the "reset" when ports on both side of the link are in INIT state and > > LinkUp phys state. > > > >> If the disable/wait/enable worked that would've been another story. > > > > It fails too. Both ports go to DOWN after disable is issued and never come > > back. How long am I supposed to wait? > > Ideally you would see init before doing the enable but sounds like > that's not occuring. Either you need low level debug to see why the > link does not initialize at that point or get support from your > CA/switch vendor(s). What's your CA device ? > > -- Hal > > > Thanks! 
> > Yicheng Jia > > > > > > > > > > Hal Rosenstock > > > > 04/16/2009 03:26 PM > > > > To > > Yicheng Jia > > cc > > Nicolas Morey-Chaisemartin , > > general at lists.openfabrics.org > > Subject > > Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch > > > > > > > > > > On Thu, Apr 16, 2009 at 3:47 PM, Yicheng Jia wrote: > >> > >>> Are you resetting the switch from the peer HCA port or some other port > >>> ? That's what Nicolas asked but I might have missed the answer. > >> > >> Yes, I am trying to reset from the peer HCA port. Is anything wrong with > >> this? > > > > There's a race condition here that I was asking about. If the link > > initialization takes too long and doesn't complete (gets to init) > > prior to the enable trying to be sent to the switch, then you could > > see these results but since it's DOWN until reboot it's something > > different. > > > >>> Also, try disable (wait) and then enable and see if that works. > >> > >> It remains the same, the switch port is DOWN forever. No SMP massage could > >> get to the switch port. > > > > Right; in down, the SMP can't be sent. > > > >>> If I recall correctly, you had those links which are taking a long time > >>> to > >>> initialize. If the link stays down forever after disable, this won't > >>> work but I want to be sure. > >> > >> This is seperate issue. > > > > Since the link stays down yes. If the disable/wait/enable worked that > > would've been another story. > > > > -- Hal > > > >> The "reset" command is tested on a single port HCA > >> directly connected with Qlogic siwth. The HCA is plugged into a Linux > >> machine. It is the simplest test environment. > >> > >> Thanks! 
> >> Yicheng Jia > >> > >> > >> > >> > >> Hal Rosenstock > >> > >> 04/16/2009 02:29 PM > >> > >> To > >> Yicheng Jia > >> cc > >> Nicolas Morey-Chaisemartin , > >> general at lists.openfabrics.org > >> Subject > >> Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch > >> > >> > >> > >> > >> On Thu, Apr 16, 2009 at 3:20 PM, Hal Rosenstock > >> wrote: > >>> On Thu, Apr 16, 2009 at 3:18 PM, Yicheng Jia wrote: > >>>> > >>>> They both are POLLING before "reset". > >>> > >>> Then they _should_ come back to INIT. > >>> > >>> What does the local LDDS value say after reset ? Any way to get the > >>> switch port LDDS value ? > >> > >> Are you resetting the switch from the peer HCA port or some other port > >> ? That's what Nicolas asked but I might have missed the answer. > >> > >> Also, try disable (wait) and then enable and see if that works. If I > >> recall correctly, you had those links which are taking a long time to > >> initialize. If the link stays down forever after disable, this won't > >> work but I want to be sure. > >> > >> -- Hal > >> > >>> -- Hal > >>> > >>>> Thanks! > >>>> Yicheng Jia > >>>> > >>>> > >>>> > >>>> > >>>> Hal Rosenstock > >>>> > >>>> 04/16/2009 01:53 PM > >>>> > >>>> To > >>>> Yicheng Jia > >>>> cc > >>>> Nicolas Morey-Chaisemartin , > >>>> general at lists.openfabrics.org > >>>> Subject > >>>> Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch > >>>> > >>>> > >>>> > >>>> > >>>> On Thu, Apr 16, 2009 at 11:12 AM, Yicheng Jia wrote: > >>>>> > >>>>> Hi Nicolas, > >>>>> > >>>>> After this "reset" command, both ports are DOWN forever, I can only get > >>>>> portinfo from local port. > >>>>> > >>>>> I am sure that the port that has been reset is not the local port, > >>>>> otherwise > >>>>> it will prompt "node type not switch" error. > >>>>> > >>>>> I tried to enable this switch port from another port and brought it to > >>>>> POLLING state, but as long as I use "reset", both ports are DOWN. 
> >>>> > >>>> What are the peer port's LinkDownDefaultStates ? Sounds like one or > >>>> more must be Sleeping rather than Polling for some reason. > >>>> > >>>> -- Hal > >>>> > >>>>> Thanks! > >>>>> > >>>>> Yicheng Jia > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> Nicolas Morey-Chaisemartin > >>>>> > >>>>> 04/16/2009 12:43 AM > >>>>> > >>>>> To > >>>>> Yicheng Jia > >>>>> cc > >>>>> general at lists.openfabrics.org > >>>>> Subject > >>>>> Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> By any chances have you not reset the port you're on? > >>>>> Have you tried using another node to enable the port again? > >>>>> > >>>>> Nicolas > >>>>> > >>>>> Le 16/04/2009 00:45, Yicheng Jia a écrit : > >>>>>> > >>>>>> Hello Randy, > >>>>>> > >>>>>> I am trying to run "ibportstate reset" to reset the switch port on the > >>>>>> other side in order to get 4x link. However I get the following error: > >>>>>> ibwarn: [19660] mad_rpc: _do_madrpc failed; dport (Lid 7) > >>>>>> ibportstate: iberror: failed: smp set portinfo failed > >>>>>> > >>>>>> And the port status change to DOWN after this. Have you ever tried to > >>>>>> run "ibportstate" to reset the switch port? > >>>>>> > >>>>>> Thanks! > >>>>>> > >>>>>> Yicheng Jia > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> ------------------------------ > >>>>>> > >>>>>> Message: 2 > >>>>>> Date: Wed, 4 Mar 2009 18:39:54 -0600 > >>>>>> From: Randy Halverson > >>>>>> Subject: [ofa-general] link width problem of Qlogic 9024 unmanaged > >>>>>> switch > >>>>>> To: "'general at lists.openfabrics.org'" > >>>>>> Message-ID: > >>>>>> <88EC963376E93B4DB0F2A69D932F786903D89CAF at MNEXMB2.qlogic.org> > >>>>>> Content-Type: text/plain; charset="us-ascii" > >>>>>> > >>>>>> Hello Yicheng, > >>>>>> > >>>>>> After checking internally, this appears to be a known problem with > >>>>>> older > >>>>>> firmware for the 9024FC switches. 
> >>>>>>
> >>>>>> It appears that you or another person at 'tmriusa.com' has recently
> >>>>>> opened a case with QLogic Tech Support for this issue. Please continue
> >>>>>> to work with QLogic Tech Support on firmware upgrade resolution since
> >>>>>> you probably don't have our FastFabric Tools to manage the 9024FC
> >>>>>> switches.
> >>>>>>
> >>>>>> Regards,
> >>>>>>
> >>>>>> Randy
> >>>>>> Technical Support
> >>>>>> QLogic Corporation
> >>>>>> -------------- next part --------------
> >>>>>>
> >>>>>> _____________________________________________________________________________
> >>>>>> Scanned by IBM Email Security Management Services powered by
> >>>>>> MessageLabs. For more information please visit http://www.ers.ibm.com
> >>>>>> _____________________________________________________________________________
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> general mailing list
> >>>>>> general at lists.openfabrics.org
> >>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >>>>>>
> >>>>>> To unsubscribe, please visit
> >>>>>> http://openib.org/mailman/listinfo/openib-general
>
-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
weiny2 at llnl.gov

_____________________________________________________________________________
Scanned by IBM Email Security Management Services powered by MessageLabs.
For more information please visit http://www.ers.ibm.com
_____________________________________________________________________________
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sean.hefty at intel.com Fri Apr 17 10:39:49 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Fri, 17 Apr 2009 10:39:49 -0700
Subject: [ofa-general] [PATCH] rdma_cm: Add debugfs entries to monitor rdma_cm connections
In-Reply-To: <49E71FE1.90102@Voltaire.COM>
References: <49E71FE1.90102@Voltaire.COM>
Message-ID: 

If the path is:

/sys/kernel/debug/rdma_cm/mthca0_rdma_id

do we really need to append '_rdma_id' at the end?

(I'll defer to others if debugfs is the right location or not.)

>+	if (v == SEQ_START_TOKEN) {
>+		seq_printf(file,
>+			"%-3s"
>+			"%-8s"
>+			"%-3s"
>+			"%-5s"
>+			"%-52s"
>+			"%-52s"
>+			"%-5s"
>+			"%-3s"
>+			"%-8s"
>+			"\n",
>+			"TP", "DEV", "PO", "NDEV", "SRC", "DST", "PS", "ST", "QPN");

{snip}

>+	seq_printf(file,
>+		"%-3d"
>+		"%-8s"
>+		"%-3d"
>+		"%-5s"
>+		"%-52s"
>+		"%-52s"
>+		"%-5d"
>+		"%-3d"
>+		"%-8d"
>+		"\n",
>+		id_priv->id.route.addr.dev_addr.dev_type,
>+		(id_priv->id.device) ? id_priv->id.device->name : "",
>+		id_priv->id.port_num,
>+		(id_priv->id.route.addr.dev_addr.src_dev) ? id_priv->id.route.addr.dev_addr.src_dev->name : "",
>+		local_addr,
>+		remote_addr,
>+		id_priv->id.ps,
>+		id_priv->state,
>+		id_priv->qp_num);

nit: I'm not a big fan of one parameter per line. :)

It's not readily apparent to me what several of the headings are (TP, PO, PS,
ST) or what the numeric values map to (for TP, PS, ST).
From weiny2 at llnl.gov Fri Apr 17 10:44:29 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Fri, 17 Apr 2009 10:44:29 -0700 Subject: ***SPAM*** Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch In-Reply-To: References: <20090416175303.2942bf40.weiny2@llnl.gov> Message-ID: <20090417104429.666be0d7.weiny2@llnl.gov> On Fri, 17 Apr 2009 12:06:43 -0500 Yicheng Jia wrote: > Hi Ira, > > Here is the output of "iblinkinfo.pl -R": > > ++++++++++++++++++++++++++++++++++++++++++++++++ > [root at ib_manager ~]# iblinkinfo.pl -R > Switch 0x00066a00d90009c1 InfiniCon System InfinIO 9024 Lite: > 7 1[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 6 1[ ] "MT2520 4 InfiniHostLx Mellanox Technologies" ( ) > 2[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) > 3[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) > 4[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) [snip] > > And the "ibstat" output: > +++++++++++++++++++++++++++++++++++ > [root at ib_manager ~]# ibstat > CA 'mthca0' > CA type: MT25204 > Number of ports: 1 > Firmware version: 1.2.0 > Hardware version: a0 > Node GUID: 0x0002c90200230784 > System image GUID: 0x0002c90200230787 > Port 1: > State: Active > Physical state: LinkUp > Rate: 10 > Base lid: 6 > LMC: 0 > SM lid: 6 > Capability mask: 0x02500a6a > Port GUID: 0x0002c90200230785 > ++++++++++++++++++++++++++++++++++++++++++++++ > > The reset command I am using is "ibportstate 7 1 reset", I also tried > "ibportstate -D 0,1 1 reset", and it fails with the same result. > Yea that is going to be a problem. The problem is that effectively you just disabled the connection to the switch. A reset disables then enables the port. Once the port is disabled the command can't talk to the switch any longer. You will have to either reset the switch (power cycle) or go to another node and enable the port. From the output you sent me it looks like you don't have any other nodes on the switch, so I take it you are resetting the switch to get the link to come back? 
I thought there was a warning in the man page or in the help regarding this situation but I don't see it now. Also, this becomes worse if you disable the port the SM is on. (Which I see you are doing.) So you will have a noticeable delay while the SM rescans the network which it is now seeing "again" for the first time. BTW, What are you trying to achieve with this command? Hope this helps, Ira From hal.rosenstock at gmail.com Fri Apr 17 10:49:59 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Fri, 17 Apr 2009 13:49:59 -0400 Subject: ***SPAM*** Re: ***SPAM*** Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch In-Reply-To: <20090417104429.666be0d7.weiny2@llnl.gov> References: <20090416175303.2942bf40.weiny2@llnl.gov> <20090417104429.666be0d7.weiny2@llnl.gov> Message-ID: On Fri, Apr 17, 2009 at 1:44 PM, Ira Weiny wrote: > On Fri, 17 Apr 2009 12:06:43 -0500 > Yicheng Jia wrote: > >> Hi Ira, >> >> Here is the output of "iblinkinfo.pl -R": >> >> ++++++++++++++++++++++++++++++++++++++++++++++++ >> [root at ib_manager ~]# iblinkinfo.pl -R >> Switch 0x00066a00d90009c1 InfiniCon System InfinIO 9024 Lite: >>       7    1[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>       6    1[  ] "MT2520 4 InfiniHostLx Mellanox Technologies" (  ) >>            2[  ]  ==( 4X 2.5 Gbps   Down /  Polling)==>             [  ] "" (  ) >>            3[  ]  ==( 4X 2.5 Gbps   Down /  Polling)==>             [  ] "" (  ) >>            4[  ]  ==( 4X 2.5 Gbps   Down /  Polling)==>             [  ] "" (  ) > > [snip] > >> >> And the "ibstat" output: >> +++++++++++++++++++++++++++++++++++ >> [root at ib_manager ~]# ibstat >> CA 'mthca0' >>         CA type: MT25204 >>         Number of ports: 1 >>         Firmware version: 1.2.0 >>         Hardware version: a0 >>         Node GUID: 0x0002c90200230784 >>         System image GUID: 0x0002c90200230787 >>         Port 1: >>                 State: Active >>                 Physical state: LinkUp >>                 Rate: 10 >>     
            Base lid: 6 >>                 LMC: 0 >>                 SM lid: 6 >>                 Capability mask: 0x02500a6a >>                 Port GUID: 0x0002c90200230785 >> ++++++++++++++++++++++++++++++++++++++++++++++ >> >> The reset command I am using is "ibportstate 7 1 reset", I also tried >> "ibportstate -D 0,1 1 reset", and it fails with the same result. >> > > Yea that is going to be a problem.  The problem is that effectively you just > disabled the connection to the switch. That was Nicolas' original point and what I wrote before was wrong: disable really does disable the link and it doesn't come back to init so things are behaving as expected. > A reset disables then enables the > port.  Once the port is disabled the command can't talk to the switch any > longer. Yes, in this configuration, you've shot yourself in the foot. Guess we could check for this case too and not allow it. > You will have to either reset the switch (power cycle) or go to > another node and enable the port. That's usually how its down (from another port so switch connectivity isn't lost). -- Hal > From the output you sent me it looks like > you don't have any other nodes on the switch, so I take it you are resetting > the switch to get the link to come back? > > I thought there was a warning in the man page or in the help regarding this > situation but I don't see it now. > > Also, this becomes worse if you disable the port the SM is on.  (Which I see > you are doing.)  So you will have a noticeable delay while the SM rescans the > network which it is now seeing "again" for the first time. > > BTW, What are you trying to achieve with this command? 
>
> Hope this helps,
> Ira
>
>

From YJia at tmriusa.com Fri Apr 17 13:07:54 2009
From: YJia at tmriusa.com (Yicheng Jia)
Date: Fri, 17 Apr 2009 15:07:54 -0500
Subject: ***SPAM*** Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch
In-Reply-To: <20090417104429.666be0d7.weiny2@llnl.gov>
Message-ID: 

Hi Ira,

> Yea that is going to be a problem. The problem is that effectively you just
> disabled the connection to the switch. A reset disables then enables the
> port. Once the port is disabled the command can't talk to the switch any
> longer. You will have to either reset the switch (power cycle) or go to
> another node and enable the port. From the output you sent me it looks like
> you don't have any other nodes on the switch, so I take it you are resetting
> the switch to get the link to come back?

Thanks for your explanation; now everything is clear. Can I do a "reset" by
down/enable instead of down/disable/enable, so that I can reset the peer port
on the switch?

> I thought there was a warning in the man page or in the help regarding this
> situation but I don't see it now.

There's a warning if I try to reset a port which is not on the switch.

> BTW, What are you trying to achieve with this command?

Sometimes there is a 1x link on the subnet after we reboot our system, which
consists of several HCA nodes directly connected to the switch. By reboot, I
mean restarting each node. I am trying to get a 4x link width by using this
command on the 1x link port. Do you have a better idea for resolving this
problem?

Thanks!
Yicheng Jia Ira Weiny 04/17/2009 12:45 PM To Yicheng Jia cc general at lists.openfabrics.org, Hal Rosenstock Subject Re: ***SPAM*** Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch On Fri, 17 Apr 2009 12:06:43 -0500 Yicheng Jia wrote: > Hi Ira, > > Here is the output of "iblinkinfo.pl -R": > > ++++++++++++++++++++++++++++++++++++++++++++++++ > [root at ib_manager ~]# iblinkinfo.pl -R > Switch 0x00066a00d90009c1 InfiniCon System InfinIO 9024 Lite: > 7 1[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 6 1[ ] "MT2520 4 InfiniHostLx Mellanox Technologies" ( ) > 2[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) > 3[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) > 4[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] "" ( ) [snip] > > And the "ibstat" output: > +++++++++++++++++++++++++++++++++++ > [root at ib_manager ~]# ibstat > CA 'mthca0' > CA type: MT25204 > Number of ports: 1 > Firmware version: 1.2.0 > Hardware version: a0 > Node GUID: 0x0002c90200230784 > System image GUID: 0x0002c90200230787 > Port 1: > State: Active > Physical state: LinkUp > Rate: 10 > Base lid: 6 > LMC: 0 > SM lid: 6 > Capability mask: 0x02500a6a > Port GUID: 0x0002c90200230785 > ++++++++++++++++++++++++++++++++++++++++++++++ > > The reset command I am using is "ibportstate 7 1 reset", I also tried > "ibportstate -D 0,1 1 reset", and it fails with the same result. > Yea that is going to be a problem. The problem is that effectively you just disabled the connection to the switch. A reset disables then enables the port. Once the port is disabled the command can't talk to the switch any longer. You will have to either reset the switch (power cycle) or go to another node and enable the port. From the output you sent me it looks like you don't have any other nodes on the switch, so I take it you are resetting the switch to get the link to come back? I thought there was a warning in the man page or in the help regarding this situation but I don't see it now. 
Also, this becomes worse if you disable the port the SM is on. (Which I see you are doing.) So you will have a noticeable delay while the SM rescans the network which it is now seeing "again" for the first time. BTW, What are you trying to achieve with this command? Hope this helps, Ira _____________________________________________________________________________ Scanned by IBM Email Security Management Services powered by MessageLabs. For more information please visit http://www.ers.ibm.com _____________________________________________________________________________ From sean.hefty at intel.com Fri Apr 17 15:41:10 2009 From: sean.hefty at intel.com (Hefty, Sean) Date: Fri, 17 Apr 2009 15:41:10 -0700 Subject: [ofa-general] [PATCH v3 1/3] Create a new library libibnetdisc In-Reply-To: <20090403154251.dec181f2.weiny2@llnl.gov> References: <20090403154251.dec181f2.weiny2@llnl.gov> Message-ID: >+void >+ibnd_debug(int i) >+{ >+ if (i) { >+ ibdebug++; >+ madrpc_show_errors(1); >+ umad_debug(i); >+ } else { >+ ibdebug = 0; >+ madrpc_show_errors(0); >+ umad_debug(0); >+ } >+} Where does the definition for ibdebug come from? From paran at nsc.liu.se Fri Apr 17 16:59:53 2009 From: paran at nsc.liu.se (Pär Andersson) Date: Sat, 18 Apr 2009 01:59:53 +0200 Subject: [ofa-general] OFED version for RHEL/CentOS 5.3? Message-ID: <49E917F9.9070208@nsc.liu.se> Hello, Our main academic cluster (805 nodes, ConnectX 4x DDR) is currently running CentOS 5.2 with a self-compiled OFED 1.3.1. I am preparing an upgrade of the compute nodes to CentOS 5.3 and am trying to decide what IB stack to use.
All IB packages included in CentOS 5.3 seem to be newer than our current OFED 1.3.1, at least according to the RPM version numbers. Do you know which OFED release the 5.3 packages are based on? The options I am considering are: * Keep our current OFED 1.3.1 packages, just rebuild kernel-ib. * Switch to the packages included in CentOS 5.3. * Build and install OFED 1.4. * Something else that I haven't thought about. Any suggestions or recommendations are welcome. How compatible is 1.4 with 1.3, if we should install that? Will MPI libraries and other applications continue to work or need to be recompiled? Best regards, -- Pär Andersson National Supercomputer Centre, Sweden From sean.hefty at intel.com Fri Apr 17 17:00:07 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 17 Apr 2009 17:00:07 -0700 Subject: [ofa-general] [PATCH v3 0/3] Create a new library libibnetdisc and convert iblinkinfo and ibnetdiscover to that library. In-Reply-To: <20090403154244.a65227b5.weiny2@llnl.gov> References: <20090403154244.a65227b5.weiny2@llnl.gov> Message-ID: <45337B1321EE48F187F84E5D9FC5C305@amr.corp.intel.com> I've completed porting the new library and iblinkinfo to Windows. I'll submit a patch early next week for the few minor changes that are needed. Sasha, if you want to merge Ira's changes into master, that's fine, otherwise, I'll just send a patch against the ibn3 branch. Personally, I'd like to see more of the scripts converted to C for better portability.
- Sean From vlad at lists.openfabrics.org Sat Apr 18 03:25:51 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 18 Apr 2009 03:25:51 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090418-0200 daily build status Message-ID: <20090418102551.67B33E61429@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on 
ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From hnrose at comcast.net Sat Apr 18 04:32:18 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sat, 18 Apr 2009 07:32:18 -0400 Subject: [ofa-general] ***SPAM*** [PATCH] perftest/Makefile: Make rdma_lat and rdma_bw executable names rdma neutral Message-ID: <20090418113218.GA10138@comcast.net> Since rdma_lat and rdma_bw use RDMA CM, they can be used with both IB and iWARP so make their executable names neutral (by removing ib_) Also, IB only tests only require linking with libibverbs Signed-off-by: Hal Rosenstock --- diff --git a/Makefile b/Makefile index 8042531..ad1a40c 100755 --- a/Makefile +++ b/Makefile @@ -1,7 +1,8 @@ -TESTS = write_bw_postlist rdma_lat rdma_bw send_lat send_bw write_lat write_bw read_lat read_bw +RDMACM_TESTS = rdma_lat rdma_bw +TESTS = write_bw_postlist send_lat send_bw write_lat write_bw read_lat read_bw UTILS = clock_test -all: ${TESTS} ${UTILS} +all: ${RDMACM_TESTS} ${TESTS} ${UTILS} CFLAGS += -Wall -g -D_GNU_SOURCE -O2 EXTRA_FILES = get_clock.c @@ -10,11 +11,15 @@ EXTRA_HEADERS = get_clock.h LOADLIBES += LDFLAGS += -${TESTS}: LOADLIBES += -libverbs -lrdmacm +${RDMACM_TESTS} ${UTILS}: LOADLIBES += -libverbs -lrdmacm +${TESTS}: LOADLIBES += -libverbs -${TESTS} ${UTILS}: %: %.c ${EXTRA_FILES} ${EXTRA_HEADERS} +${RDMACM_TESTS} ${UTILS}: %: %.c ${EXTRA_FILES} ${EXTRA_HEADERS} + $(CC) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) $< ${EXTRA_FILES} $(LOADLIBES) $(LDLIBS) -o $@ +${TESTS}: %: %.c ${EXTRA_FILES} ${EXTRA_HEADERS} $(CC) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) $< ${EXTRA_FILES} $(LOADLIBES) $(LDLIBS) -o ib_$@ clean: - $(foreach fname,${TESTS} ${UTILS}, rm -f ib_${fname}) + $(foreach fname,${RDMACM_TESTS} ${UTILS}, rm -f ${fname}) + $(foreach 
fname,${TESTS}, rm -f ib_${fname}) .DELETE_ON_ERROR: .PHONY: all clean From hnrose at comcast.net Sat Apr 18 04:43:47 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sat, 18 Apr 2009 07:43:47 -0400 Subject: [ofa-general] [PATCH] perftest/pertest.spec: Spec file change for executable name changes Message-ID: <20090418114347.GA10271@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/perftest.spec b/perftest.spec index bd234e1..a817589 100755 --- a/perftest.spec +++ b/perftest.spec @@ -23,8 +23,8 @@ export CFLAGS="$RPM_OPT_FLAGS" chmod -x runme %install -install -D -m 0755 ib_rdma_lat $RPM_BUILD_ROOT%{_bindir}/ib_rdma_lat -install -D -m 0755 ib_rdma_bw $RPM_BUILD_ROOT%{_bindir}/ib_rdma_bw +install -D -m 0755 rdma_lat $RPM_BUILD_ROOT%{_bindir}/rdma_lat +install -D -m 0755 rdma_bw $RPM_BUILD_ROOT%{_bindir}/rdma_bw install -D -m 0755 ib_write_lat $RPM_BUILD_ROOT%{_bindir}/ib_write_lat install -D -m 0755 ib_write_bw $RPM_BUILD_ROOT%{_bindir}/ib_write_bw install -D -m 0755 ib_send_lat $RPM_BUILD_ROOT%{_bindir}/ib_send_lat @@ -32,7 +32,7 @@ install -D -m 0755 ib_send_bw $RPM_BUILD_ROOT%{_bindir}/ib_send_bw install -D -m 0755 ib_read_lat $RPM_BUILD_ROOT%{_bindir}/ib_read_lat install -D -m 0755 ib_read_bw $RPM_BUILD_ROOT%{_bindir}/ib_read_bw install -D -m 0755 ib_write_bw_postlist $RPM_BUILD_ROOT%{_bindir}/ib_write_bw_postlist -install -D -m 0755 ib_clock_test $RPM_BUILD_ROOT%{_bindir}/ib_clock_test +install -D -m 0755 clock_test $RPM_BUILD_ROOT%{_bindir}/clock_test %clean rm -rf ${RPM_BUILD_ROOT} @@ -43,6 +43,8 @@ rm -rf ${RPM_BUILD_ROOT} %_bindir/* %changelog +* Sat Apr 18 2009 - hal.rosenstock at gmail.com +- Change executable names for rdma_lat, rdma_bw, and clock_test * Mon Jul 09 2007 - hvogel at suse.de - Use correct version * Wed Jul 04 2007 - hvogel at suse.de From hnrose at comcast.net Sat Apr 18 06:11:36 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sat, 18 Apr 2009 09:11:36 -0400 Subject: [ofa-general] ***SPAM*** [PATCH] 
infiniband-diags/ibsendtrap.c: Set producer type according to node type Message-ID: <20090418131136.GA13355@comcast.net> rather than assuming CA Signed-off-by: Hal Rosenstock --- diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c index d0afca0..b848c0f 100644 --- a/infiniband-diags/src/ibsendtrap.c +++ b/infiniband-diags/src/ibsendtrap.c @@ -1,6 +1,7 @@ /* * Copyright (c) 2008 Lawrence Livermore National Security * Copyright (c) 2009 Voltaire Inc. All rights reserved. + * Copyright (c) 2009 HNR Consulting. All rights reserved. * * Produced at Lawrence Livermore National Laboratory. * Written by Ira Weiny . @@ -52,32 +53,42 @@ struct ibmad_port *srcport; /* for local link integrity */ int error_port = 1; -static void build_trap144(ib_mad_notice_attr_t * n, uint16_t lid) +static int get_node_type(ib_portid_t *port) +{ + int node_type = IB_NODE_TYPE_CA; + uint8_t data[IB_SMP_DATA_SIZE]; + + if (smp_query_via(data, port, IB_ATTR_NODE_INFO, 0, 0, srcport)) + node_type = mad_get_field(data, 0, IB_NODE_TYPE_F); + return node_type; +} + +static void build_trap144(ib_mad_notice_attr_t * n, ib_portid_t *port) { n->generic_type = 0x80 | IB_NOTICE_TYPE_INFO; - n->g_or_v.generic.prod_type_lsb = cl_hton16(IB_NODE_TYPE_CA); + n->g_or_v.generic.prod_type_lsb = cl_hton16(get_node_type(port)); n->g_or_v.generic.trap_num = cl_hton16(144); - n->issuer_lid = cl_hton16(lid); - n->data_details.ntc_144.lid = cl_hton16(lid); + n->issuer_lid = cl_hton16(port->lid); + n->data_details.ntc_144.lid = cl_hton16(port->lid); n->data_details.ntc_144.local_changes = TRAP_144_MASK_OTHER_LOCAL_CHANGES; n->data_details.ntc_144.change_flgs = TRAP_144_MASK_NODE_DESCRIPTION_CHANGE; } -static void build_trap129(ib_mad_notice_attr_t * n, uint16_t lid) +static void build_trap129(ib_mad_notice_attr_t * n, ib_portid_t *port) { n->generic_type = 0x80 | IB_NOTICE_TYPE_URGENT; - n->g_or_v.generic.prod_type_lsb = cl_hton16(IB_NODE_TYPE_CA); + n->g_or_v.generic.prod_type_lsb = 
cl_hton16(get_node_type(port)); n->g_or_v.generic.trap_num = cl_hton16(129); - n->issuer_lid = cl_hton16(lid); - n->data_details.ntc_129_131.lid = cl_hton16(lid); + n->issuer_lid = cl_hton16(port->lid); + n->data_details.ntc_129_131.lid = cl_hton16(port->lid); n->data_details.ntc_129_131.pad = 0; n->data_details.ntc_129_131.port_num = error_port; } static int send_trap(const char *name, - void (*build) (ib_mad_notice_attr_t *, uint16_t)) + void (*build) (ib_mad_notice_attr_t *, ib_portid_t *)) { ib_portid_t sm_port; ib_portid_t selfportid; @@ -100,14 +111,14 @@ static int send_trap(const char *name, trap_rpc.dataoffs = IB_SMP_DATA_OFFS; memset(&notice, 0, sizeof(notice)); - build(&notice, selfportid.lid); + build(&notice, &selfportid); return mad_send_via(&trap_rpc, &sm_port, NULL, &notice, srcport); } typedef struct _trap_def { char *trap_name; - void (*build_func) (ib_mad_notice_attr_t *, uint16_t); + void (*build_func) (ib_mad_notice_attr_t *, ib_portid_t *); } trap_def_t; trap_def_t traps[3] = { From hnrose at comcast.net Sat Apr 18 06:23:38 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sat, 18 Apr 2009 09:23:38 -0400 Subject: [ofa-general] ***SPAM*** [PATCH] perftest: Remove unneeded executable permissions Message-ID: <20090418132338.GA13505@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/COPYING b/COPYING old mode 100644 new mode 100755 diff --git a/Makefile b/Makefile old mode 100644 new mode 100755 diff --git a/README b/README old mode 100644 new mode 100755 diff --git a/clock_test.c b/clock_test.c old mode 100644 new mode 100755 diff --git a/get_clock.c b/get_clock.c old mode 100644 new mode 100755 diff --git a/get_clock.h b/get_clock.h old mode 100644 new mode 100755 diff --git a/perftest.spec b/perftest.spec old mode 100644 new mode 100755 diff --git a/rdma_bw.c b/rdma_bw.c old mode 100644 new mode 100755 diff --git a/rdma_lat.c b/rdma_lat.c old mode 100644 new mode 100755 diff --git a/read_bw.c b/read_bw.c old mode 100644 new mode 100755 diff
--git a/read_lat.c b/read_lat.c old mode 100644 new mode 100755 diff --git a/send_bw.c b/send_bw.c old mode 100644 new mode 100755 diff --git a/send_lat.c b/send_lat.c old mode 100644 new mode 100755 diff --git a/write_bw.c b/write_bw.c old mode 100644 new mode 100755 diff --git a/write_bw_postlist.c b/write_bw_postlist.c old mode 100644 new mode 100755 diff --git a/write_lat.c b/write_lat.c old mode 100644 new mode 100755 From dennis.portello at gmail.com Sat Apr 18 09:57:14 2009 From: dennis.portello at gmail.com (Dennis Portello) Date: Sat, 18 Apr 2009 12:57:14 -0400 Subject: [ofa-general] ***SPAM*** Issues with multicast on bonded IPoIB interfaces Message-ID: <52436c7f0904180957x59e933o510541a9f4ef51b4@mail.gmail.com> Hello, I've run into some issues with IPoIB bonding when attempting multicast communication. I'm not using the ofa-kernel package, but the modules in 2.6.27 kernel including the bonding module. For the most part, everything has worked so far with a few minor adjustments for Debian/Ubuntu. Any assistance or suggestions would be greatly appreciated. I've already done extensive research on this issue through the mailing list archives, but none of the previous suggestions fit or work. Using: SilverStorm 9024 managed switches SuperMicro SuperServer 6015TW-TB QLogic 7104-HCA-128LPX-DDR dual port IB cards Ubuntu 8.10 (2.6.27-11-server) Using OFED 1.4.0 (Guy Coats debian packages) IPoIB Multicast not working for bonded interfaces. The above works when tested against single IB ports, single Ethernet ports, and bonded Ethernet ports, but not on bonded IB ports I've tried disabling IPv6 by blacklisting ipv6 in /etc/modprobe.d/blacklist Unicast TCP/IP seems to work just fine. 
I've tried setting mode to datagram and changing the MTU so it is lower than 2044 Used the ib-bond script; also added the bond directly with echo +bond3 > /sys/class/net/bonding_masters echo 1 > /sys/class/net/bond3/bonding/mode echo 100 > /sys/class/net/bond3/bonding/miimon echo +ib0 > /sys/class/net/bond3/bonding/slaves echo +ib1 > /sys/class/net/bond3/bonding/slaves ifconfig bond3 192.168.47.100/24 route add -net 224.0.0.0/3 gw 192.168.47.100 socat STDIO UDP4-DATAGRAM:224.1.0.1:6666,bind=:6666,range=192.168.47.0/24,ip-add-membership=224.1.0.1:192.168.47.100 (socat is used to send multicast traffic; it's configured as peer-to-peer) echo +bond3 > /sys/class/net/bonding_masters echo 1 > /sys/class/net/bond3/bonding/mode echo 100 > /sys/class/net/bond3/bonding/miimon echo +ib0 > /sys/class/net/bond3/bonding/slaves echo +ib1 > /sys/class/net/bond3/bonding/slaves ifconfig bond3 192.168.47.102/24 route add -net 224.0.0.0/3 gw 192.168.47.102 socat STDIO UDP4-DATAGRAM:224.1.0.1:6666,bind=:6666,range=192.168.47.0/24,ip-add-membership=224.1.0.1:192.168.47.102 === dmesg reports === [45784.497143] bonding: bond3 is being created... [45784.498401] bonding: bond3: setting mode to active-backup (1). [45784.498434] bonding: bond3: Setting MII monitoring interval to 100. [45784.511087] bonding: bond3: doing slave updates when interface is down. [45784.511095] bonding: bond3: Adding slave ib0. [45784.511099] bonding bond3: master_dev is not up in bond_enslave [45784.511100] bonding: bond3: Warning: enslaved VLAN challenged slave ib0. Adding VLANs will be blocked as long as ib0 is part of bond bond3 [45784.511103] bonding: bond3: Warning: The first slave device specified does not support setting the MAC address. Setting fail_over_mac to active.<6>bonding: bond3: enslaving ib0 as a backup interface with a down link. [45784.557785] bonding: bond3: doing slave updates when interface is down. [45784.557794] bonding: bond3: Adding slave ib1.
[45784.557797] bonding bond3: master_dev is not up in bond_enslave [45784.557798] bonding: bond3: Warning: enslaved VLAN challenged slave ib1. Adding VLANs will be blocked as long as ib1 is part of bond bond3 [45784.604880] bonding: bond3: enslaving ib1 as a backup interface with a down link. [45784.607960] ib0: mtu > 2044 will cause multicast packet drops. [45784.609909] ib1: mtu > 2044 will cause multicast packet drops. [45784.613484] ADDRCONF(NETDEV_UP): bond3: link is not ready [45784.613498] bonding: bond3: link status definitely up for interface ib0. [45784.613501] bonding: bond3: making interface ib0 the new active one. [45784.613524] bonding: bond3: first active interface up! [45784.613527] bonding: bond3: link status definitely up for interface ib1. [45784.615315] ADDRCONF(NETDEV_CHANGE): bond3: link becomes ready [45784.616065] ib0: multicast join failed for 0001:0000:0000:0000:0000:0000:0000:0000, status -22 [45784.616176] ib0: multicast join failed for 0001:0000:0000:0000:0000:0000:0000:0000, status -22 [45786.610476] ib0: multicast join failed for 0001:0000:0000:0000:0000:0000:0000:0000, status -22 [45794.610301] ib0: multicast join failed for 0001:0000:0000:0000:0000:0000:0000:0000, status -22 [45795.113757] bond3: no IPv6 routers present [45810.610409] ib0: multicast join failed for 0001:0000:0000:0000:0000:0000:0000:0000, status -22 [45826.610284] ib0: multicast join failed for 0001:0000:0000:0000:0000:0000:0000:0000, status -22 === my network config /etc/network/interfaces === auto lo iface lo inet loopback auto eth0 iface eth0 inet manual auto eth1 iface eth1 inet manual # auto eth0.45 # iface eth0.45 inet manual # vlan_raw_device eth0 # # auto eth1.45 # iface eth1.45 inet manual # vlan_raw_device eth1 # # auto eth0.46 # iface eth0.46 inet manual # vlan_raw_device eth0 # # auto eth1.46 # iface eth1.46 inet manual # vlan_raw_device eth1 # auto bond0 # iface bond0 inet manual # slaves eth0.45 eth1.45 # bond_miimon 100 # bond_mode active-backup # 
pre-up ifup eth0.45 eth1.45 # post-down ifdown eth0.45 eth1.45 auto bond0 iface bond0 inet manual slaves eth0 eth1 bond_miimon 100 bond_mode active-backup pre-up ifup eth0 eth1 post-down ifdown eth0 eth1 # auto bond1 # iface bond1 inet manual # slaves eth0.46 eth1.46 # bond_miimon 100 # bond_mode active-backup # pre-up ifup eth0.46 eth1.46 # post-down ifdown eth0.46 eth1.46 auto vmbr0 iface vmbr0 inet static address 192.168.45.100 netmask 255.255.255.0 network 192.168.45.0 gateway 192.168.45.7 broadcast 192.168.45.255 bridge_ports bond0 bridge_fd 9 bridge_hello 2 bridge_maxage 12 bridge_stp off # auto vmbr1 # iface vmbr1 inet static # address 192.168.46.100 # netmask 255.255.252.0 # network 192.168.46.0 # gateway 192.168.46.7 # broadcast 192.168.46.255 # bridge_ports bond1 # bridge_fd 9 # bridge_hello 2 # bridge_maxage 12 # bridge_stp off auto ib0 iface ib0 inet manual pre-up echo connected > /sys/class/net/$IFACE/mode pre-up echo 65520 > /sys/class/net/$IFACE/mtu auto ib1 iface ib1 inet manual pre-up echo connected > /sys/class/net/$IFACE/mode pre-up echo 65520 > /sys/class/net/$IFACE/mtu auto bond3 iface bond3 inet manual up ib-bond --bond-name bond3 --bond-ip 192.168.47.100/24 --slaves ib0,ib1 up echo 65520 > /sys/class/net/$IFACE/mtu down ib-bond --bond-name bond3 --stop === ifconfig -a reports === bond0 Link encap:Ethernet HWaddr 00:30:48:c6:76:1c inet6 addr: fe80::230:48ff:fec6:761c/64 Scope:Link UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1 RX packets:34656 errors:0 dropped:0 overruns:0 frame:0 TX packets:6411 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:2604958 (2.6 MB) TX bytes:3494145 (3.4 MB) bond3 Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 inet addr:192.168.47.102 Bcast:192.168.47.255 Mask:255.255.255.0 inet6 addr: fe80::206:6a00:a000:f77d/64 Scope:Link UP BROADCAST RUNNING MASTER MULTICAST MTU:65520 Metric:1 RX packets:2 errors:0 dropped:0 overruns:0 frame:0 TX packets:5 errors:0 
dropped:130 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:112 (112.0 B) TX bytes:324 (324.0 B) eth0 Link encap:Ethernet HWaddr 00:30:48:c6:76:1c UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 RX packets:34656 errors:0 dropped:0 overruns:0 frame:0 TX packets:6411 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:2604958 (2.6 MB) TX bytes:3494145 (3.4 MB) Memory:d8220000-d8240000 eth1 Link encap:Ethernet HWaddr 00:30:48:c6:76:1c UP BROADCAST SLAVE MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) Memory:d8260000-d8280000 ib0 Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 UP BROADCAST RUNNING SLAVE MULTICAST MTU:65520 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:2 errors:0 dropped:123 overruns:0 carrier:0 collisions:0 txqueuelen:256 RX bytes:0 (0.0 B) TX bytes:136 (136.0 B) ib1 Link encap:UNSPEC HWaddr 80-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00 UP BROADCAST RUNNING SLAVE MULTICAST MTU:65520 Metric:1 RX packets:2 errors:0 dropped:0 overruns:0 frame:0 TX packets:3 errors:0 dropped:7 overruns:0 carrier:0 collisions:0 txqueuelen:256 RX bytes:112 (112.0 B) TX bytes:188 (188.0 B) lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:4 errors:0 dropped:0 overruns:0 frame:0 TX packets:4 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:352 (352.0 B) TX bytes:352 (352.0 B) vmbr0 Link encap:Ethernet HWaddr 00:30:48:c6:76:1c inet addr:192.168.45.102 Bcast:192.168.45.255 Mask:255.255.255.0 inet6 addr: fe80::230:48ff:fec6:761c/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:11605 errors:0 dropped:0 overruns:0 frame:0 TX packets:5638 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 
txqueuelen:0 RX bytes:1057952 (1.0 MB) TX bytes:3439987 (3.4 MB) vnet0 Link encap:Ethernet HWaddr 8a:c0:6b:6a:db:5a inet addr:192.168.122.1 Bcast:192.168.122.255 Mask:255.255.255.0 inet6 addr: fe80::88c0:6bff:fe6a:db5a/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:48 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:0 (0.0 B) TX bytes:6274 (6.2 KB) Best regards, Dennis Portello -------------- next part -------------- An HTML attachment was scrubbed... URL: From dennis.portello at gmail.com Sat Apr 18 10:16:10 2009 From: dennis.portello at gmail.com (Dennis Portello) Date: Sat, 18 Apr 2009 13:16:10 -0400 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Issues with multicast on bonded IPoIB interfaces In-Reply-To: <52436c7f0904180957x59e933o510541a9f4ef51b4@mail.gmail.com> References: <52436c7f0904180957x59e933o510541a9f4ef51b4@mail.gmail.com> Message-ID: <52436c7f0904181016s55a4beealbf74e1f0764e98e4@mail.gmail.com> More information... I also wanted to note that I did a verbose tcpdump on the receiving end and didn't see anything on ib0 or ib1 that looked out of the ordinary. More info on the software stack... I'm using Roland libibverbs packages available in the Ubuntu 8.10 repository. Best Regards, Dennis Portello On Sat, Apr 18, 2009 at 12:57 PM, Dennis Portello wrote: > Hello, > > I've run into some issues with IPoIB bonding when attempting multicast > communication. I'm not using the ofa-kernel package, but the modules in > 2.6.27 kernel including the bonding module. For the most part, everything > has worked so far with a few minor adjustments for Debian/Ubuntu. > > Any assistance or suggestions would be greatly appreciated. I've already > done extensive research on this issue through the mailing list archives, but > none of the previous suggestions fit or work. 
> > Using: > SilverStorm 9024 managed switches > SuperMicro SuperServer 6015TW-TB > QLogic 7104-HCA-128LPX-DDR dual port IB cards > Ubuntu 8.10 (2.6.27-11-server) > Using OFED 1.4.0 (Guy Coats debian packages) > > IPoIB Multicast not working for bonded interfaces. > > The above works when tested against single IB ports, single Ethernet ports, > and bonded Ethernet ports, but not on bonded IB ports > > I've tried disabling IPv6 by blacklisting ipv6 in /etc/modprobe.d/blacklist > > Unicast TCP/IP seems to work just fine. > > I've tried setting mode to datagram and changing the MTU so it is lower > 2044 > > Use ib-bond script also added bond directly with > > echo +bond3 > /sys/class/net/bonding_masters > echo 1 > /sys/class/net/bond3/bonding/mode > echo 100 > /sys/class/net/bond3/bonding/miimon > echo +ib0 > /sys/class/net/bond3/bonding/slaves > echo +ib1 > /sys/class/net/bond3/bonding/slaves > ifconfig bond3 192.168.47.100/24 > route add -net 224.0.0.0/3 gw 192.168.47.100 > socat STDIO UDP4-DATAGRAM:224.1.0.1:6666,bind=:6666,range= > 192.168.47.0/24,ip-add-membership=224.1.0.1:192.168.47.100 > (socat is used to send multicast traffic, it's configure as peer-peer) > > echo +bond3 > /sys/class/net/bonding_masters > echo 1 > /sys/class/net/bond3/bonding/mode > echo 100 > /sys/class/net/bond3/bonding/miimon > echo +ib0 > /sys/class/net/bond3/bonding/slaves > echo +ib1 > /sys/class/net/bond3/bonding/slaves > ifconfig bond3 192.168.47.102/24 > route add -net 224.0.0.0/3 gw 192.168.47.102 > socat STDIO UDP4-DATAGRAM:224.1.0.1:6666,bind=:6666,range= > 192.168.47.0/24,ip-add-membership=224.1.0.1:192.168.47.102 > > === dmesg reports === > [45784.497143] bonding: bond3 is being created... > [45784.498401] bonding: bond3: setting mode to active-backup (1). > [45784.498434] bonding: bond3: Setting MII monitoring interval to 100. > [45784.511087] bonding: bond3: doing slave updates when interface is down. > [45784.511095] bonding: bond3: Adding slave ib0. 
> [45784.511099] bonding bond3: master_dev is not up in bond_enslave > [45784.511100] bonding: bond3: Warning: enslaved VLAN challenged slave ib0. > Adding VLANs will be blocked as long as ib0 is part of bond bond3 > [45784.511103] bonding: bond3: Warning: The first slave device specified > does not support setting the MAC address. Setting fail_over_mac to > active.<6>bonding: bond3: enslaving ib0 as a backup interface with a down > link. > [45784.557785] bonding: bond3: doing slave updates when interface is down. > [45784.557794] bonding: bond3: Adding slave ib1. > [45784.557797] bonding bond3: master_dev is not up in bond_enslave > [45784.557798] bonding: bond3: Warning: enslaved VLAN challenged slave ib1. > Adding VLANs will be blocked as long as ib1 is part of bond bond3 > [45784.604880] bonding: bond3: enslaving ib1 as a backup interface with a > down link. > [45784.607960] ib0: mtu > 2044 will cause multicast packet drops. > [45784.609909] ib1: mtu > 2044 will cause multicast packet drops. > [45784.613484] ADDRCONF(NETDEV_UP): bond3: link is not ready > [45784.613498] bonding: bond3: link status definitely up for interface ib0. > [45784.613501] bonding: bond3: making interface ib0 the new active one. > [45784.613524] bonding: bond3: first active interface up! > [45784.613527] bonding: bond3: link status definitely up for interface ib1. 
> [45784.615315] ADDRCONF(NETDEV_CHANGE): bond3: link becomes ready > [45784.616065] ib0: multicast join failed for > 0001:0000:0000:0000:0000:0000:0000:0000, status -22 > [45784.616176] ib0: multicast join failed for > 0001:0000:0000:0000:0000:0000:0000:0000, status -22 > [45786.610476] ib0: multicast join failed for > 0001:0000:0000:0000:0000:0000:0000:0000, status -22 > [45794.610301] ib0: multicast join failed for > 0001:0000:0000:0000:0000:0000:0000:0000, status -22 > [45795.113757] bond3: no IPv6 routers present > [45810.610409] ib0: multicast join failed for > 0001:0000:0000:0000:0000:0000:0000:0000, status -22 > [45826.610284] ib0: multicast join failed for > 0001:0000:0000:0000:0000:0000:0000:0000, status -22 > > === my network config /etc/network/interfaces === > > auto lo > iface lo inet loopback > > auto eth0 > iface eth0 inet manual > > auto eth1 > iface eth1 inet manual > > # auto eth0.45 > # iface eth0.45 inet manual > # vlan_raw_device eth0 > # > # auto eth1.45 > # iface eth1.45 inet manual > # vlan_raw_device eth1 > # > # auto eth0.46 > # iface eth0.46 inet manual > # vlan_raw_device eth0 > # > # auto eth1.46 > # iface eth1.46 inet manual > # vlan_raw_device eth1 > > # auto bond0 > # iface bond0 inet manual > # slaves eth0.45 eth1.45 > # bond_miimon 100 > # bond_mode active-backup > # pre-up ifup eth0.45 eth1.45 > # post-down ifdown eth0.45 eth1.45 > > auto bond0 > iface bond0 inet manual > slaves eth0 eth1 > bond_miimon 100 > bond_mode active-backup > pre-up ifup eth0 eth1 > post-down ifdown eth0 eth1 > > # auto bond1 > # iface bond1 inet manual > # slaves eth0.46 eth1.46 > # bond_miimon 100 > # bond_mode active-backup > # pre-up ifup eth0.46 eth1.46 > # post-down ifdown eth0.46 eth1.46 > > auto vmbr0 > iface vmbr0 inet static > address 192.168.45.100 > netmask 255.255.255.0 > network 192.168.45.0 > gateway 192.168.45.7 > broadcast 192.168.45.255 > bridge_ports bond0 > bridge_fd 9 > bridge_hello 2 > bridge_maxage 12 > bridge_stp off > > # auto 
vmbr1 > # iface vmbr1 inet static > # address 192.168.46.100 > # netmask 255.255.252.0 > # network 192.168.46.0 > # gateway 192.168.46.7 > # broadcast 192.168.46.255 > # bridge_ports bond1 > # bridge_fd 9 > # bridge_hello 2 > # bridge_maxage 12 > # bridge_stp off > > auto ib0 > iface ib0 inet manual > pre-up echo connected > /sys/class/net/$IFACE/mode > pre-up echo 65520 > /sys/class/net/$IFACE/mtu > > auto ib1 > iface ib1 inet manual > pre-up echo connected > /sys/class/net/$IFACE/mode > pre-up echo 65520 > /sys/class/net/$IFACE/mtu > > auto bond3 > iface bond3 inet manual > up ib-bond --bond-name bond3 --bond-ip 192.168.47.100/24 --slaves ib0,ib1 > up echo 65520 > /sys/class/net/$IFACE/mtu > down ib-bond --bond-name bond3 --stop > > === ifconfig -a reports === > bond0 Link encap:Ethernet HWaddr 00:30:48:c6:76:1c > inet6 addr: fe80::230:48ff:fec6:761c/64 Scope:Link > UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1 > RX packets:34656 errors:0 dropped:0 overruns:0 frame:0 > TX packets:6411 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:2604958 (2.6 MB) TX bytes:3494145 (3.4 MB) > > bond3 Link encap:UNSPEC HWaddr > 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 > inet addr:192.168.47.102 Bcast:192.168.47.255 > Mask:255.255.255.0 > inet6 addr: fe80::206:6a00:a000:f77d/64 Scope:Link > UP BROADCAST RUNNING MASTER MULTICAST MTU:65520 Metric:1 > RX packets:2 errors:0 dropped:0 overruns:0 frame:0 > TX packets:5 errors:0 dropped:130 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:112 (112.0 B) TX bytes:324 (324.0 B) > > eth0 Link encap:Ethernet HWaddr 00:30:48:c6:76:1c > UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 > RX packets:34656 errors:0 dropped:0 overruns:0 frame:0 > TX packets:6411 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:2604958 (2.6 MB) TX bytes:3494145 (3.4 MB) > Memory:d8220000-d8240000 > > eth1 Link encap:Ethernet HWaddr 00:30:48:c6:76:1c > UP BROADCAST SLAVE 
MULTICAST MTU:1500 Metric:1 > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:1000 > RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) > Memory:d8260000-d8280000 > > ib0 Link encap:UNSPEC HWaddr > 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00 > UP BROADCAST RUNNING SLAVE MULTICAST MTU:65520 Metric:1 > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > TX packets:2 errors:0 dropped:123 overruns:0 carrier:0 > collisions:0 txqueuelen:256 > RX bytes:0 (0.0 B) TX bytes:136 (136.0 B) > > ib1 Link encap:UNSPEC HWaddr > 80-00-04-05-FE-80-00-00-00-00-00-00-00-00-00-00 > UP BROADCAST RUNNING SLAVE MULTICAST MTU:65520 Metric:1 > RX packets:2 errors:0 dropped:0 overruns:0 frame:0 > TX packets:3 errors:0 dropped:7 overruns:0 carrier:0 > collisions:0 txqueuelen:256 > RX bytes:112 (112.0 B) TX bytes:188 (188.0 B) > > lo Link encap:Local Loopback > inet addr:127.0.0.1 Mask:255.0.0.0 > inet6 addr: ::1/128 Scope:Host > UP LOOPBACK RUNNING MTU:16436 Metric:1 > RX packets:4 errors:0 dropped:0 overruns:0 frame:0 > TX packets:4 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:352 (352.0 B) TX bytes:352 (352.0 B) > > vmbr0 Link encap:Ethernet HWaddr 00:30:48:c6:76:1c > inet addr:192.168.45.102 Bcast:192.168.45.255 > Mask:255.255.255.0 > inet6 addr: fe80::230:48ff:fec6:761c/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:11605 errors:0 dropped:0 overruns:0 frame:0 > TX packets:5638 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:1057952 (1.0 MB) TX bytes:3439987 (3.4 MB) > > vnet0 Link encap:Ethernet HWaddr 8a:c0:6b:6a:db:5a > inet addr:192.168.122.1 Bcast:192.168.122.255 > Mask:255.255.255.0 > inet6 addr: fe80::88c0:6bff:fe6a:db5a/64 Scope:Link > UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 > RX packets:0 errors:0 dropped:0 overruns:0 frame:0 > TX packets:48 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 
txqueuelen:0 > RX bytes:0 (0.0 B) TX bytes:6274 (6.2 KB) > > Best regards, > Dennis Portello > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sashak at voltaire.com Sat Apr 18 23:00:55 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 19 Apr 2009 09:00:55 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] infiniband-diags/ibsendtrap.c: Set producer type according to node type In-Reply-To: <20090418131136.GA13355@comcast.net> References: <20090418131136.GA13355@comcast.net> Message-ID: <20090419060055.GD5922@sk> On 09:11 Sat 18 Apr , Hal Rosenstock wrote: > > rather than assuming CA > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Sat Apr 18 23:04:32 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 19 Apr 2009 09:04:32 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] opensm/osm_helper.c: Convert remaining helper routines for GID printing format In-Reply-To: <20090417151139.GA7537@comcast.net> References: <20090417151139.GA7537@comcast.net> Message-ID: <20090419060432.GE5922@sk> On 11:11 Fri 17 Apr , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Sat Apr 18 23:06:08 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 19 Apr 2009 09:06:08 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] opensm: Some cosmetic formatting changes In-Reply-To: <20090417151704.GB7875@comcast.net> References: <20090417151704.GB7875@comcast.net> Message-ID: <20090419060608.GF5922@sk> On 11:17 Fri 17 Apr , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From sashak at voltaire.com Sat Apr 18 23:14:41 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 19 Apr 2009 09:14:41 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] opensm: Eliminate duplicated calls to osm_log_is_active in SA modules In-Reply-To: <20090417152617.GA8275@comcast.net> References: <20090417152617.GA8275@comcast.net> Message-ID: <20090419061441.GG5922@sk> Hi Hal, On 11:26 Fri 17 Apr , Hal Rosenstock wrote: > > Helper routines call this routine Right, but I left this code as is during the OSM_LOG() speed-up rework, for the same performance reasons: without the function calls it was faster in non-debug mode (I measured it then). I agree that the duplicated checks are somewhat ugly. Replacing the osm_dump_*() functions with macros could probably solve this somehow. BTW, OSM_LOG() has a double check too: one in the macro and another one inside the osm_log() function. Sasha > > Signed-off-by: Hal Rosenstock > --- > diff --git a/opensm/opensm/osm_sa.c b/opensm/opensm/osm_sa.c > index 3521132..202a38e 100644 > --- a/opensm/opensm/osm_sa.c > +++ b/opensm/opensm/osm_sa.c > @@ -388,9 +388,7 @@ osm_sa_send_error(IN osm_sa_t * sa, > if (p_resp_sa_mad->attr_id == IB_MAD_ATTR_MULTIPATH_RECORD) > p_resp_sa_mad->attr_id = IB_MAD_ATTR_PATH_RECORD; > > - if (osm_log_is_active(sa->p_log, OSM_LOG_FRAMES)) > - osm_dump_sa_mad(sa->p_log, p_resp_sa_mad, OSM_LOG_FRAMES); > - > + osm_dump_sa_mad(sa->p_log, p_resp_sa_mad, OSM_LOG_FRAMES); > osm_sa_send(sa, p_resp_madw, FALSE); > > Exit: > diff --git a/opensm/opensm/osm_sa_class_port_info.c b/opensm/opensm/osm_sa_class_port_info.c > index d2ab96a..7ad08ef 100644 > --- a/opensm/opensm/osm_sa_class_port_info.c > +++ b/opensm/opensm/osm_sa_class_port_info.c > @@ -165,8 +165,7 @@ static void cpi_rcv_respond(IN osm_sa_t * sa, IN const osm_madw_t * p_madw) > p_resp_cpi->cap_mask |= OSM_CAP_IS_UD_MCAST_SUP; > p_resp_cpi->cap_mask = cl_hton16(p_resp_cpi->cap_mask); > > - if (osm_log_is_active(sa->p_log, OSM_LOG_FRAMES)) > - 
osm_dump_sa_mad(sa->p_log, p_resp_sa_mad, OSM_LOG_FRAMES); > + osm_dump_sa_mad(sa->p_log, p_resp_sa_mad, OSM_LOG_FRAMES); > > osm_sa_send(sa, p_resp_madw, FALSE); > > diff --git a/opensm/opensm/osm_sa_link_record.c b/opensm/opensm/osm_sa_link_record.c > index bf0b5ee..20b94bd 100644 > --- a/opensm/opensm/osm_sa_link_record.c > +++ b/opensm/opensm/osm_sa_link_record.c > @@ -465,8 +465,7 @@ void osm_lr_rcv_process(IN void *context, IN void *data) > goto Exit; > } > > - if (osm_log_is_active(sa->p_log, OSM_LOG_DEBUG)) > - osm_dump_link_record(sa->p_log, p_lr, OSM_LOG_DEBUG); > + osm_dump_link_record(sa->p_log, p_lr, OSM_LOG_DEBUG); > > cl_qlist_init(&lr_list); > > diff --git a/opensm/opensm/osm_sa_mad_ctrl.c b/opensm/opensm/osm_sa_mad_ctrl.c > index eeec51c..a791402 100644 > --- a/opensm/opensm/osm_sa_mad_ctrl.c > +++ b/opensm/opensm/osm_sa_mad_ctrl.c > @@ -315,8 +315,7 @@ static void sa_mad_ctrl_rcv_callback(IN osm_madw_t * p_madw, IN void *context, > > p_sa_mad = osm_madw_get_sa_mad_ptr(p_madw); > > - if (osm_log_is_active(p_ctrl->p_log, OSM_LOG_FRAMES)) > - osm_dump_sa_mad(p_ctrl->p_log, p_sa_mad, OSM_LOG_FRAMES); > + osm_dump_sa_mad(p_ctrl->p_log, p_sa_mad, OSM_LOG_FRAMES); > > /* > * C15-0.1.5 - Table 185: SA Header - p884 > diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c > index 5543221..2cde504 100644 > --- a/opensm/opensm/osm_sa_mcmember_record.c > +++ b/opensm/opensm/osm_sa_mcmember_record.c > @@ -1331,8 +1331,7 @@ static void mcmr_rcv_join_mgrp(IN osm_sa_t * sa, IN osm_madw_t * p_madw) > > } > /* failed to route */ > - if (osm_log_is_active(sa->p_log, OSM_LOG_DEBUG)) > - osm_dump_mc_record(sa->p_log, &mcmember_rec, OSM_LOG_DEBUG); > + osm_dump_mc_record(sa->p_log, &mcmember_rec, OSM_LOG_DEBUG); > > mcmr_rcv_respond(sa, p_madw, &mcmember_rec); > > diff --git a/opensm/opensm/osm_sa_multipath_record.c b/opensm/opensm/osm_sa_multipath_record.c > index 737d892..59bed2b 100644 > --- 
a/opensm/opensm/osm_sa_multipath_record.c > +++ b/opensm/opensm/osm_sa_multipath_record.c > @@ -1454,8 +1454,7 @@ void osm_mpr_rcv_process(IN void *context, IN void *data) > goto Exit; > } > > - if (osm_log_is_active(sa->p_log, OSM_LOG_DEBUG)) > - osm_dump_multipath_record(sa->p_log, p_mpr, OSM_LOG_DEBUG); > + osm_dump_multipath_record(sa->p_log, p_mpr, OSM_LOG_DEBUG); > > cl_qlist_init(&pr_list); > > diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c > index 6e7d5f6..f3146ed 100644 > --- a/opensm/opensm/osm_sa_path_record.c > +++ b/opensm/opensm/osm_sa_path_record.c > @@ -1628,8 +1628,7 @@ void osm_pr_rcv_process(IN void *context, IN void *data) > goto Exit; > } > > - if (osm_log_is_active(sa->p_log, OSM_LOG_DEBUG)) > - osm_dump_path_record(sa->p_log, p_pr, OSM_LOG_DEBUG); > + osm_dump_path_record(sa->p_log, p_pr, OSM_LOG_DEBUG); > > cl_qlist_init(&pr_list); > > diff --git a/opensm/opensm/osm_sa_service_record.c b/opensm/opensm/osm_sa_service_record.c > index b3c39b0..02496c1 100644 > --- a/opensm/opensm/osm_sa_service_record.c > +++ b/opensm/opensm/osm_sa_service_record.c > @@ -475,9 +475,7 @@ static void sr_rcv_process_get_method(osm_sa_t * sa, IN osm_madw_t * p_madw) > p_recvd_service_rec = > (ib_service_record_t *) ib_sa_mad_get_payload_ptr(p_sa_mad); > > - if (osm_log_is_active(sa->p_log, OSM_LOG_DEBUG)) > - osm_dump_service_record(sa->p_log, p_recvd_service_rec, > - OSM_LOG_DEBUG); > + osm_dump_service_record(sa->p_log, p_recvd_service_rec, OSM_LOG_DEBUG); > > cl_qlist_init(&sr_match_item.sr_list); > sr_match_item.p_service_rec = p_recvd_service_rec; > @@ -530,9 +528,7 @@ static void sr_rcv_process_set_method(osm_sa_t * sa, IN osm_madw_t * p_madw) > > comp_mask = p_sa_mad->comp_mask; > > - if (osm_log_is_active(sa->p_log, OSM_LOG_DEBUG)) > - osm_dump_service_record(sa->p_log, p_recvd_service_rec, > - OSM_LOG_DEBUG); > + osm_dump_service_record(sa->p_log, p_recvd_service_rec, OSM_LOG_DEBUG); > > if ((comp_mask & 
(IB_SR_COMPMASK_SID | IB_SR_COMPMASK_SGID)) != > (IB_SR_COMPMASK_SID | IB_SR_COMPMASK_SGID)) { > @@ -634,9 +630,7 @@ static void sr_rcv_process_delete_method(osm_sa_t * sa, IN osm_madw_t * p_madw) > > comp_mask = p_sa_mad->comp_mask; > > - if (osm_log_is_active(sa->p_log, OSM_LOG_DEBUG)) > - osm_dump_service_record(sa->p_log, p_recvd_service_rec, > - OSM_LOG_DEBUG); > + osm_dump_service_record(sa->p_log, p_recvd_service_rec, OSM_LOG_DEBUG); > > /* Grab the lock */ > cl_plock_excl_acquire(sa->p_lock); > diff --git a/opensm/opensm/osm_sa_sminfo_record.c b/opensm/opensm/osm_sa_sminfo_record.c > index 4d454af..9f11c91 100644 > --- a/opensm/opensm/osm_sa_sminfo_record.c > +++ b/opensm/opensm/osm_sa_sminfo_record.c > @@ -216,8 +216,7 @@ void osm_smir_rcv_process(IN void *ctx, IN void *data) > goto Exit; > } > > - if (osm_log_is_active(sa->p_log, OSM_LOG_DEBUG)) > - osm_dump_sm_info_record(sa->p_log, p_rcvd_rec, OSM_LOG_DEBUG); > + osm_dump_sm_info_record(sa->p_log, p_rcvd_rec, OSM_LOG_DEBUG); > > p_smi = &p_rcvd_rec->sm_info; > > diff --git a/opensm/opensm/osm_sa_sw_info_record.c b/opensm/opensm/osm_sa_sw_info_record.c > index 2ea8baf..e6ac7fe 100644 > --- a/opensm/opensm/osm_sa_sw_info_record.c > +++ b/opensm/opensm/osm_sa_sw_info_record.c > @@ -239,9 +239,7 @@ void osm_sir_rcv_process(IN void *ctx, IN void *data) > goto Exit; > } > > - if (osm_log_is_active(sa->p_log, OSM_LOG_DEBUG)) > - osm_dump_switch_info_record(sa->p_log, p_rcvd_rec, > - OSM_LOG_DEBUG); > + osm_dump_switch_info_record(sa->p_log, p_rcvd_rec, OSM_LOG_DEBUG); > > cl_qlist_init(&rec_list); > > From sashak at voltaire.com Sat Apr 18 23:15:45 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 19 Apr 2009 09:15:45 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] opensm/osm_sa_informinfo.c: Eliminate duplicated calls to osm_log_is_active In-Reply-To: <20090417153300.GA8748@comcast.net> References: <20090417153300.GA8748@comcast.net> Message-ID: <20090419061545.GH5922@sk> On 11:33 Fri 17 Apr 
, Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Same comments as for the previous patch. Sasha From dorfman.eli at gmail.com Sun Apr 19 00:52:47 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Sun, 19 Apr 2009 10:52:47 +0300 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Re: [PATCH] ibsim: support xmitwait counters In-Reply-To: <20090414102321.GA5519@sk> References: <49D0B659.2040107@Voltaire.COM> <20090414102321.GA5519@sk> Message-ID: <49EAD84F.9000600@gmail.com> Sasha Khapyorsky wrote: > On 15:08 Mon 30 Mar , Doron Shoham wrote: >> support xmitwait counters >> >> Signed-off-by: Doron Shoham > > Applied. Thanks. > > What are the plans to use it (now unlike other counters XmitWait will > be always zero)? > We have plans to add new commands to the simulator that will increment this counter based on link rate and "sent" bw. Eli From eli at dev.mellanox.co.il Sun Apr 19 00:53:42 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Sun, 19 Apr 2009 10:53:42 +0300 Subject: [ofa-general] Re: [PATCH v3] mlx4_ib: Optimize hugetlab pages support In-Reply-To: <49E3902C.30704@voltaire.com> References: <20090129152725.GA26284@mtls03> <49D0EA11.2040409@Voltaire.COM> <20090402084406.GB21370@mtls03> <20090405065047.GA567@mtls03> <49DA40B7.3040004@voltaire.com> <20090407065955.GA2308@mtls03> <49E3902C.30704@voltaire.com> Message-ID: <20090419075342.GA9148@mtls03> On Mon, Apr 13, 2009 at 10:19:08PM +0300, Yossi Etigin wrote: I see. That surprises me... could you also print vma->vm_start and vma->vm_end? I would have expected the vma to be stretched to fill a full huge page and return the number of regular pages fitting in it. > Eli Cohen wrote: > > On Mon, Apr 06, 2009 at 08:49:43PM +0300, Yossi Etigin wrote: > >> I don't understand - if all area is huge pages, it does not mean that > >> it fills full huge pages - I can have just 4096 bytes in huge page memory > >> and umem->hugetlb will remain 1, right? 
> > > > You may call ib_umem_get() with a fraction of a huge page but I expect > > the number of pages returned from get_user_pages() will fill up a huge > > page. Can you check that with the mckey test you were using? > > The number of pages is 1. > I got this in dmesg with the modified mckey (see the last line): > > umem: addr=508000 size=1024 hugetlb=0 npages=1 > umem: addr=50a000 size=4096 hugetlb=0 npages=1 > umem: addr=50c000 size=4352 hugetlb=0 npages=2 > umem: addr=50f000 size=4096 hugetlb=0 npages=1 > umem: addr=2aaaaac00000 size=140 hugetlb=1 npages=1 > > > After applying this to umem.c: > > --- ofa_kernel-1.4.1/drivers/infiniband/core/umem.c 2009-04-13 22:15:19.000000000 +0300 > +++ ofa_kernel-1.4.1.patched/drivers/infiniband/core/umem.c 2009-04-13 22:09:36.000000000 +0300 > @@ -137,6 +137,7 @@ > int ret; > int off; > int i; > + int ntotalpages; > DEFINE_DMA_ATTRS(attrs); > > if (dmasync) > @@ -196,6 +197,7 @@ > cur_base = addr & PAGE_MASK; > > ret = 0; > + ntotalpages = 0; > while (npages) { > ret = get_user_pages(current, current->mm, cur_base, > min_t(unsigned long, npages, > @@ -226,6 +228,7 @@ > !is_vm_hugetlb_page(vma_list[i + off])) > umem->hugetlb = 0; > sg_set_page(&chunk->page_list[i], page_list[i + off], PAGE_SIZE, 0); > + ntotalpages++; > } > > chunk->nmap = ib_dma_map_sg_attrs(context->device, > @@ -254,8 +257,11 @@ > if (ret < 0) { > __ib_umem_release(context->device, umem, 0); > kfree(umem); > - } else > + } else { > current->mm->locked_vm = locked; > + printk(KERN_DEBUG "umem: addr=%lx size=%ld hugetlb=%d npages=%d\n", > + addr, size, umem->hugetlb, ntotalpages); > + } > > up_write(&current->mm->mmap_sem); > if (vma_list) From arlin.r.davis at intel.com Sun Apr 19 00:55:51 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Sun, 19 Apr 2009 00:55:51 -0700 Subject: [ofa-general] [PATCH] dapl-scm: getsockopt optlen needs initialized to size of optval Message-ID: >From 55459699fa9c0e5fb7e2b17822f0916412c64b35 Mon Sep 17 00:00:00 2001 
From: Arlin Davis Date: Fri, 10 Apr 2009 08:31:22 -0700 Signed-off-by: Arlin Davis --- dapl/openib_scm/dapl_ib_cm.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/dapl/openib_scm/dapl_ib_cm.c b/dapl/openib_scm/dapl_ib_cm.c index 6028a45..a11cd05 100644 --- a/dapl/openib_scm/dapl_ib_cm.c +++ b/dapl/openib_scm/dapl_ib_cm.c @@ -1642,6 +1642,7 @@ void cr_thread(void *arg) } else if (ret == DAPL_FD_WRITE || ret == DAPL_FD_ERROR) { if (cr->state == SCM_CONN_PENDING) { opt = 0; + opt_len = sizeof(opt); ret = getsockopt(cr->socket, SOL_SOCKET, SO_ERROR, (char *) &opt, &opt_len); if (!ret) -- 1.5.2.5 From arlin.r.davis at intel.com Sun Apr 19 00:55:55 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Sun, 19 Apr 2009 00:55:55 -0700 Subject: [ofa-general] [PATCH] reduce wait time for thread startup. Message-ID: thread startup wait reduce to 2ms to reduce open times. Signed-off-by: Arlin Davis --- dapl/openib_scm/dapl_ib_util.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/dapl/openib_scm/dapl_ib_util.c b/dapl/openib_scm/dapl_ib_util.c index 5e371ee..e0c61dd 100644 --- a/dapl/openib_scm/dapl_ib_util.c +++ b/dapl/openib_scm/dapl_ib_util.c @@ -388,7 +388,7 @@ found: /* wait for thread */ while (hca_ptr->ib_trans.cr_state != IB_THREAD_RUN) { - dapl_os_sleep_usec(20000); + dapl_os_sleep_usec(2000); } dapl_dbg_log(DAPL_DBG_TYPE_UTIL, @@ -461,7 +461,7 @@ DAT_RETURN dapls_ib_close_hca ( IN DAPL_HCA *hca_ptr ) dapl_log(DAPL_DBG_TYPE_UTIL, " thread_destroy: thread wakeup err = %s\n", strerror(errno)); - dapl_os_sleep_usec(20000); + dapl_os_sleep_usec(2000); } dapl_os_lock_destroy(&hca_ptr->ib_trans.lock); -- 1.5.2.5 From arlin.r.davis at intel.com Sun Apr 19 00:56:03 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Sun, 19 Apr 2009 00:56:03 -0700 Subject: [ofa-general] [PATCH] fix issue with multi-thread dapltest Message-ID: The dapl connect call takes as input an address (sockaddr) and a port number as separate 
input parameters. It modifies the sockaddr address to set the port number before trying to connect. This leads to a situation in dapltest with multiple threads that reference the same buffer for their address, but specify different port numbers, where the different threads end up trying to connect to the same remote port. To solve this, do not modify the caller's address buffer and instead use a local buffer. This fixes an issue seen running multithreaded tests with dapltest. Signed-off-by: Sean Hefty --- dapl/openib_scm/dapl_ib_cm.c | 12 +++++++----- 1 files changed, 7 insertions(+), 5 deletions(-) diff --git a/dapl/openib_scm/dapl_ib_cm.c b/dapl/openib_scm/dapl_ib_cm.c index 88e65e7..6db2b4a 100644 --- a/dapl/openib_scm/dapl_ib_cm.c +++ b/dapl/openib_scm/dapl_ib_cm.c @@ -436,6 +436,7 @@ dapli_socket_connect(DAPL_EP *ep_ptr, dp_ib_cm_handle_t cm_ptr; int ret; DAPL_IA *ia_ptr = ep_ptr->header.owner_ia; + struct sockaddr_in addr; dapl_dbg_log(DAPL_DBG_TYPE_EP, " connect: r_qual %d p_size=%d\n", r_qual,p_size); @@ -458,14 +459,15 @@ dapli_socket_connect(DAPL_EP *ep_ptr, goto bail; } - ((struct sockaddr_in*)r_addr)->sin_port = htons(r_qual); - ret = dapl_connect_socket(cm_ptr->socket, (struct sockaddr *) r_addr, - sizeof(*r_addr)); + dapl_os_memcpy(&addr, r_addr, sizeof(addr)); + addr.sin_port = htons(r_qual); + ret = dapl_connect_socket(cm_ptr->socket, (struct sockaddr *) &addr, + sizeof(addr)); if (ret && ret != EAGAIN) { dapl_log(DAPL_DBG_TYPE_ERR, " socket connect ERROR: %s -> %s r_qual %d\n", strerror(errno), - inet_ntoa(((struct sockaddr_in *)r_addr)->sin_addr), + inet_ntoa(addr.sin_addr), (unsigned int)r_qual); dapli_cm_destroy(cm_ptr); return DAT_INVALID_ADDRESS; @@ -498,7 +500,7 @@ dapli_socket_connect(DAPL_EP *ep_ptr, dapl_dbg_log(DAPL_DBG_TYPE_EP, " connect: socket %d to %s r_qual %d pending\n", cm_ptr->socket, - inet_ntoa(((struct sockaddr_in *)r_addr)->sin_addr), + inet_ntoa(addr.sin_addr), (unsigned int)r_qual); dapli_cm_queue(cm_ptr); -- 1.5.2.5 From 
arlin.r.davis at intel.com Sun Apr 19 00:59:54 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Sun, 19 Apr 2009 00:59:54 -0700 Subject: [ofa-general] [PATCH] Read all data off internal socket pipe to handle multiple wakeup requests efficiently Message-ID: Communication to the CR thread is done using an internal socket. When a new connection request is ready for processing, an object is placed on the CR list, and data is written to the internal socket. The write causes the CR thread to wake-up and process anything on its cr list. If multiple objects are placed on the CR list around the same time, then the CR thread will read in a single character, but process the entire list. This results in additional data being left on the internal socket. When the CR does a select(), it will find more data to read, read the data, but not have any real work to do. The result is that the thread spins in a loop checking for changes when none have occurred until all data on the internal socket has been read. Avoid this overhead by reading all data off the internal socket before processing the CR list. 
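
For illustration only, the draining idea described above can be sketched in plain POSIX C. This is not the dapl code: the patch itself loops on dapl_poll() and recv(), whereas this sketch makes the read side non-blocking and reads until EAGAIN. The names wakeup_pair_create, wakeup_post and wakeup_drain are hypothetical helpers invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create the internal wakeup socket pair.
 * fds[0] is the read side (made non-blocking), fds[1] the write side.
 * Returns 0 on success, -1 on failure. */
static int wakeup_pair_create(int fds[2])
{
    int flags;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
        return -1;
    flags = fcntl(fds[0], F_GETFL, 0);
    if (flags == -1 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    return 0;
}

/* Hypothetical helper: post one wakeup byte, as a producer would after
 * queuing a work item. Returns 0 on success, -1 on failure. */
static int wakeup_post(int fd)
{
    char c = 'w';

    assert(fd >= 0);
    return write(fd, &c, 1) == 1 ? 0 : -1;
}

/* Consume every pending wakeup byte so that a later select()/poll()
 * does not report stale readiness. Returns the number of bytes
 * consumed, or -1 on a real error. */
static long wakeup_drain(int fd)
{
    char buf[64];
    long total = 0;

    assert(fd >= 0);
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));

        if (n > 0) {
            total += n;
            continue;
        }
        if (n == 0)
            return total;               /* peer closed the socket */
        if (errno == EINTR)
            continue;                   /* interrupted, retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;               /* nothing left to read */
        return -1;
    }
}
```

In this sketch the consumer thread would call wakeup_drain() once after its select() returns and then walk its whole work list, so several posts made close together cost a single pass instead of one spurious wakeup per byte.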
Signed-off-by: Sean Hefty --- dapl/openib_scm/dapl_ib_cm.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/dapl/openib_scm/dapl_ib_cm.c b/dapl/openib_scm/dapl_ib_cm.c index 6db2b4a..6af9cb2 100644 --- a/dapl/openib_scm/dapl_ib_cm.c +++ b/dapl/openib_scm/dapl_ib_cm.c @@ -1677,7 +1677,7 @@ void cr_thread(void *arg) dapl_select(set); /* if pipe used to wakeup, consume */ - if (dapl_poll(g_scm[0], DAPL_FD_READ) == DAPL_FD_READ) { + while (dapl_poll(g_scm[0], DAPL_FD_READ) == DAPL_FD_READ) { if (recv(g_scm[0], rbuf, 2, 0) == -1) dapl_log(DAPL_DBG_TYPE_CM, " cr_thread: read pipe error = %s\n", -- 1.5.2.5 From arlin.r.davis at intel.com Sun Apr 19 01:03:25 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Sun, 19 Apr 2009 01:03:25 -0700 Subject: [ofa-general] [PATCH] uDAPL v2: dat_evd_wait, hold lock when working with rbuf resources Message-ID: Since the removal thread is the user's, but the queuing thread is not, the synchronization must be provided by DAPL. Hold the evd lock around any calls to dapls_rbuf_*. 
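
As a rough sketch of the locking rule stated above, assuming a pthread mutex in place of the evd lock and a small fixed-size ring in place of the dapls_rbuf_* routines: every access to the ring happens with the lock held, and the remaining count is sampled under the same lock as the dequeue. The struct evq type and the evq_* functions are hypothetical and exist only for this example; the real code additionally drops the lock around the blocking wait, which is not shown here.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define EVQ_CAPACITY 8

/* Hypothetical event queue: a ring buffer that is not thread-safe by
 * itself, protected by one lock owned by the queue. */
struct evq {
    pthread_mutex_t lock;
    int slots[EVQ_CAPACITY];
    size_t head;   /* index of the next event to dequeue */
    size_t count;  /* number of events currently queued */
};

static void evq_init(struct evq *q)
{
    assert(q != NULL);
    pthread_mutex_init(&q->lock, NULL);
    q->head = 0;
    q->count = 0;
}

/* Producer side (the provider's thread in the description above).
 * Returns 1 if the event was queued, 0 if the ring is full. */
static int evq_push(struct evq *q, int ev)
{
    int ok = 0;

    assert(q != NULL);
    pthread_mutex_lock(&q->lock);
    if (q->count < EVQ_CAPACITY) {
        q->slots[(q->head + q->count) % EVQ_CAPACITY] = ev;
        q->count++;
        ok = 1;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Consumer side (the user's thread). Dequeues one event and reports
 * how many remain, both under the same lock so the pair is consistent.
 * Returns 1 if an event was dequeued, 0 if the ring was empty. */
static int evq_pop(struct evq *q, int *ev, size_t *nmore)
{
    int ok = 0;

    assert(q != NULL && ev != NULL && nmore != NULL);
    pthread_mutex_lock(&q->lock);
    if (q->count > 0) {
        *ev = q->slots[q->head];
        q->head = (q->head + 1) % EVQ_CAPACITY;
        q->count--;
        ok = 1;
    }
    *nmore = q->count;
    pthread_mutex_unlock(&q->lock);
    return ok;
}
```

Keeping the lock inside the queue functions means neither the producer nor the consumer has to know which thread the other runs on, which is the situation the patch description is addressing.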
Signed-off-by: Sean Hefty --- dapl/udapl/dapl_evd_wait.c | 10 +++++++--- 1 files changed, 7 insertions(+), 3 deletions(-) diff --git a/dapl/udapl/dapl_evd_wait.c b/dapl/udapl/dapl_evd_wait.c index 9fc0ba2..c973397 100644 --- a/dapl/udapl/dapl_evd_wait.c +++ b/dapl/udapl/dapl_evd_wait.c @@ -149,7 +149,6 @@ DAT_RETURN DAT_API dapl_evd_wait ( { /* Bogus state, bail out */ dat_status = DAT_ERROR (DAT_INVALID_STATE,0); - dapl_os_unlock ( &evd_ptr->header.lock ); goto bail; } @@ -160,10 +159,8 @@ DAT_RETURN DAT_API dapl_evd_wait ( evd_ptr->evd_state = evd_state; dat_status = DAT_ERROR (DAT_INVALID_STATE, DAT_INVALID_STATE_EVD_UNWAITABLE); - dapl_os_unlock ( &evd_ptr->header.lock ); goto bail; } - dapl_os_unlock ( &evd_ptr->header.lock ); /* @@ -185,7 +182,9 @@ DAT_RETURN DAT_API dapl_evd_wait ( * return right away if the ib_cq_handle associate with these evd * equal to IB_INVALID_HANDLE */ + dapl_os_unlock(&evd_ptr->header.lock); dapls_evd_copy_cq(evd_ptr); + dapl_os_lock(&evd_ptr->header.lock); if (dapls_rbuf_count(&evd_ptr->pending_event_queue) >= threshold) { @@ -226,6 +225,7 @@ DAT_RETURN DAT_API dapl_evd_wait ( evd_ptr->threshold = threshold; DAPL_CNTR(evd_ptr, DCNT_EVD_WAIT_BLOCKED); + dapl_os_unlock(&evd_ptr->header.lock); #ifdef CQ_WAIT_OBJECT if (evd_ptr->cq_wait_obj_handle) @@ -235,6 +235,9 @@ DAT_RETURN DAT_API dapl_evd_wait ( #endif dat_status = dapl_os_wait_object_wait ( &evd_ptr->wait_object, time_out ); + + dapl_os_lock(&evd_ptr->header.lock); + /* * FIXME: if the thread loops around and waits again * the time_out value needs to be updated. 
@@ -276,6 +279,7 @@ DAT_RETURN DAT_API dapl_evd_wait ( *nmore = dapls_rbuf_count(&evd_ptr->pending_event_queue); bail: + dapl_os_unlock(&evd_ptr->header.lock); if ( dat_status ) { dapl_dbg_log (DAPL_DBG_TYPE_RTN, "dapl_evd_wait () returns 0x%x\n", dat_status); -- 1.5.2.5 From arlin.r.davis at intel.com Sun Apr 19 01:06:00 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Sun, 19 Apr 2009 01:06:00 -0700 Subject: [ofa-general] [PATCH] uDAPL v2: dapltest changes to support multi endpoints in any order Message-ID: dapltest assumes that events across multiple endpoints occur in a specific order. Since this is a false assumption, avoid this by directing events to per endpoint EVDs, rather than using shared EVDs. Signed-off-by: Sean Hefty --- test/dapltest/include/dapl_proto.h | 4 - test/dapltest/include/dapl_transaction_test.h | 8 +- test/dapltest/test/dapl_transaction_test.c | 264 +++++++++++-------------- test/dapltest/test/dapl_transaction_util.c | 12 +- 4 files changed, 121 insertions(+), 167 deletions(-) diff --git a/test/dapltest/include/dapl_proto.h b/test/dapltest/include/dapl_proto.h index d8be354..fb5a293 100644 --- a/test/dapltest/include/dapl_proto.h +++ b/test/dapltest/include/dapl_proto.h @@ -504,15 +504,12 @@ bool DT_handle_post_recv_buf (DT_Tdep_Print_Head* phead, bool DT_handle_send_op (DT_Tdep_Print_Head* phead, Ep_Context_t * ep_context, - DAT_EVD_HANDLE reqt_evd_hdl, unsigned int num_eps, int op_indx, bool poll); bool DT_handle_recv_op (DT_Tdep_Print_Head* phead, Ep_Context_t * ep_context, - DAT_EVD_HANDLE recv_evd_hdl, - DAT_EVD_HANDLE reqt_evd_hdl, unsigned int num_eps, int op_indx, bool poll, @@ -520,7 +517,6 @@ bool DT_handle_recv_op (DT_Tdep_Print_Head* phead, bool DT_handle_rdma_op (DT_Tdep_Print_Head* phead, Ep_Context_t * ep_context, - DAT_EVD_HANDLE reqt_evd_hdl, unsigned int num_eps, DT_Transfer_Type opcode, int op_indx, diff --git a/test/dapltest/include/dapl_transaction_test.h b/test/dapltest/include/dapl_transaction_test.h 
index 3e1a8e7..7401cdf 100644 --- a/test/dapltest/include/dapl_transaction_test.h +++ b/test/dapltest/include/dapl_transaction_test.h @@ -59,6 +59,10 @@ typedef struct Transaction_Test_Op_t op[ MAX_OPS ]; DAT_RSP_HANDLE rsp_handle; DAT_PSP_HANDLE psp_handle; + DAT_EVD_HANDLE recv_evd_hdl; /* receive */ + DAT_EVD_HANDLE reqt_evd_hdl; /* request+rmr */ + DAT_EVD_HANDLE conn_evd_hdl; /* connect */ + DAT_EVD_HANDLE creq_evd_hdl; /* "" request */ } Ep_Context_t; @@ -88,10 +92,6 @@ typedef struct /* This group set up by each thread in DT_Transaction_Main() */ DAT_PZ_HANDLE pz_handle; - DAT_EVD_HANDLE recv_evd_hdl; /* receive */ - DAT_EVD_HANDLE reqt_evd_hdl; /* request+rmr */ - DAT_EVD_HANDLE conn_evd_hdl; /* connect */ - DAT_EVD_HANDLE creq_evd_hdl; /* "" request */ Ep_Context_t *ep_context; /* Statistics set by DT_Transaction_Run() */ diff --git a/test/dapltest/test/dapl_transaction_test.c b/test/dapltest/test/dapl_transaction_test.c index 4abda1e..1c01456 100644 --- a/test/dapltest/test/dapl_transaction_test.c +++ b/test/dapltest/test/dapl_transaction_test.c @@ -273,63 +273,6 @@ DT_Transaction_Main (void *param) goto test_failure; } - /* create 4 EVDs - recv, request+RMR, conn-request, connect */ - ret = DT_Tdep_evd_create (test_ptr->ia_handle, - test_ptr->evd_length, - NULL, - DAT_EVD_DTO_FLAG, - &test_ptr->recv_evd_hdl); /* recv */ - if (ret != DAT_SUCCESS) - { - DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_evd_create (recv) error: %s\n", - test_ptr->base_port, DT_RetToString (ret)); - test_ptr->recv_evd_hdl = DAT_HANDLE_NULL; - goto test_failure; - } - - ret = DT_Tdep_evd_create (test_ptr->ia_handle, - test_ptr->evd_length, - NULL, - DAT_EVD_DTO_FLAG | DAT_EVD_RMR_BIND_FLAG, - &test_ptr->reqt_evd_hdl); /* request + rmr bind */ - if (ret != DAT_SUCCESS) - { - DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_evd_create (request) error: %s\n", - test_ptr->base_port, DT_RetToString (ret)); - test_ptr->reqt_evd_hdl = DAT_HANDLE_NULL; - goto test_failure; - } - - if 
(pt_ptr->local_is_server) - { - /* Client-side doesn't need CR events */ - ret = DT_Tdep_evd_create (test_ptr->ia_handle, - test_ptr->evd_length, - NULL, - DAT_EVD_CR_FLAG, - &test_ptr->creq_evd_hdl); /* cr */ - if (ret != DAT_SUCCESS) - { - DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_evd_create (cr) error: %s\n", - test_ptr->base_port, DT_RetToString (ret)); - test_ptr->creq_evd_hdl = DAT_HANDLE_NULL; - goto test_failure; - } - } - - ret = DT_Tdep_evd_create (test_ptr->ia_handle, - test_ptr->evd_length, - NULL, - DAT_EVD_CONNECTION_FLAG, - &test_ptr->conn_evd_hdl); /* conn */ - if (ret != DAT_SUCCESS) - { - DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_evd_create (conn) error: %s\n", - test_ptr->base_port, DT_RetToString (ret)); - test_ptr->conn_evd_hdl = DAT_HANDLE_NULL; - goto test_failure; - } - /* Allocate per-EP data */ test_ptr->ep_context = (Ep_Context_t *) DT_MemListAlloc (pt_ptr, @@ -359,6 +302,55 @@ DT_Transaction_Main (void *param) DAT_EP_ATTR ep_attr; DAT_UINT32 buff_size = MAX_OPS * sizeof (RemoteMemoryInfo); + /* create 4 EVDs - recv, request+RMR, conn-request, connect */ + ret = DT_Tdep_evd_create(test_ptr->ia_handle, test_ptr->evd_length, + NULL, DAT_EVD_DTO_FLAG, + &test_ptr->ep_context[i].recv_evd_hdl); + if (ret != DAT_SUCCESS) + { + DT_Tdep_PT_Printf(phead, "Test[" F64x "]: dat_evd_create (recv) error: %s\n", + test_ptr->base_port, DT_RetToString (ret)); + test_ptr->ep_context[i].recv_evd_hdl = DAT_HANDLE_NULL; + goto test_failure; + } + + ret = DT_Tdep_evd_create(test_ptr->ia_handle, test_ptr->evd_length, + NULL, DAT_EVD_DTO_FLAG | DAT_EVD_RMR_BIND_FLAG, + &test_ptr->ep_context[i].reqt_evd_hdl); + if (ret != DAT_SUCCESS) + { + DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_evd_create (request) error: %s\n", + test_ptr->base_port, DT_RetToString (ret)); + test_ptr->ep_context[i].reqt_evd_hdl = DAT_HANDLE_NULL; + goto test_failure; + } + + if (pt_ptr->local_is_server) + { + /* Client-side doesn't need CR events */ + ret = 
DT_Tdep_evd_create(test_ptr->ia_handle, test_ptr->evd_length, + NULL, DAT_EVD_CR_FLAG, + &test_ptr->ep_context[i].creq_evd_hdl); + if (ret != DAT_SUCCESS) + { + DT_Tdep_PT_Printf(phead, "Test[" F64x "]: dat_evd_create (cr) error: %s\n", + test_ptr->base_port, DT_RetToString (ret)); + test_ptr->ep_context[i].creq_evd_hdl = DAT_HANDLE_NULL; + goto test_failure; + } + } + + ret = DT_Tdep_evd_create(test_ptr->ia_handle, test_ptr->evd_length, + NULL, DAT_EVD_CONNECTION_FLAG, + &test_ptr->ep_context[i].conn_evd_hdl); + if (ret != DAT_SUCCESS) + { + DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_evd_create (conn) error: %s\n", + test_ptr->base_port, DT_RetToString (ret)); + test_ptr->ep_context[i].conn_evd_hdl = DAT_HANDLE_NULL; + goto test_failure; + } + /* * Adjust default EP attributes to fit the requested test. * This is simplistic; in that we don't count ops of each @@ -379,9 +371,9 @@ DT_Transaction_Main (void *param) /* Create EP */ ret = dat_ep_create (test_ptr->ia_handle, /* IA */ test_ptr->pz_handle, /* PZ */ - test_ptr->recv_evd_hdl, /* recv */ - test_ptr->reqt_evd_hdl, /* request */ - test_ptr->conn_evd_hdl, /* connect */ + test_ptr->ep_context[i].recv_evd_hdl, /* recv */ + test_ptr->ep_context[i].reqt_evd_hdl, /* request */ + test_ptr->ep_context[i].conn_evd_hdl, /* connect */ &ep_attr, /* EP attrs */ &test_ptr->ep_context[i].ep_handle); if (ret != DAT_SUCCESS) @@ -470,7 +462,7 @@ DT_Transaction_Main (void *param) ret = dat_rsp_create (test_ptr->ia_handle, test_ptr->ep_context[i].ia_port, test_ptr->ep_context[i].ep_handle, - test_ptr->creq_evd_hdl, + test_ptr->ep_context[i].creq_evd_hdl, &test_ptr->ep_context[i].rsp_handle); if (ret != DAT_SUCCESS) { @@ -483,7 +475,7 @@ DT_Transaction_Main (void *param) { ret = dat_psp_create (test_ptr->ia_handle, test_ptr->ep_context[i].ia_port, - test_ptr->creq_evd_hdl, + test_ptr->ep_context[i].creq_evd_hdl, DAT_PSP_CONSUMER_FLAG, &test_ptr->ep_context[i].psp_handle); if (ret != DAT_SUCCESS) @@ -520,7 +512,7 @@ 
DT_Transaction_Main (void *param) DT_Mdep_Unlock (&pt_ptr->Thread_counter_lock); } - + for (i = 0; i < test_ptr->cmd->eps_per_thread; i++) { DAT_UINT32 buff_size = MAX_OPS * sizeof (RemoteMemoryInfo); @@ -542,7 +534,7 @@ DT_Transaction_Main (void *param) /* wait for the connection request */ if (!DT_cr_event_wait (phead, - test_ptr->creq_evd_hdl, + test_ptr->ep_context[i].creq_evd_hdl, &cr_stat) || !DT_cr_check ( phead, &cr_stat, @@ -593,7 +585,7 @@ DT_Transaction_Main (void *param) /* wait for DAT_CONNECTION_EVENT_ESTABLISHED */ if (!DT_conn_event_wait ( phead, test_ptr->ep_context[i].ep_handle, - test_ptr->conn_evd_hdl, + test_ptr->ep_context[i].conn_evd_hdl, &event_num)) { /* error message printed by DT_conn_event_wait */ @@ -616,14 +608,13 @@ DT_Transaction_Main (void *param) */ /* wait for a connection request */ if (!DT_cr_event_wait (phead, - test_ptr->creq_evd_hdl, + test_ptr->ep_context[i].creq_evd_hdl, &cr_stat) ) { DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_psp_create #%d error: %s\n", test_ptr->base_port, i, DT_RetToString (ret)); goto test_failure; } - if ( !DT_cr_check ( phead, &cr_stat, test_ptr->ep_context[i].psp_handle, @@ -675,7 +666,7 @@ DT_Transaction_Main (void *param) /* wait for DAT_CONNECTION_EVENT_ESTABLISHED */ if (!DT_conn_event_wait ( phead, test_ptr->ep_context[i].ep_handle, - test_ptr->conn_evd_hdl, + test_ptr->ep_context[i].conn_evd_hdl, &event_num)) { /* error message printed by DT_cr_event_wait */ @@ -732,7 +723,7 @@ retry: /* wait for DAT_CONNECTION_EVENT_ESTABLISHED */ if (!DT_conn_event_wait ( phead, test_ptr->ep_context[i].ep_handle, - test_ptr->conn_evd_hdl, + test_ptr->ep_context[i].conn_evd_hdl, &event_num)) { /* error message printed by DT_cr_event_wait */ @@ -750,7 +741,7 @@ retry: dat_ep_reset (test_ptr->ep_context[i].ep_handle); do { - ret = DT_Tdep_evd_dequeue ( test_ptr->recv_evd_hdl, + ret = DT_Tdep_evd_dequeue ( test_ptr->ep_context[i].recv_evd_hdl, &event); drained++; } while (DAT_GET_TYPE(ret) != 
DAT_QUEUE_EMPTY); @@ -845,7 +836,7 @@ retry: test_ptr->ia_handle, test_ptr->pz_handle, test_ptr->ep_context[i].ep_handle, - test_ptr->reqt_evd_hdl, + test_ptr->ep_context[i].reqt_evd_hdl, test_ptr->ep_context[i].op[j].seg_size, test_ptr->ep_context[i].op[j].num_segs, DAT_OPTIMAL_ALIGNMENT, @@ -881,7 +872,7 @@ retry: test_ptr->ia_handle, test_ptr->pz_handle, test_ptr->ep_context[i].ep_handle, - test_ptr->reqt_evd_hdl, + test_ptr->ep_context[i].reqt_evd_hdl, test_ptr->ep_context[i].op[j].seg_size, test_ptr->ep_context[i].op[j].num_segs, DAT_OPTIMAL_ALIGNMENT, @@ -1000,7 +991,7 @@ retry: (DAT_PVOID) DT_Bpool_GetBuffer ( test_ptr->ep_context[i].bp, RMI_SEND_BUFFER_ID); - if (!DT_dto_event_wait (phead, test_ptr->reqt_evd_hdl, &dto_stat) || + if (!DT_dto_event_wait (phead, test_ptr->ep_context[i].reqt_evd_hdl, &dto_stat) || !DT_dto_check ( phead, &dto_stat, test_ptr->ep_context[i].ep_handle, @@ -1024,7 +1015,7 @@ retry: (DAT_PVOID) DT_Bpool_GetBuffer ( test_ptr->ep_context[i].bp, RMI_RECV_BUFFER_ID); - if (!DT_dto_event_wait (phead, test_ptr->recv_evd_hdl, &dto_stat) || + if (!DT_dto_event_wait (phead, test_ptr->ep_context[i].recv_evd_hdl, &dto_stat) || !DT_dto_check ( phead, &dto_stat, test_ptr->ep_context[i].ep_handle, @@ -1053,7 +1044,7 @@ retry: (DAT_PVOID) DT_Bpool_GetBuffer ( test_ptr->ep_context[i].bp, RMI_SEND_BUFFER_ID); - if (!DT_dto_event_wait (phead, test_ptr->reqt_evd_hdl, &dto_stat) || + if (!DT_dto_event_wait (phead, test_ptr->ep_context[i].reqt_evd_hdl, &dto_stat) || !DT_dto_check ( phead, &dto_stat, test_ptr->ep_context[i].ep_handle, @@ -1183,33 +1174,14 @@ test_failure: if ( success ) /* Ensure DT_Transaction_Run did not return error otherwise may get stuck waiting for disconnect event*/ { if (!DT_disco_event_wait ( phead, - test_ptr->conn_evd_hdl, + test_ptr->ep_context[i].conn_evd_hdl, &ep_handle)) { DT_Tdep_PT_Printf (phead, "Test[" F64x "]: bad disconnect event\n", test_ptr->base_port); } - else - { - /* - * We have successfully obtained a completed 
EP. We are - * racing with the remote node on disconnects, so we - * don't know which EP this is. Run the list and - * remove it so we don't disconnect a disconnected EP - */ - for (j = 0; j < test_ptr->cmd->eps_per_thread; j++) - { - if ( test_ptr->ep_context[j].ep_handle == ep_handle ) - { - test_ptr->ep_context[j].ep_handle = NULL; - } - } - } - } - else /* !success - QP may be in error state */ - { - ep_handle = test_ptr->ep_context[i].ep_handle; } + ep_handle = test_ptr->ep_context[i].ep_handle; /* * Free the handle returned by the disconnect event. @@ -1226,7 +1198,7 @@ test_failure: */ do { - ret = DT_Tdep_evd_dequeue ( test_ptr->recv_evd_hdl, + ret = DT_Tdep_evd_dequeue ( test_ptr->ep_context[i].recv_evd_hdl, &event); } while (ret == DAT_SUCCESS); /* Destroy the EP */ @@ -1238,56 +1210,51 @@ test_failure: /* carry on trying, regardless */ } } + /* clean up the EVDs */ + if (test_ptr->ep_context[i].conn_evd_hdl) + { + ret = DT_Tdep_evd_free (test_ptr->ep_context[i].conn_evd_hdl); + if (ret != DAT_SUCCESS) + { + DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_evd_free (conn) error: %s\n", + test_ptr->base_port, DT_RetToString (ret)); + } + } + if (pt_ptr->local_is_server) + { + if (test_ptr->ep_context[i].creq_evd_hdl) + { + ret = DT_Tdep_evd_free (test_ptr->ep_context[i].creq_evd_hdl); + if (ret != DAT_SUCCESS) + { + DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_evd_free (creq) error: %s\n", + test_ptr->base_port, DT_RetToString (ret)); + } + } + } + if (test_ptr->ep_context[i].reqt_evd_hdl) + { + ret = DT_Tdep_evd_free (test_ptr->ep_context[i].reqt_evd_hdl); + if (ret != DAT_SUCCESS) + { + DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_evd_free (reqt) error: %s\n", + test_ptr->base_port, DT_RetToString (ret)); + } + } + if (test_ptr->ep_context[i].recv_evd_hdl) + { + ret = DT_Tdep_evd_free (test_ptr->ep_context[i].recv_evd_hdl); + if (ret != DAT_SUCCESS) + { + DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_evd_free (recv) error: %s\n", + 
test_ptr->base_port, DT_RetToString (ret)); + } + } } /* end foreach per-EP context */ DT_MemListFree (pt_ptr, test_ptr->ep_context); } - /* clean up the EVDs */ - if (test_ptr->conn_evd_hdl) - { - ret = DT_Tdep_evd_free (test_ptr->conn_evd_hdl); - if (ret != DAT_SUCCESS) - { - DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_evd_free (conn) error: %s\n", - test_ptr->base_port, DT_RetToString (ret)); - /* fall through, keep trying */ - } - } - if (pt_ptr->local_is_server) - { - if (test_ptr->creq_evd_hdl) - { - ret = DT_Tdep_evd_free (test_ptr->creq_evd_hdl); - if (ret != DAT_SUCCESS) - { - DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_evd_free (creq) error: %s\n", - test_ptr->base_port, DT_RetToString (ret)); - /* fall through, keep trying */ - } - } - } - if (test_ptr->reqt_evd_hdl) - { - ret = DT_Tdep_evd_free (test_ptr->reqt_evd_hdl); - if (ret != DAT_SUCCESS) - { - DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_evd_free (reqt) error: %s\n", - test_ptr->base_port, DT_RetToString (ret)); - /* fall through, keep trying */ - } - } - if (test_ptr->recv_evd_hdl) - { - ret = DT_Tdep_evd_free (test_ptr->recv_evd_hdl); - if (ret != DAT_SUCCESS) - { - DT_Tdep_PT_Printf (phead, "Test[" F64x "]: dat_evd_free (recv) error: %s\n", - test_ptr->base_port, DT_RetToString (ret)); - /* fall through, keep trying */ - } - } - /* clean up the PZ */ if (test_ptr->pz_handle) { @@ -1399,7 +1366,7 @@ DT_Transaction_Run (DT_Tdep_Print_Head *phead, Transaction_Test_t * test_ptr) DAT_DTO_COMPLETION_EVENT_DATA dto_stat; if ( !DT_dto_event_wait (phead, - test_ptr->reqt_evd_hdl, + test_ptr->ep_context[i].reqt_evd_hdl, &dto_stat) ) { DT_Tdep_PT_Debug (1,(phead,"Test[" F64x "]: Server sync send error\n", @@ -1416,7 +1383,7 @@ DT_Transaction_Run (DT_Tdep_Print_Head *phead, Transaction_Test_t * test_ptr) DAT_DTO_COMPLETION_EVENT_DATA dto_stat; if ( !DT_dto_event_wait (phead, - test_ptr->recv_evd_hdl, + test_ptr->ep_context[i].recv_evd_hdl, &dto_stat) ) { DT_Tdep_PT_Debug (1,(phead,"Test[" F64x 
"]: Server sync recv error\n", @@ -1437,7 +1404,7 @@ DT_Transaction_Run (DT_Tdep_Print_Head *phead, Transaction_Test_t * test_ptr) DAT_DTO_COMPLETION_EVENT_DATA dto_stat; if ( !DT_dto_event_wait (phead, - test_ptr->recv_evd_hdl, + test_ptr->ep_context[i].recv_evd_hdl, &dto_stat) ) { DT_Tdep_PT_Debug (1,(phead,"Test[" F64x "]: Client sync recv error\n", @@ -1474,7 +1441,7 @@ DT_Transaction_Run (DT_Tdep_Print_Head *phead, Transaction_Test_t * test_ptr) DAT_DTO_COMPLETION_EVENT_DATA dto_stat; if ( !DT_dto_event_wait (phead, - test_ptr->reqt_evd_hdl, + test_ptr->ep_context[i].reqt_evd_hdl, &dto_stat) ) { goto bail; @@ -1518,7 +1485,6 @@ DT_Transaction_Run (DT_Tdep_Print_Head *phead, Transaction_Test_t * test_ptr) op)); if (!DT_handle_rdma_op (phead, test_ptr->ep_context, - test_ptr->reqt_evd_hdl, test_ptr->cmd->eps_per_thread, RDMA_READ, op, @@ -1542,7 +1508,6 @@ DT_Transaction_Run (DT_Tdep_Print_Head *phead, Transaction_Test_t * test_ptr) op)); if (!DT_handle_rdma_op (phead, test_ptr->ep_context, - test_ptr->reqt_evd_hdl, test_ptr->cmd->eps_per_thread, RDMA_WRITE, op, @@ -1566,7 +1531,6 @@ DT_Transaction_Run (DT_Tdep_Print_Head *phead, Transaction_Test_t * test_ptr) /* send data */ if (!DT_handle_send_op (phead, test_ptr->ep_context, - test_ptr->reqt_evd_hdl, test_ptr->cmd->eps_per_thread, op, test_ptr->cmd->poll)) @@ -1582,8 +1546,6 @@ DT_Transaction_Run (DT_Tdep_Print_Head *phead, Transaction_Test_t * test_ptr) if (!DT_handle_recv_op (phead, test_ptr->ep_context, - test_ptr->recv_evd_hdl, - test_ptr->reqt_evd_hdl, test_ptr->cmd->eps_per_thread, op, test_ptr->cmd->poll, diff --git a/test/dapltest/test/dapl_transaction_util.c b/test/dapltest/test/dapl_transaction_util.c index 970aa8c..93312c7 100644 --- a/test/dapltest/test/dapl_transaction_util.c +++ b/test/dapltest/test/dapl_transaction_util.c @@ -87,7 +87,6 @@ DT_handle_post_recv_buf (DT_Tdep_Print_Head *phead, bool DT_handle_send_op (DT_Tdep_Print_Head *phead, Ep_Context_t * ep_context, - DAT_EVD_HANDLE 
reqt_evd_hdl, unsigned int num_eps, int op_indx, bool poll) @@ -161,7 +160,7 @@ DT_handle_send_op (DT_Tdep_Print_Head *phead, DAT_DTO_COOKIE dto_cookie; unsigned int epnum; - if (!DT_dto_event_reap (phead, reqt_evd_hdl, poll, &dto_stat)) + if (!DT_dto_event_reap (phead, ep_context[i].reqt_evd_hdl, poll, &dto_stat)) { DT_Mdep_Free (completion_reaped); return false; @@ -235,8 +234,6 @@ DT_handle_send_op (DT_Tdep_Print_Head *phead, bool DT_handle_recv_op (DT_Tdep_Print_Head *phead, Ep_Context_t * ep_context, - DAT_EVD_HANDLE recv_evd_hdl, - DAT_EVD_HANDLE reqt_evd_hdl, unsigned int num_eps, int op_indx, bool poll, @@ -270,7 +267,7 @@ DT_handle_recv_op (DT_Tdep_Print_Head *phead, unsigned int epnum; /* First reap the recv DTO event */ - if (!DT_dto_event_reap (phead, recv_evd_hdl, poll, &dto_stat)) + if (!DT_dto_event_reap (phead, ep_context[i].recv_evd_hdl, poll, &dto_stat)) { DT_Mdep_Free (recv_completion_reaped); DT_Mdep_Free (send_completion_reaped); @@ -340,7 +337,7 @@ DT_handle_recv_op (DT_Tdep_Print_Head *phead, return false; } - if (!DT_dto_event_reap (phead, reqt_evd_hdl, poll, &dto_stat)) + if (!DT_dto_event_reap (phead, ep_context[i].reqt_evd_hdl, poll, &dto_stat)) { DT_Mdep_Free (recv_completion_reaped); DT_Mdep_Free (send_completion_reaped); @@ -463,7 +460,6 @@ DT_handle_recv_op (DT_Tdep_Print_Head *phead, bool DT_handle_rdma_op (DT_Tdep_Print_Head *phead, Ep_Context_t * ep_context, - DAT_EVD_HANDLE reqt_evd_hdl, unsigned int num_eps, DT_Transfer_Type opcode, int op_indx, @@ -561,7 +557,7 @@ DT_handle_rdma_op (DT_Tdep_Print_Head *phead, DAT_DTO_COOKIE dto_cookie; unsigned int epnum; - if (!DT_dto_event_reap (phead, reqt_evd_hdl, poll, &dto_stat)) + if (!DT_dto_event_reap (phead, ep_context[i].reqt_evd_hdl, poll, &dto_stat)) { DT_Mdep_Free (completion_reaped); return ( false ); -- 1.5.2.5 From arlin.r.davis at intel.com Sun Apr 19 01:06:05 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Sun, 19 Apr 2009 01:06:05 -0700 Subject: [ofa-general] 
[PATCH] uDAPL v2: dapltest, adjust next port number for number of threads to avoid duplication
Message-ID:

To avoid duplicating port numbers between different tests, the next port
number to use must increment based on the number of endpoints per thread *
the number of threads.

Signed-off-by: Sean Hefty
---
 test/dapltest/test/dapl_server.c | 3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/test/dapltest/test/dapl_server.c b/test/dapltest/test/dapl_server.c
index d589e7b..4d386fe 100644
--- a/test/dapltest/test/dapl_server.c
+++ b/test/dapltest/test/dapl_server.c
@@ -480,6 +480,9 @@ DT_cs_Server (Params_t * params_ptr)
     case TRANSACTION_TEST:
     {
         /* create a thread to handle this pt_ptr; */
+        ps_ptr->NextPortNumber +=
+            (pt_ptr->Params.u.Transaction_Cmd.eps_per_thread - 1) *
+            pt_ptr->Client_Info.total_threads;
         DT_Tdep_PT_Debug (1,(phead,"%s: Creating Transaction Test Thread\n", module));
         pt_ptr->thread = DT_Thread_Create (pt_ptr,
                                            DT_Transaction_Test_Server,
-- 
1.5.2.5

From arlin.r.davis at intel.com Sun Apr 19 01:07:06 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Sun, 19 Apr 2009 01:07:06 -0700
Subject: [ofa-general] [PATCH] uDAPL v2: dapltest: reset server listen ports to avoid collisions during long runs
Message-ID:

If the server is running continuously, the port number increments from base
without resetting between tests. This will eventually cause collisions in
port space.
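
For readers following the port arithmetic, here is a minimal sketch of the accounting these two dapltest patches appear to aim for: each test reserves eps_per_thread * total_threads ports, and the counter wraps back to just above the base once it is 1000 or more past it. The names port_state_t, reserve_ports, SKETCH_BASE_PORT and SKETCH_PORT_WINDOW are hypothetical, and the base value 5000 is made up for illustration; the real SERVER_PORT_NUMBER lives in the dapltest headers.

```c
#include <assert.h>

/* Hypothetical constants for illustration only; not the dapltest values. */
#define SKETCH_BASE_PORT   5000u
#define SKETCH_PORT_WINDOW 1000u

/* Hypothetical stand-in for the server state that holds NextPortNumber. */
typedef struct {
    unsigned int next_port;
} port_state_t;

/*
 * Reserve the ports for one transaction test and return the first one.
 * Mirrors the order of operations described in the patches: wrap check,
 * record first port, advance by one port per thread, then advance by the
 * remaining (eps_per_thread - 1) ports per thread.
 */
static unsigned int
reserve_ports(port_state_t *ps, unsigned int total_threads,
              unsigned int eps_per_thread)
{
    unsigned int first;

    assert(total_threads > 0 && eps_per_thread > 0);

    /* Wrap so a long-running server does not walk through port space. */
    if (ps->next_port >= SKETCH_BASE_PORT + SKETCH_PORT_WINDOW)
        ps->next_port = SKETCH_BASE_PORT + 1;

    first = ps->next_port;

    /* One port per thread ... */
    ps->next_port += total_threads;
    /* ... plus the extra endpoints each thread listens on. */
    ps->next_port += (eps_per_thread - 1) * total_threads;

    return first;
}
```

With 2 threads and 3 endpoints per thread, one call starting at base + 1 returns base + 1 and leaves the counter at base + 7, so the next test cannot reuse any of the six ports just handed out.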
Signed-off-by: Arlin Davis --- test/dapltest/test/dapl_server.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/test/dapltest/test/dapl_server.c b/test/dapltest/test/dapl_server.c index 4d386fe..fc6c8af 100644 --- a/test/dapltest/test/dapl_server.c +++ b/test/dapltest/test/dapl_server.c @@ -465,6 +465,9 @@ DT_cs_Server (Params_t * params_ptr) DT_Tdep_PT_Debug (1,(phead,"%s: Send Server_Info\n", module)); pt_ptr->Server_Info.dapltest_version = DAPLTEST_VERSION; pt_ptr->Server_Info.is_little_endian = DT_local_is_little_endian; + /* reset port, don't eat up port space on long runs */ + if (ps_ptr->NextPortNumber >= SERVER_PORT_NUMBER + 1000) + ps_ptr->NextPortNumber = SERVER_PORT_NUMBER + 1; pt_ptr->Server_Info.first_port_number = ps_ptr->NextPortNumber; ps_ptr->NextPortNumber += pt_ptr->Client_Info.total_threads; -- 1.5.2.5 From vlad at lists.openfabrics.org Sun Apr 19 03:23:45 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 19 Apr 2009 03:23:45 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090419-0200 daily build status Message-ID: <20090419102345.A6EFCE612AB@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with 
linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From hnrose at comcast.net Sun Apr 19 03:32:41 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sun, 19 Apr 2009 06:32:41 -0400 Subject: [ofa-general] ***SPAM*** [PATCH] infiniband-diags/man/vendstat.8: Indicate IS4 config group config not persistent across IS4 reboot Message-ID: <20090419103241.GA25675@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/infiniband-diags/man/vendstat.8 b/infiniband-diags/man/vendstat.8 index 803e3b7..bdd0c3c 100644 --- a/infiniband-diags/man/vendstat.8 +++ b/infiniband-diags/man/vendstat.8 @@ -29,7 +29,8 @@ show IS4 counter group info. \fB\-c\fR configure IS4 counter groups. -Configure IS4 counter groups 0 and 1. +Configure IS4 counter groups 0 and 1. Such configuration +is not persistent across IS4 reboot. 
 First number is for counter group 0 and second is for counter group 1.

From tziporet at dev.mellanox.co.il Sun Apr 19 03:58:48 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 19 Apr 2009 13:58:48 +0300
Subject: [ofa-general] Re: 1.5 roadmap
In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F4266DD4E8@orsmsx507.amr.corp.intel.com>
References: <382A478CAD40FA4FB46605CF81FE39F4266DD407@orsmsx507.amr.corp.intel.com> <15ddcffd0904071601q30431502t41618f1438bf2a2a@mail.gmail.com> <382A478CAD40FA4FB46605CF81FE39F4266DD46F@orsmsx507.amr.corp.intel.com> <15ddcffd0904071637q23406905x8e9636b51e02b2f0@mail.gmail.com> <382A478CAD40FA4FB46605CF81FE39F4266DD4E8@orsmsx507.amr.corp.intel.com>
Message-ID: <49EB03E8.6050208@mellanox.co.il>

Woodruff, Robert J wrote:
>
> Good point. If we want to have a feature freeze date of May 7,
> it might be a good idea to get these code reviews started ASAP,
> and even then, might be hard to get them all done before May 7.
>
>
If the code is not ready on time, it will not be included in the release.
However, since the Vnic driver code is an add-on module, I think we can add
it later in the game, since it does not harm the stability of other OFED
components.

Tziporet

From monis at Voltaire.COM Sun Apr 19 04:58:29 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Sun, 19 Apr 2009 14:58:29 +0300
Subject: [ofa-general] [PATCH] rdma_cm: Add debugfs entries to monitor rdma_cm connections
In-Reply-To:
References: <49E71FE1.90102@Voltaire.COM>
Message-ID: <49EB11E5.5000407@Voltaire.COM>

Thanks

Sean Hefty wrote:
> If the path is:
>
> /sys/kernel/debug/rdma_cm/mthca0_rdma_id
>
> do we really need to append '_rdma_id' at the end? (I'll defer to others if
> debugfs is the right location or not.)
rdma_id is a suffix that leaves room for more, or in other words - I just
wanted to leave room for other debug information in the future (e.g.
the total count of incoming connections on the device)
>
> nit:
> I'm
> not
> a
> big
> fan
> of
> one
> parameter
> per
> line.
>
I wanted to make it easier for the code reader, but I'm not a style fanatic :)
I'll change it if it's critical.
> :)
>
> It's not readily apparent to me what several of the headings are (TP, PO, PS,
> ST) or what the numeric values map to (for TP, PS, ST).
>
TP=TyPe (Device type)
PO=POrt (Port Number)
PS=PortSpace
ST=STate
I tried to shorten the output line as much as possible so that the output
reads as an easy-to-read table (on most screens the output will be one line
per rdma_id). The same thought made me print only the numeric value and not
its string value. Again, I'm not attached to this style and will change it if
required.
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

From tziporet at dev.mellanox.co.il Sun Apr 19 05:21:51 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 19 Apr 2009 15:21:51 +0300
Subject: [ofa-general] OFED version for RHEL/CentOS 5.3?
In-Reply-To: <49E917F9.9070208@nsc.liu.se>
References: <49E917F9.9070208@nsc.liu.se>
Message-ID: <49EB175F.9050901@mellanox.co.il>

Pär Andersson wrote:
> Hello,
>
> Our main academic cluster (805 nodes, ConnectX 4x DDR) is currently
> running CentOS 5.2 with a self-compiled OFED 1.3.1. I am preparing an
> upgrade of the compute nodes to CentOS 5.3 and is trying to decide
> what IB stack to use.
>
> All IB packages included in CentOS 5.3 seems to be newer than our
> current OFED 1.3.1, at least according to the RPM version numbers. Do
> you know which OFED release the 5.3 packages is based on?
>
> The options I am considering is:
>
> * Keep our current OFED 1.3.1 packages, just rebuild kernel-ib.
> * Switch to the packages included in CentOS 5.3.
> * Build and install OFED 1.4.
You will need to take 1.4.1, since 1.4 does not support CentOS 5.3.
>
> How compatible is 1.4 with 1.3, if we should install that? Will MPI
> libraries and other applications continue to work or need to be
> recompiled?
I think your best option is to take the libraries and kernel from CentOS 5.3.
If not, I would move to OFED 1.4.1 (GA soon).

Tziporet

From jackm at dev.mellanox.co.il Sun Apr 19 06:17:23 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 19 Apr 2009 16:17:23 +0300
Subject: [ofa-general] [PATCH] mthca: increase INIT_HCA timeout
In-Reply-To: <20090413184657.GE22355@sgi.com>
References: <20090413184657.GE22355@sgi.com>
Message-ID: <200904191617.23823.jackm@dev.mellanox.co.il>

On Monday 13 April 2009 21:46, akepner at sgi.com wrote:
> Here's a little patch we've been carrying along for a while.
>
> If the num_qp module parameter is set higher than 2^19 or so,
> HCA initialization times out with EBUSY, e.g.:
>
> ib_mthca: probe of 0031:01:00.0 failed with error -16
>
> A 60 second timeout seems to be sufficient for the max number
> of QPs that the h/w can accomodate.
>
> Signed-off-by: Arthur Kepner
>
I'm posting a slightly different patch for this issue.
There are other commands which also had a 1 HZ timeout, where according to
the InfiniHost/InfiniHost III PRMs the timeout should have been higher -- or
different. All commands in the driver should have 60-second timeouts to be on
the safe side, and the patch which follows corrects this oversight for
INIT_HCA, CLOSE_HCA, SYS_EN, SYS_DIS, and CLOSE_IB.

- Jack

From jackm at dev.mellanox.co.il Sun Apr 19 06:20:32 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Sun, 19 Apr 2009 16:20:32 +0300
Subject: [ofa-general] [PATCH] ib_mthca: Bring INIT_HCA and other commands timeout into consistency with PRM
Message-ID: <200904191620.32466.jackm@dev.mellanox.co.il>

Commands INIT_HCA, CLOSE_HCA, SYS_EN, SYS_DIS, and CLOSE_IB all had 1 second timeouts.
For INIT_HCA this caused problems when more than 2^18 max QPs were configured.

All other commands have 60-second timeouts. This patch brings the above
commands into consistency with the rest of the commands.

This patch is an expansion of the INIT_HCA timeout patch submitted by
A. Kepner.

Signed-off-by: Jack Morgenstein

Index: ofed_kernel/drivers/infiniband/hw/mthca/mthca_cmd.c
===================================================================
--- ofed_kernel.orig/drivers/infiniband/hw/mthca/mthca_cmd.c 2009-04-19 14:54:12.000000000 +0300
+++ ofed_kernel/drivers/infiniband/hw/mthca/mthca_cmd.c 2009-04-19 15:49:13.655998000 +0300
@@ -157,13 +157,15 @@ enum {
 enum {
     CMD_TIME_CLASS_A = (HZ + 999) / 1000 + 1,
     CMD_TIME_CLASS_B = (HZ + 99) / 100 + 1,
-    CMD_TIME_CLASS_C = (HZ + 9) / 10 + 1
+    CMD_TIME_CLASS_C = (HZ + 9) / 10 + 1,
+    CMD_TIME_CLASS_D = 60 * HZ
 };
 #else
 enum {
     CMD_TIME_CLASS_A = 60 * HZ,
     CMD_TIME_CLASS_B = 60 * HZ,
-    CMD_TIME_CLASS_C = 60 * HZ
+    CMD_TIME_CLASS_C = 60 * HZ,
+    CMD_TIME_CLASS_D = 60 * HZ
 };
 #endif
 
@@ -598,7 +600,7 @@ int mthca_SYS_EN(struct mthca_dev *dev,
     u64 out;
     int ret;
 
-    ret = mthca_cmd_imm(dev, 0, &out, 0, 0, CMD_SYS_EN, HZ, status);
+    ret = mthca_cmd_imm(dev, 0, &out, 0, 0, CMD_SYS_EN, CMD_TIME_CLASS_D, status);
 
     if (*status == MTHCA_CMD_STAT_DDR_MEM_ERR)
         mthca_warn(dev, "SYS_EN DDR error: syn=%x, sock=%d, "
@@ -611,7 +613,7 @@ int mthca_SYS_EN(struct mthca_dev *dev,
 
 int mthca_SYS_DIS(struct mthca_dev *dev, u8 *status)
 {
-    return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, HZ, status);
+    return mthca_cmd(dev, 0, 0, 0, CMD_SYS_DIS, CMD_TIME_CLASS_C, status);
 }
 
 static int mthca_map_cmd(struct mthca_dev *dev, u16 op, struct mthca_icm *icm,
@@ -1390,7 +1392,7 @@ int mthca_INIT_HCA(struct mthca_dev *dev
         MTHCA_PUT(inbox, param->uarc_base, INIT_HCA_UAR_CTX_BASE_OFFSET);
     }
 
-    err = mthca_cmd(dev, mailbox->dma, 0, 0, CMD_INIT_HCA, HZ, status);
+    err = mthca_cmd(dev, mailbox->dma, 0, 0, CMD_INIT_HCA, CMD_TIME_CLASS_D, status);
 
     mthca_free_mailbox(dev,
mailbox); return err; @@ -1450,12 +1452,12 @@ int mthca_INIT_IB(struct mthca_dev *dev, int mthca_CLOSE_IB(struct mthca_dev *dev, int port, u8 *status) { - return mthca_cmd(dev, 0, port, 0, CMD_CLOSE_IB, HZ, status); + return mthca_cmd(dev, 0, port, 0, CMD_CLOSE_IB, CMD_TIME_CLASS_A, status); } int mthca_CLOSE_HCA(struct mthca_dev *dev, int panic, u8 *status) { - return mthca_cmd(dev, 0, 0, panic, CMD_CLOSE_HCA, HZ, status); + return mthca_cmd(dev, 0, 0, panic, CMD_CLOSE_HCA, CMD_TIME_CLASS_C, status); } int mthca_SET_IB(struct mthca_dev *dev, struct mthca_set_ib_param *param, From arkady.kanevsky at gmail.com Sun Apr 19 06:23:50 2009 From: arkady.kanevsky at gmail.com (arkady kanevsky) Date: Sun, 19 Apr 2009 09:23:50 -0400 Subject: [ofa-general] [PATCH] uDAPL v2: dapltest, adjust next port number for number of threads to avoid duplication In-Reply-To: References: Message-ID: <517c62fb0904190623n7a42f670ia90e01e6655b65fb@mail.gmail.com> Arlin, did we encounter this issue at interop? I do not recall seeing it. For IB, does Wait period between assigning the same QP# handles it? Thanks, Arkady On Sun, Apr 19, 2009 at 4:06 AM, Davis, Arlin R wrote: > > To avoid duplicating port numbers between different tests, the next port > number to use must increment based on the number of endpoints per thread * > the number of threads. 
> > Signed-off-by: Sean Hefty > --- > test/dapltest/test/dapl_server.c | 3 +++ > 1 files changed, 3 insertions(+), 0 deletions(-) > > diff --git a/test/dapltest/test/dapl_server.c > b/test/dapltest/test/dapl_server.c > index d589e7b..4d386fe 100644 > --- a/test/dapltest/test/dapl_server.c > +++ b/test/dapltest/test/dapl_server.c > @@ -480,6 +480,9 @@ DT_cs_Server (Params_t * params_ptr) > case TRANSACTION_TEST: > { > /* create a thread to handle this pt_ptr; */ > + ps_ptr->NextPortNumber += > + (pt_ptr->Params.u.Transaction_Cmd.eps_per_thread - > 1) * > + pt_ptr->Client_Info.total_threads; > DT_Tdep_PT_Debug (1,(phead,"%s: Creating Transaction Test > Thread\n", module)); > pt_ptr->thread = DT_Thread_Create (pt_ptr, > DT_Transaction_Test_Server, > -- > 1.5.2.5 > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -- Cheers, Arkady Kanevsky -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hnrose at comcast.net Sun Apr 19 07:53:06 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sun, 19 Apr 2009 10:53:06 -0400 Subject: [ofa-general] ***SPAM*** [PATCH][TRIVIAL] opensm/osm_perfmgr.c: Eliminate duplicated error number Message-ID: <20090419145306.GA30667@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c index b5c7841..8d5ed97 100644 --- a/opensm/opensm/osm_perfmgr.c +++ b/opensm/opensm/osm_perfmgr.c @@ -632,7 +632,7 @@ static int sweep_hop_1(osm_sm_t * sm) CL_DISP_MSGID_NONE, &context); if (status != IB_SUCCESS) - OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 4C82: " + OSM_LOG(sm->p_log, OSM_LOG_ERROR, "ERR 4C84: " "Request for NodeInfo failed\n"); } break; From dorons at voltaire.com Sun Apr 19 09:08:52 2009 From: dorons at voltaire.com (Doron Shoham) Date: Sun, 19 Apr 2009 19:08:52 +0300 Subject: [ofa-general] osmtest fails with latest opensm Message-ID: <1240157332.5310.10.camel@dorons.voltaire.com> Hi, osmtest fails with latest opensm. 
commit number 3640a83d552f3da20236ee5f3ae5ab1835a68af1 osmtest -fc Command Line Arguments Done with args Flow = Create Inventory Apr 19 19:05:26 938129 [C7095910] 0x7f -> Setting log level to: 0x03 Apr 19 19:05:26 938370 [C7095910] 0x02 -> osm_vendor_init: 1000 pending umads specified Apr 19 19:05:26 952061 [C7095910] 0x02 -> osm_vendor_bind: Binding to port 0x2c9020022f019 Apr 19 19:05:26 968876 [C7095910] 0x02 -> osmtest_validate_sa_class_port_info: ----------------------------- SA Class Port Info: base_ver:1 class_ver:2 cap_mask:0x2602 cap_mask2:0x0 resp_time_val:0x10 ----------------------------- OSMTEST: TEST "Create Inventory" PASS osmtest -fa Command Line Arguments Done with args Flow = All Validations Apr 19 19:06:31 641676 [A30D5910] 0x7f -> Setting log level to: 0x03 Apr 19 19:06:31 641928 [A30D5910] 0x02 -> osm_vendor_init: 1000 pending umads specified Apr 19 19:06:31 655491 [A30D5910] 0x02 -> osm_vendor_bind: Binding to port 0x2c9020022f019 Apr 19 19:06:31 672279 [A30D5910] 0x02 -> osmtest_validate_sa_class_port_info: ----------------------------- SA Class Port Info: base_ver:1 class_ver:2 cap_mask:0x2602 cap_mask2:0x0 resp_time_val:0x10 ----------------------------- Apr 19 19:06:31 678017 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0002c90200258408, LID 0x2 Apr 19 19:06:31 678034 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0002c9020022f018, LID 0x1 Apr 19 19:06:31 678044 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f10400403619, LID 0x3 Apr 19 19:06:31 678052 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f1040398fe20, LID 0x4 Apr 19 19:06:31 678061 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f10403982e70, LID 0x5 Apr 19 19:06:31 678070 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f10403970798, LID 0x6 Apr 19 19:06:31 678078 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f10403982da4, LID 0x7 Apr 19 19:06:31 
678087 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f104003f1bbc, LID 0x8 Apr 19 19:06:31 678095 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f104003f1bbd, LID 0x9 Apr 19 19:06:31 678104 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f104039706dc, LID 0xA Apr 19 19:06:31 678211 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0002c9020022f018, LID 0x1 Apr 19 19:06:31 678310 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0002c90200258408, LID 0x2 Apr 19 19:06:31 678414 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f10400403619, LID 0x3 Apr 19 19:06:31 678513 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f1040398fe20, LID 0x4 Apr 19 19:06:31 678613 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f10403982e70, LID 0x5 Apr 19 19:06:31 678712 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f10403970798, LID 0x6 Apr 19 19:06:31 678810 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f10403982da4, LID 0x7 Apr 19 19:06:31 678914 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f104003f1bbc, LID 0x8 Apr 19 19:06:31 679014 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f104003f1bbd, LID 0x9 Apr 19 19:06:31 679111 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f104039706dc, LID 0xA Apr 19 19:06:31 679595 [A30D5910] 0x01 -> osmtest_validate_against_db: [[ ===== Expecting Errors - START ===== Apr 19 19:06:31 679698 [40D3F940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x0200 Apr 19 19:06:31 679711 [40D3F940] 0x01 -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Apr 19 19:06:31 679727 [A30D5910] 0x01 -> osmtest_get_multipath_rec: ERR 0069: ib_query failed (IB_REMOTE_ERROR) Apr 19 19:06:31 679740 [A30D5910] 0x01 -> osmtest_get_multipath_rec: Remote error = IB_SA_MAD_STATUS_REQ_INVALID Apr 19 19:06:31 
679748 [A30D5910] 0x01 -> osmtest_validate_against_db: Got error IB_REMOTE_ERROR Apr 19 19:06:31 679756 [A30D5910] 0x01 -> osmtest_validate_against_db: ===== Expecting Errors - END ===== ]] Apr 19 19:06:31 679764 [A30D5910] 0x01 -> osmtest_validate_against_db: [[ ===== Expecting Errors - START ===== Apr 19 19:06:31 679860 [40D3F940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x0200 Apr 19 19:06:31 679870 [40D3F940] 0x01 -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Apr 19 19:06:31 679888 [A30D5910] 0x01 -> osmtest_get_multipath_rec: ERR 0069: ib_query failed (IB_REMOTE_ERROR) Apr 19 19:06:31 679903 [A30D5910] 0x01 -> osmtest_get_multipath_rec: Remote error = IB_SA_MAD_STATUS_REQ_INVALID Apr 19 19:06:31 679912 [A30D5910] 0x01 -> osmtest_validate_against_db: Got error IB_REMOTE_ERROR Apr 19 19:06:31 679938 [A30D5910] 0x01 -> osmtest_validate_against_db: ===== Expecting Errors - END ===== ]] Apr 19 19:06:31 679949 [A30D5910] 0x01 -> osmtest_validate_against_db: [[ ===== Expecting Errors - START ===== Apr 19 19:06:31 680038 [40D3F940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x0500 Apr 19 19:06:31 680048 [40D3F940] 0x01 -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Apr 19 19:06:31 680063 [A30D5910] 0x01 -> osmtest_get_multipath_rec: ERR 0069: ib_query failed (IB_REMOTE_ERROR) Apr 19 19:06:31 680075 [A30D5910] 0x01 -> osmtest_get_multipath_rec: Remote error = IB_SA_MAD_STATUS_INVALID_GID Apr 19 19:06:31 680083 [A30D5910] 0x01 -> osmtest_validate_against_db: Got error IB_REMOTE_ERROR Apr 19 19:06:31 680090 [A30D5910] 0x01 -> osmtest_validate_against_db: ===== Expecting Errors - END ===== ]] Apr 19 19:06:31 680098 [A30D5910] 0x01 -> osmtest_validate_against_db: [[ ===== Expecting Errors - START ===== Apr 19 19:06:31 680184 [40D3F940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x0500 Apr 19 19:06:31 680193 [40D3F940] 0x01 -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Apr 19 
19:06:31 680207 [A30D5910] 0x01 -> osmtest_get_multipath_rec: ERR 0069: ib_query failed (IB_REMOTE_ERROR) Apr 19 19:06:31 680219 [A30D5910] 0x01 -> osmtest_get_multipath_rec: Remote error = IB_SA_MAD_STATUS_INVALID_GID Apr 19 19:06:31 680227 [A30D5910] 0x01 -> osmtest_validate_against_db: Got error IB_REMOTE_ERROR Apr 19 19:06:31 680235 [A30D5910] 0x01 -> osmtest_validate_against_db: ===== Expecting Errors - END ===== ]] Apr 19 19:06:31 681077 [A30D5910] 0x01 -> osmtest_validate_against_db: [[ ===== Expecting Errors - START ===== Apr 19 19:06:31 681170 [40D3F940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x0200 Apr 19 19:06:31 681180 [40D3F940] 0x01 -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Apr 19 19:06:31 681198 [A30D5910] 0x01 -> osmtest_get_pkeytbl_rec_by_lid: ERR 007F: ib_query failed (IB_REMOTE_ERROR) Apr 19 19:06:31 681208 [A30D5910] 0x01 -> osmtest_get_pkeytbl_rec_by_lid: Remote error = IB_SA_MAD_STATUS_REQ_INVALID Apr 19 19:06:31 681216 [A30D5910] 0x01 -> osmtest_validate_against_db: Got error IB_INSUFFICIENT_MEMORY Apr 19 19:06:31 681224 [A30D5910] 0x01 -> osmtest_validate_against_db: ===== Expecting Errors - END ===== ]] Apr 19 19:06:31 682245 [40D3F940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x000C Apr 19 19:06:31 682257 [40D3F940] 0x01 -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Apr 19 19:06:31 682273 [A30D5910] 0x01 -> osmtest_sminfo_record_request: ERR 008D: ib_query failed (IB_REMOTE_ERROR) Apr 19 19:06:31 682284 [A30D5910] 0x01 -> osmtest_sminfo_record_request: Remote error = IB_MAD_STATUS_UNSUP_METHOD_ATTR Apr 19 19:06:31 682292 [A30D5910] 0x01 -> osmtest_validate_against_db: IS EXPECTED ERROR ^^^^ Apr 19 19:06:31 683923 [40D3F940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x000C Apr 19 19:06:31 683934 [40D3F940] 0x01 -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Apr 19 19:06:31 683950 [A30D5910] 0x01 -> osmtest_informinfo_request: ERR 008F: 
ib_query failed (IB_REMOTE_ERROR) Apr 19 19:06:31 683959 [A30D5910] 0x01 -> osmtest_informinfo_request: Remote error = IB_MAD_STATUS_UNSUP_METHOD_ATTR Apr 19 19:06:31 683967 [A30D5910] 0x01 -> osmtest_validate_against_db: InformInfoRecord IS EXPECTED ERROR ^^^^ Apr 19 19:06:31 684114 [40D3F940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x000C Apr 19 19:06:31 684123 [40D3F940] 0x01 -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Apr 19 19:06:31 684139 [A30D5910] 0x01 -> osmtest_informinfo_request: ERR 008F: ib_query failed (IB_REMOTE_ERROR) Apr 19 19:06:31 684148 [A30D5910] 0x01 -> osmtest_informinfo_request: Remote error = IB_MAD_STATUS_UNSUP_METHOD_ATTR Apr 19 19:06:31 684156 [A30D5910] 0x01 -> osmtest_validate_against_db: InformInfo IS EXPECTED ERROR ^^^^ Apr 19 19:06:31 684240 [40D3F940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x0200 Apr 19 19:06:31 684265 [40D3F940] 0x01 -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Apr 19 19:06:31 684282 [A30D5910] 0x01 -> osmtest_informinfo_request: ERR 008F: ib_query failed (IB_REMOTE_ERROR) Apr 19 19:06:31 684291 [A30D5910] 0x01 -> osmtest_informinfo_request: Remote error = IB_SA_MAD_STATUS_REQ_INVALID Apr 19 19:06:31 684299 [A30D5910] 0x01 -> osmtest_validate_against_db: InformInfo UnSubscribe IS EXPECTED ERROR ^^^^ Apr 19 19:06:31 685930 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x2 to DLID 0x2 Apr 19 19:06:31 686024 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x3 to DLID 0x2 Apr 19 19:06:31 686112 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x1 to DLID 0x2 Apr 19 19:06:31 686201 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x4 to DLID 0x2 Apr 19 19:06:31 686289 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x5 to DLID 0x2 Apr 19 19:06:31 686376 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x6 to DLID 0x2 Apr 19 19:06:31 686464 
[A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x7 to DLID 0x2 Apr 19 19:06:31 686552 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x8 to DLID 0x2 Apr 19 19:06:31 686640 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x9 to DLID 0x2 Apr 19 19:06:31 686728 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0xA to DLID 0x2 Apr 19 19:06:31 686816 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x2 to DLID 0x3 Apr 19 19:06:31 686907 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x3 to DLID 0x3 Apr 19 19:06:31 686996 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x1 to DLID 0x3 Apr 19 19:06:31 687083 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x4 to DLID 0x3 Apr 19 19:06:31 687172 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x5 to DLID 0x3 Apr 19 19:06:31 687259 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x6 to DLID 0x3 Apr 19 19:06:31 687346 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x7 to DLID 0x3 Apr 19 19:06:31 687433 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x8 to DLID 0x3 Apr 19 19:06:31 687521 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x9 to DLID 0x3 Apr 19 19:06:31 687609 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0xA to DLID 0x3 Apr 19 19:06:31 687699 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x2 to DLID 0x1 Apr 19 19:06:31 687788 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x3 to DLID 0x1 Apr 19 19:06:31 687879 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x1 to DLID 0x1 Apr 19 19:06:31 687970 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x4 to DLID 0x1 Apr 19 19:06:31 688057 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x5 to DLID 0x1 Apr 19 19:06:31 688146 
[A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x6 to DLID 0x1 Apr 19 19:06:31 688234 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x7 to DLID 0x1 Apr 19 19:06:31 688323 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x8 to DLID 0x1 Apr 19 19:06:31 688411 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x9 to DLID 0x1 Apr 19 19:06:31 688498 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0xA to DLID 0x1 Apr 19 19:06:31 688587 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x2 to DLID 0x4 Apr 19 19:06:31 688675 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x3 to DLID 0x4 Apr 19 19:06:31 688764 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x1 to DLID 0x4 Apr 19 19:06:31 688852 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x4 to DLID 0x4 Apr 19 19:06:31 688947 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x5 to DLID 0x4 Apr 19 19:06:31 689056 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x6 to DLID 0x4 Apr 19 19:06:31 689143 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x7 to DLID 0x4 Apr 19 19:06:31 689232 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x8 to DLID 0x4 Apr 19 19:06:31 689321 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x9 to DLID 0x4 Apr 19 19:06:31 689409 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0xA to DLID 0x4 Apr 19 19:06:31 689497 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x2 to DLID 0x5 Apr 19 19:06:31 689586 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x3 to DLID 0x5 Apr 19 19:06:31 689675 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x1 to DLID 0x5 Apr 19 19:06:31 689765 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x4 to DLID 0x5 Apr 19 19:06:31 689853 
[A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x5 to DLID 0x5 Apr 19 19:06:31 689942 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x6 to DLID 0x5 Apr 19 19:06:31 690031 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x7 to DLID 0x5 Apr 19 19:06:31 690120 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x8 to DLID 0x5 Apr 19 19:06:31 690209 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x9 to DLID 0x5 Apr 19 19:06:31 690299 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0xA to DLID 0x5 Apr 19 19:06:31 690387 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x2 to DLID 0x6 Apr 19 19:06:31 690475 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x3 to DLID 0x6 Apr 19 19:06:31 690563 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x1 to DLID 0x6 Apr 19 19:06:31 690651 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x4 to DLID 0x6 Apr 19 19:06:31 690740 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x5 to DLID 0x6 Apr 19 19:06:31 690829 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x6 to DLID 0x6 Apr 19 19:06:31 690923 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x7 to DLID 0x6 Apr 19 19:06:31 691012 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x8 to DLID 0x6 Apr 19 19:06:31 691100 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x9 to DLID 0x6 Apr 19 19:06:31 691187 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0xA to DLID 0x6 Apr 19 19:06:31 691275 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x2 to DLID 0x7 Apr 19 19:06:31 691364 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x3 to DLID 0x7 Apr 19 19:06:31 691452 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x1 to DLID 0x7 Apr 19 19:06:31 691540 
[A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x4 to DLID 0x7 Apr 19 19:06:31 691629 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x5 to DLID 0x7 Apr 19 19:06:31 691717 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x6 to DLID 0x7 Apr 19 19:06:31 691806 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x7 to DLID 0x7 Apr 19 19:06:31 691897 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x8 to DLID 0x7 Apr 19 19:06:31 691986 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x9 to DLID 0x7 Apr 19 19:06:31 692076 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0xA to DLID 0x7 Apr 19 19:06:31 692164 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x2 to DLID 0x8 Apr 19 19:06:31 692252 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x3 to DLID 0x8 Apr 19 19:06:31 692339 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x1 to DLID 0x8 Apr 19 19:06:31 692428 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x4 to DLID 0x8 Apr 19 19:06:31 692531 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x5 to DLID 0x8 Apr 19 19:06:31 692620 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x6 to DLID 0x8 Apr 19 19:06:31 692710 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x7 to DLID 0x8 Apr 19 19:06:31 692798 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x8 to DLID 0x8 Apr 19 19:06:31 692889 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x9 to DLID 0x8 Apr 19 19:06:31 692978 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0xA to DLID 0x8 Apr 19 19:06:31 693068 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x2 to DLID 0x9 Apr 19 19:06:31 693158 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x3 to DLID 0x9 Apr 19 19:06:31 693246 
[A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x1 to DLID 0x9 Apr 19 19:06:31 693334 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x4 to DLID 0x9 Apr 19 19:06:31 693423 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x5 to DLID 0x9 Apr 19 19:06:31 693511 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x6 to DLID 0x9 Apr 19 19:06:31 693600 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x7 to DLID 0x9 Apr 19 19:06:31 693691 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x8 to DLID 0x9 Apr 19 19:06:31 693781 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x9 to DLID 0x9 Apr 19 19:06:31 693870 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0xA to DLID 0x9 Apr 19 19:06:31 693962 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x2 to DLID 0xA Apr 19 19:06:31 694052 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x3 to DLID 0xA Apr 19 19:06:31 694140 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x1 to DLID 0xA Apr 19 19:06:31 694228 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x4 to DLID 0xA Apr 19 19:06:31 694316 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x5 to DLID 0xA Apr 19 19:06:31 694403 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x6 to DLID 0xA Apr 19 19:06:31 694493 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x7 to DLID 0xA Apr 19 19:06:31 694581 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x8 to DLID 0xA Apr 19 19:06:31 694670 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x9 to DLID 0xA Apr 19 19:06:31 694759 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0xA to DLID 0xA Apr 19 19:06:31 694938 [40D3F940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x0300 Apr 19 19:06:31 694948 [40D3F940] 0x01 
-> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Apr 19 19:06:31 694965 [A30D5910] 0x01 -> osmtest_get_port_rec_by_num: ERR 0078: ib_query failed (IB_REMOTE_ERROR) Apr 19 19:06:31 694974 [A30D5910] 0x01 -> osmtest_get_port_rec_by_num: Remote error = IB_SA_MAD_STATUS_NO_RECORDS Apr 19 19:06:31 694983 [A30D5910] 0x01 -> osmtest_validate_port_data: Checking port LID 0x0, Num 0x0 Apr 19 19:06:31 694993 [A30D5910] 0x01 -> osmtest_validate_port_data: ERR 0039: Field mismatch port LID 0x0 Num:0x0 Expected m_key 0x0800000000000000, received 0x0000000000000000 Apr 19 19:06:31 695002 [A30D5910] 0x01 -> osmtest_validate_single_port_rec_lid: ERR 0109: osmtest_validate_port_data failed (IB_ERROR) Apr 19 19:06:31 695011 [A30D5910] 0x01 -> osmtest_validate_single_port_recs: ERR 011B: osmtest_validate_single_port_rec_lid (IB_ERROR) Apr 19 19:06:31 695019 [A30D5910] 0x01 -> osmtest_run: ERR 0146: SA validation database failure (IB_ERROR) OSMTEST: TEST "All Validations" FAIL From arlin.r.davis at intel.com Sun Apr 19 10:09:51 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Sun, 19 Apr 2009 10:09:51 -0700 Subject: [ofa-general] [PATCH] uDAPL v2: dapltest, adjust next port number for number of threads to avoid duplication In-Reply-To: <517c62fb0904190623n7a42f670ia90e01e6655b65fb@mail.gmail.com> References: <517c62fb0904190623n7a42f670ia90e01e6655b65fb@mail.gmail.com> Message-ID: Arkady, Actually, we encountered these problems running back-to-back multi-threaded transaction tests on Windows while trying to get uDAPL to 100% common code across Windows and Linux. Differences in the schedulers brought out some of these problems, along with extended runs with the server running continuously. We also run dapltest with multiple clients and one server, so we are pushing this pretty hard.
-arlin ________________________________ From: arkady kanevsky [mailto:arkady.kanevsky at gmail.com] Sent: Sunday, April 19, 2009 6:24 AM To: Davis, Arlin R Cc: OpenIB Subject: Re: [ofa-general] [PATCH] uDAPL v2: dapltest, adjust next port number for number of threads to avoid duplication Arlin, did we encounter this issue at interop? I do not recall seeing it. For IB, does the wait period between assigning the same QP# handle it? Thanks, Arkady On Sun, Apr 19, 2009 at 4:06 AM, Davis, Arlin R wrote: To avoid duplicating port numbers between different tests, the next port number to use must increment based on the number of endpoints per thread * the number of threads. Signed-off-by: Sean Hefty --- test/dapltest/test/dapl_server.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/test/dapltest/test/dapl_server.c b/test/dapltest/test/dapl_server.c index d589e7b..4d386fe 100644 --- a/test/dapltest/test/dapl_server.c +++ b/test/dapltest/test/dapl_server.c @@ -480,6 +480,9 @@ DT_cs_Server (Params_t * params_ptr) case TRANSACTION_TEST: { /* create a thread to handle this pt_ptr; */ + ps_ptr->NextPortNumber += + (pt_ptr->Params.u.Transaction_Cmd.eps_per_thread - 1) * + pt_ptr->Client_Info.total_threads; DT_Tdep_PT_Debug (1,(phead,"%s: Creating Transaction Test Thread\n", module)); pt_ptr->thread = DT_Thread_Create (pt_ptr, DT_Transaction_Test_Server, -- 1.5.2.5 -- Cheers, Arkady Kanevsky From worleys at gmail.com Sun Apr 19 11:45:25 2009 From: worleys at gmail.com (Chris Worley) Date: Sun, 19 Apr 2009 12:45:25 -0600 Subject: [ofa-general] ***SPAM*** Will this brick my switch?
Message-ID: I've got a QDR switch from Mellanox that's a few months old... and has no markings, but looks like the only QDR switch described on their web pages. I went to burn the firmware, but got the message that I shouldn't: # ./flint -d lid-6 -i ~/fw-IS4-rel-7_2_000-MTS3600Q_A1.bin b Current FW version on flash: 7.0.142 New FW version: 7.2.0 You are about to replace current PSID on flash - "MT_0C00110003" with a different PSID - "MT_0C20110003". Note: It is highly recommended not to change the PSID. Do you want to continue ? (y/n) [n] : Is this okay? Thanks, Chris From rdreier at cisco.com Sun Apr 19 14:13:54 2009 From: rdreier at cisco.com (Roland Dreier) Date: Sun, 19 Apr 2009 14:13:54 -0700 Subject: [ofa-general] Re: Subject: [PATCH] RDMA/nes: Update iw_nes version In-Reply-To: <20090413152841.GA3648@ctung-MOBL> (Chien Tung's message of "Mon, 13 Apr 2009 10:28:41 -0500") References: <20090413152841.GA3648@ctung-MOBL> Message-ID: Which patches are you expecting to be applied to mark for 1.5.0.0? For example, I see Faisal just sent a patch right after yours -- should I apply that "application hang" one too? [somehow you got "Subject:" into your subject line too :)] From chien.tin.tung at intel.com Sun Apr 19 19:56:09 2009 From: chien.tin.tung at intel.com (Tung, Chien Tin) Date: Sun, 19 Apr 2009 19:56:09 -0700 Subject: [ofa-general] RE: Subject: [PATCH] RDMA/nes: Update iw_nes version In-Reply-To: References: <20090413152841.GA3648@ctung-MOBL> Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA383033CDCB03C@azsmsx501.amr.corp.intel.com> >Which patches are you expecting to be applied to mark for 1.5.0.0? For >example, I see Faisal just sent a patch right after yours -- should I >apply that "application hang" one too? The intention is to mark the driver in 2.6.30 as 1.5.0.0. Since this will be the only patch to update the version number, it doesn't matter that much. 
>[somehow you got "Subject:" into your subject line too :)] I reformat git-format-patch output for email with mutt/cygwin. Every manual step is a chance for error such as this. I will look into using git-send-email to cut down on mistakes. Chien From dorons at voltaire.com Sun Apr 19 22:41:29 2009 From: dorons at voltaire.com (Doron Shoham) Date: Mon, 20 Apr 2009 08:41:29 +0300 Subject: [ofa-general] osmtest fails with latest opensm Message-ID: <49EC0B09.5070003@voltaire.com> Hi, osmtest fails with latest opensm. commit number 3640a83d552f3da20236ee5f3ae5ab1835a68af1 osmtest -fc Command Line Arguments Done with args Flow = Create Inventory Apr 19 19:05:26 938129 [C7095910] 0x7f -> Setting log level to: 0x03 Apr 19 19:05:26 938370 [C7095910] 0x02 -> osm_vendor_init: 1000 pending umads specified Apr 19 19:05:26 952061 [C7095910] 0x02 -> osm_vendor_bind: Binding to port 0x2c9020022f019 Apr 19 19:05:26 968876 [C7095910] 0x02 -> osmtest_validate_sa_class_port_info: ----------------------------- SA Class Port Info: base_ver:1 class_ver:2 cap_mask:0x2602 cap_mask2:0x0 resp_time_val:0x10 ----------------------------- OSMTEST: TEST "Create Inventory" PASS osmtest -fa Command Line Arguments Done with args Flow = All Validations Apr 19 19:06:31 641676 [A30D5910] 0x7f -> Setting log level to: 0x03 Apr 19 19:06:31 641928 [A30D5910] 0x02 -> osm_vendor_init: 1000 pending umads specified Apr 19 19:06:31 655491 [A30D5910] 0x02 -> osm_vendor_bind: Binding to port 0x2c9020022f019 Apr 19 19:06:31 672279 [A30D5910] 0x02 -> osmtest_validate_sa_class_port_info: ----------------------------- SA Class Port Info: base_ver:1 class_ver:2 cap_mask:0x2602 cap_mask2:0x0 resp_time_val:0x10 ----------------------------- Apr 19 19:06:31 678017 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0002c90200258408, LID 0x2 Apr 19 19:06:31 678034 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0002c9020022f018, LID 0x1 Apr 19 19:06:31 678044 [A30D5910] 0x01 -> 
osmtest_validate_node_data: Checking node 0x0008f10400403619, LID 0x3 Apr 19 19:06:31 678052 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f1040398fe20, LID 0x4 Apr 19 19:06:31 678061 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f10403982e70, LID 0x5 Apr 19 19:06:31 678070 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f10403970798, LID 0x6 Apr 19 19:06:31 678078 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f10403982da4, LID 0x7 Apr 19 19:06:31 678087 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f104003f1bbc, LID 0x8 Apr 19 19:06:31 678095 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f104003f1bbd, LID 0x9 Apr 19 19:06:31 678104 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f104039706dc, LID 0xA Apr 19 19:06:31 678211 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0002c9020022f018, LID 0x1 Apr 19 19:06:31 678310 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0002c90200258408, LID 0x2 Apr 19 19:06:31 678414 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f10400403619, LID 0x3 Apr 19 19:06:31 678513 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f1040398fe20, LID 0x4 Apr 19 19:06:31 678613 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f10403982e70, LID 0x5 Apr 19 19:06:31 678712 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f10403970798, LID 0x6 Apr 19 19:06:31 678810 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f10403982da4, LID 0x7 Apr 19 19:06:31 678914 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f104003f1bbc, LID 0x8 Apr 19 19:06:31 679014 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f104003f1bbd, LID 0x9 Apr 19 19:06:31 679111 [A30D5910] 0x01 -> osmtest_validate_node_data: Checking node 0x0008f104039706dc, LID 0xA Apr 19 19:06:31 679595 
[A30D5910] 0x01 -> osmtest_validate_against_db: [[ ===== Expecting Errors - START ===== Apr 19 19:06:31 679698 [40D3F940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x0200 Apr 19 19:06:31 679711 [40D3F940] 0x01 -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Apr 19 19:06:31 679727 [A30D5910] 0x01 -> osmtest_get_multipath_rec: ERR 0069: ib_query failed (IB_REMOTE_ERROR) Apr 19 19:06:31 679740 [A30D5910] 0x01 -> osmtest_get_multipath_rec: Remote error = IB_SA_MAD_STATUS_REQ_INVALID Apr 19 19:06:31 679748 [A30D5910] 0x01 -> osmtest_validate_against_db: Got error IB_REMOTE_ERROR Apr 19 19:06:31 679756 [A30D5910] 0x01 -> osmtest_validate_against_db: ===== Expecting Errors - END ===== ]] Apr 19 19:06:31 679764 [A30D5910] 0x01 -> osmtest_validate_against_db: [[ ===== Expecting Errors - START ===== Apr 19 19:06:31 679860 [40D3F940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x0200 Apr 19 19:06:31 679870 [40D3F940] 0x01 -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Apr 19 19:06:31 679888 [A30D5910] 0x01 -> osmtest_get_multipath_rec: ERR 0069: ib_query failed (IB_REMOTE_ERROR) Apr 19 19:06:31 679903 [A30D5910] 0x01 -> osmtest_get_multipath_rec: Remote error = IB_SA_MAD_STATUS_REQ_INVALID Apr 19 19:06:31 679912 [A30D5910] 0x01 -> osmtest_validate_against_db: Got error IB_REMOTE_ERROR Apr 19 19:06:31 679938 [A30D5910] 0x01 -> osmtest_validate_against_db: ===== Expecting Errors - END ===== ]] Apr 19 19:06:31 679949 [A30D5910] 0x01 -> osmtest_validate_against_db: [[ ===== Expecting Errors - START ===== Apr 19 19:06:31 680038 [40D3F940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x0500 Apr 19 19:06:31 680048 [40D3F940] 0x01 -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Apr 19 19:06:31 680063 [A30D5910] 0x01 -> osmtest_get_multipath_rec: ERR 0069: ib_query failed (IB_REMOTE_ERROR) Apr 19 19:06:31 680075 [A30D5910] 0x01 -> osmtest_get_multipath_rec: Remote error = 
IB_SA_MAD_STATUS_INVALID_GID Apr 19 19:06:31 680083 [A30D5910] 0x01 -> osmtest_validate_against_db: Got error IB_REMOTE_ERROR Apr 19 19:06:31 680090 [A30D5910] 0x01 -> osmtest_validate_against_db: ===== Expecting Errors - END ===== ]] Apr 19 19:06:31 680098 [A30D5910] 0x01 -> osmtest_validate_against_db: [[ ===== Expecting Errors - START ===== Apr 19 19:06:31 680184 [40D3F940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x0500 Apr 19 19:06:31 680193 [40D3F940] 0x01 -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Apr 19 19:06:31 680207 [A30D5910] 0x01 -> osmtest_get_multipath_rec: ERR 0069: ib_query failed (IB_REMOTE_ERROR) Apr 19 19:06:31 680219 [A30D5910] 0x01 -> osmtest_get_multipath_rec: Remote error = IB_SA_MAD_STATUS_INVALID_GID Apr 19 19:06:31 680227 [A30D5910] 0x01 -> osmtest_validate_against_db: Got error IB_REMOTE_ERROR Apr 19 19:06:31 680235 [A30D5910] 0x01 -> osmtest_validate_against_db: ===== Expecting Errors - END ===== ]] Apr 19 19:06:31 681077 [A30D5910] 0x01 -> osmtest_validate_against_db: [[ ===== Expecting Errors - START ===== Apr 19 19:06:31 681170 [40D3F940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x0200 Apr 19 19:06:31 681180 [40D3F940] 0x01 -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Apr 19 19:06:31 681198 [A30D5910] 0x01 -> osmtest_get_pkeytbl_rec_by_lid: ERR 007F: ib_query failed (IB_REMOTE_ERROR) Apr 19 19:06:31 681208 [A30D5910] 0x01 -> osmtest_get_pkeytbl_rec_by_lid: Remote error = IB_SA_MAD_STATUS_REQ_INVALID Apr 19 19:06:31 681216 [A30D5910] 0x01 -> osmtest_validate_against_db: Got error IB_INSUFFICIENT_MEMORY Apr 19 19:06:31 681224 [A30D5910] 0x01 -> osmtest_validate_against_db: ===== Expecting Errors - END ===== ]] Apr 19 19:06:31 682245 [40D3F940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x000C Apr 19 19:06:31 682257 [40D3F940] 0x01 -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Apr 19 19:06:31 682273 [A30D5910] 0x01 -> 
osmtest_sminfo_record_request: ERR 008D: ib_query failed (IB_REMOTE_ERROR) Apr 19 19:06:31 682284 [A30D5910] 0x01 -> osmtest_sminfo_record_request: Remote error = IB_MAD_STATUS_UNSUP_METHOD_ATTR Apr 19 19:06:31 682292 [A30D5910] 0x01 -> osmtest_validate_against_db: IS EXPECTED ERROR ^^^^ Apr 19 19:06:31 683923 [40D3F940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x000C Apr 19 19:06:31 683934 [40D3F940] 0x01 -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Apr 19 19:06:31 683950 [A30D5910] 0x01 -> osmtest_informinfo_request: ERR 008F: ib_query failed (IB_REMOTE_ERROR) Apr 19 19:06:31 683959 [A30D5910] 0x01 -> osmtest_informinfo_request: Remote error = IB_MAD_STATUS_UNSUP_METHOD_ATTR Apr 19 19:06:31 683967 [A30D5910] 0x01 -> osmtest_validate_against_db: InformInfoRecord IS EXPECTED ERROR ^^^^ Apr 19 19:06:31 684114 [40D3F940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x000C Apr 19 19:06:31 684123 [40D3F940] 0x01 -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Apr 19 19:06:31 684139 [A30D5910] 0x01 -> osmtest_informinfo_request: ERR 008F: ib_query failed (IB_REMOTE_ERROR) Apr 19 19:06:31 684148 [A30D5910] 0x01 -> osmtest_informinfo_request: Remote error = IB_MAD_STATUS_UNSUP_METHOD_ATTR Apr 19 19:06:31 684156 [A30D5910] 0x01 -> osmtest_validate_against_db: InformInfo IS EXPECTED ERROR ^^^^ Apr 19 19:06:31 684240 [40D3F940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x0200 Apr 19 19:06:31 684265 [40D3F940] 0x01 -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Apr 19 19:06:31 684282 [A30D5910] 0x01 -> osmtest_informinfo_request: ERR 008F: ib_query failed (IB_REMOTE_ERROR) Apr 19 19:06:31 684291 [A30D5910] 0x01 -> osmtest_informinfo_request: Remote error = IB_SA_MAD_STATUS_REQ_INVALID Apr 19 19:06:31 684299 [A30D5910] 0x01 -> osmtest_validate_against_db: InformInfo UnSubscribe IS EXPECTED ERROR ^^^^ Apr 19 19:06:31 685930 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path 
SLID 0x2 to DLID 0x2 Apr 19 19:06:31 686024 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x3 to DLID 0x2 Apr 19 19:06:31 686112 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x1 to DLID 0x2 Apr 19 19:06:31 686201 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x4 to DLID 0x2 Apr 19 19:06:31 686289 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x5 to DLID 0x2 Apr 19 19:06:31 686376 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x6 to DLID 0x2 Apr 19 19:06:31 686464 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x7 to DLID 0x2 Apr 19 19:06:31 686552 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x8 to DLID 0x2 Apr 19 19:06:31 686640 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x9 to DLID 0x2 Apr 19 19:06:31 686728 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0xA to DLID 0x2 Apr 19 19:06:31 686816 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x2 to DLID 0x3 Apr 19 19:06:31 686907 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x3 to DLID 0x3 Apr 19 19:06:31 686996 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x1 to DLID 0x3 Apr 19 19:06:31 687083 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x4 to DLID 0x3 Apr 19 19:06:31 687172 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x5 to DLID 0x3 Apr 19 19:06:31 687259 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x6 to DLID 0x3 Apr 19 19:06:31 687346 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x7 to DLID 0x3 Apr 19 19:06:31 687433 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x8 to DLID 0x3 Apr 19 19:06:31 687521 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x9 to DLID 0x3 Apr 19 19:06:31 687609 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 
0xA to DLID 0x3 Apr 19 19:06:31 687699 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x2 to DLID 0x1 Apr 19 19:06:31 687788 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x3 to DLID 0x1 Apr 19 19:06:31 687879 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x1 to DLID 0x1 Apr 19 19:06:31 687970 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x4 to DLID 0x1 Apr 19 19:06:31 688057 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x5 to DLID 0x1 Apr 19 19:06:31 688146 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x6 to DLID 0x1 Apr 19 19:06:31 688234 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x7 to DLID 0x1 Apr 19 19:06:31 688323 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x8 to DLID 0x1 Apr 19 19:06:31 688411 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x9 to DLID 0x1 Apr 19 19:06:31 688498 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0xA to DLID 0x1 Apr 19 19:06:31 688587 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x2 to DLID 0x4 Apr 19 19:06:31 688675 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x3 to DLID 0x4 Apr 19 19:06:31 688764 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x1 to DLID 0x4 Apr 19 19:06:31 688852 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x4 to DLID 0x4 Apr 19 19:06:31 688947 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x5 to DLID 0x4 Apr 19 19:06:31 689056 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x6 to DLID 0x4 Apr 19 19:06:31 689143 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x7 to DLID 0x4 Apr 19 19:06:31 689232 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x8 to DLID 0x4 Apr 19 19:06:31 689321 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x9 
to DLID 0x4 Apr 19 19:06:31 689409 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0xA to DLID 0x4 Apr 19 19:06:31 689497 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x2 to DLID 0x5 Apr 19 19:06:31 689586 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x3 to DLID 0x5 Apr 19 19:06:31 689675 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x1 to DLID 0x5 Apr 19 19:06:31 689765 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x4 to DLID 0x5 Apr 19 19:06:31 689853 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x5 to DLID 0x5 Apr 19 19:06:31 689942 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x6 to DLID 0x5 Apr 19 19:06:31 690031 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x7 to DLID 0x5 Apr 19 19:06:31 690120 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x8 to DLID 0x5 Apr 19 19:06:31 690209 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x9 to DLID 0x5 Apr 19 19:06:31 690299 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0xA to DLID 0x5 Apr 19 19:06:31 690387 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x2 to DLID 0x6 Apr 19 19:06:31 690475 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x3 to DLID 0x6 Apr 19 19:06:31 690563 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x1 to DLID 0x6 Apr 19 19:06:31 690651 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x4 to DLID 0x6 Apr 19 19:06:31 690740 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x5 to DLID 0x6 Apr 19 19:06:31 690829 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x6 to DLID 0x6 Apr 19 19:06:31 690923 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x7 to DLID 0x6 Apr 19 19:06:31 691012 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x8 to 
DLID 0x6 Apr 19 19:06:31 691100 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x9 to DLID 0x6 Apr 19 19:06:31 691187 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0xA to DLID 0x6 Apr 19 19:06:31 691275 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x2 to DLID 0x7 Apr 19 19:06:31 691364 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x3 to DLID 0x7 Apr 19 19:06:31 691452 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x1 to DLID 0x7 Apr 19 19:06:31 691540 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x4 to DLID 0x7 Apr 19 19:06:31 691629 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x5 to DLID 0x7 Apr 19 19:06:31 691717 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x6 to DLID 0x7 Apr 19 19:06:31 691806 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x7 to DLID 0x7 Apr 19 19:06:31 691897 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x8 to DLID 0x7 Apr 19 19:06:31 691986 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x9 to DLID 0x7 Apr 19 19:06:31 692076 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0xA to DLID 0x7 Apr 19 19:06:31 692164 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x2 to DLID 0x8 Apr 19 19:06:31 692252 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x3 to DLID 0x8 Apr 19 19:06:31 692339 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x1 to DLID 0x8 Apr 19 19:06:31 692428 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x4 to DLID 0x8 Apr 19 19:06:31 692531 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x5 to DLID 0x8 Apr 19 19:06:31 692620 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x6 to DLID 0x8 Apr 19 19:06:31 692710 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x7 to DLID 
0x8 Apr 19 19:06:31 692798 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x8 to DLID 0x8 Apr 19 19:06:31 692889 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x9 to DLID 0x8 Apr 19 19:06:31 692978 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0xA to DLID 0x8 Apr 19 19:06:31 693068 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x2 to DLID 0x9 Apr 19 19:06:31 693158 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x3 to DLID 0x9 Apr 19 19:06:31 693246 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x1 to DLID 0x9 Apr 19 19:06:31 693334 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x4 to DLID 0x9 Apr 19 19:06:31 693423 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x5 to DLID 0x9 Apr 19 19:06:31 693511 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x6 to DLID 0x9 Apr 19 19:06:31 693600 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x7 to DLID 0x9 Apr 19 19:06:31 693691 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x8 to DLID 0x9 Apr 19 19:06:31 693781 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x9 to DLID 0x9 Apr 19 19:06:31 693870 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0xA to DLID 0x9 Apr 19 19:06:31 693962 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x2 to DLID 0xA Apr 19 19:06:31 694052 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x3 to DLID 0xA Apr 19 19:06:31 694140 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x1 to DLID 0xA Apr 19 19:06:31 694228 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x4 to DLID 0xA Apr 19 19:06:31 694316 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x5 to DLID 0xA Apr 19 19:06:31 694403 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x6 to DLID 0xA 
Apr 19 19:06:31 694493 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x7 to DLID 0xA Apr 19 19:06:31 694581 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x8 to DLID 0xA Apr 19 19:06:31 694670 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x9 to DLID 0xA Apr 19 19:06:31 694759 [A30D5910] 0x01 -> osmtest_validate_path_data: Checking path SLID 0xA to DLID 0xA Apr 19 19:06:31 694938 [40D3F940] 0x01 -> __osmv_sa_mad_rcv_cb: ERR 5501: Remote error:0x0300 Apr 19 19:06:31 694948 [40D3F940] 0x01 -> osmtest_query_res_cb: ERR 0003: Error on query (IB_REMOTE_ERROR) Apr 19 19:06:31 694965 [A30D5910] 0x01 -> osmtest_get_port_rec_by_num: ERR 0078: ib_query failed (IB_REMOTE_ERROR) Apr 19 19:06:31 694974 [A30D5910] 0x01 -> osmtest_get_port_rec_by_num: Remote error = IB_SA_MAD_STATUS_NO_RECORDS Apr 19 19:06:31 694983 [A30D5910] 0x01 -> osmtest_validate_port_data: Checking port LID 0x0, Num 0x0 Apr 19 19:06:31 694993 [A30D5910] 0x01 -> osmtest_validate_port_data: ERR 0039: Field mismatch port LID 0x0 Num:0x0 Expected m_key 0x0800000000000000, received 0x0000000000000000 Apr 19 19:06:31 695002 [A30D5910] 0x01 -> osmtest_validate_single_port_rec_lid: ERR 0109: osmtest_validate_port_data failed (IB_ERROR) Apr 19 19:06:31 695011 [A30D5910] 0x01 -> osmtest_validate_single_port_recs: ERR 011B: osmtest_validate_single_port_rec_lid (IB_ERROR) Apr 19 19:06:31 695019 [A30D5910] 0x01 -> osmtest_run: ERR 0146: SA validation database failure (IB_ERROR) OSMTEST: TEST "All Validations" FAIL Thanks, Doron From dotanba at gmail.com Sun Apr 19 23:37:23 2009 From: dotanba at gmail.com (Dotan Barak) Date: Mon, 20 Apr 2009 09:37:23 +0300 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Will this brick my switch? In-Reply-To: References: Message-ID: <2f3bf9a60904192337v6c816036yfda4a708947ccc9e@mail.gmail.com> On Sun, Apr 19, 2009 at 9:45 PM, Chris Worley wrote: > I've got a QDR switch from Mellanox that's a few months old... 
and has > no markings, but looks like the only QDR switch described on their web > pages. > > I went to burn the firmware, but got the message that I shouldn't: > > # ./flint -d lid-6 -i ~/fw-IS4-rel-7_2_000-MTS3600Q_A1.bin b > > Current FW version on flash: 7.0.142 > New FW version: 7.2.0 > > You are about to replace current PSID on flash - "MT_0C00110003" > with a different PSID - "MT_0C20110003". > Note: It is highly recommended not to change the PSID. > > Do you want to continue ? (y/n) [n] : > > Is this okay? This means that the configuration file that the burning tool is using is different than the one in your switch. It is highly advised not to do it; instead, find a firmware image that has your configuration file (or find the configuration file for your switch). Dotan From vlad at lists.openfabrics.org Mon Apr 20 03:31:44 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 20 Apr 2009 03:31:44 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090420-0200 daily build status Message-ID: <20090420103144.F3AD3E6110B@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on 
x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From dennis.portello at gmail.com Mon Apr 20 04:21:15 2009 From: dennis.portello at gmail.com (Dennis Portello) Date: Mon, 20 Apr 2009 07:21:15 -0400 Subject: [ofa-general] ***SPAM*** Re: IB Bonding errors with recent kernel Message-ID: <52436c7f0904200421s53d65a31vdcbb26babd9be196@mail.gmail.com> Hello, I seem to be experiencing the exact issue discussed below (back in December). I'm using the 2.6.27 kernel and the bonding drivers available in that kernel. Was there ever a solution or patch to solve this? I have been using the ib-bond scripts as well, but using other approaches like standard OS tools or adding the bond through sysfs all seem to have the same results. Regular TCP/IP unicast works, though dmesg is full of warning about multicast failing. Multicast does not work at all. Any hints or suggestions would be greatly appreciated. 
Best regards, Dennis Portello > Or Gerlitz wrote: >>> If I am not mistaken the issue you mention is a little different from the one I pointed out. >>> Without bonding I see the following: >>> kernel: ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11 >>> However, with bonding what I see is: >>> ib0: multicast join failed for 0001:0000:0000:0000:0000:0000:0000:0000, status -22 >> >> Please note that -11 is EAGAIN (try again) and -22 is EINVAL (invalid >> argument). So you can get EAGAIN when the underlying core SA agent is >> not ready to send SA queries, while you get EINVAL when attempting to >> join on a junk MGID. I am confident that for a long time we have seen joins on >> junk MGIDs and it has been reported on this list (google...) in the >> past, no resolution yet. > > Or, > > I looked through the mailing list going back more than a year. The closest > I can find to this issue (-EINVAL) was when you reported problems with junk MGID on a > child interface (and that works properly now). > > I agree that the -EAGAIN problem has been known for some time now. However, this issue with > IPoIB bonding is new. My recollection is that it all worked properly around the end of October. > I had not tested since then, so this is something that must have cropped up in the interregnum. 
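The status values in the two messages quoted above are negated Linux errno numbers. A minimal sketch for pulling failed joins out of IPoIB kernel messages and naming the status, assuming the message format seen in this thread; the helper name `failed_joins` and the two-entry table are illustrative and not part of any IPoIB tool:

```python
import re

# Linux errno values mentioned in the thread (assumption: Linux numbering).
ERRNO_NAMES = {11: "EAGAIN", 22: "EINVAL"}

# Matches lines such as:
#   ib0: multicast join failed for ff12:...:ffff, status -22
JOIN_FAILED = re.compile(
    r"(?P<dev>ib\d+): multicast join failed for "
    r"(?P<mgid>[0-9a-f:]+), status -(?P<err>\d+)"
)

def failed_joins(lines):
    """Yield (device, mgid, errno name) for each failed-join line."""
    for line in lines:
        m = JOIN_FAILED.search(line)
        if m:
            err = int(m.group("err"))
            yield m.group("dev"), m.group("mgid"), ERRNO_NAMES.get(err, str(err))

# The two messages reported in the thread.
sample = [
    "kernel: ib0: multicast join failed for "
    "ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -11",
    "ib0: multicast join failed for "
    "0001:0000:0000:0000:0000:0000:0000:0000, status -22",
]

for dev, mgid, name in failed_joins(sample):
    print(dev, mgid, name)
```

Feeding the output of dmesg through `failed_joins` would separate the transient EAGAIN joins on well-formed MGIDs from the EINVAL joins on the junk MGID.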
> >> >> Under bonding there might be a window in time where from the kernel >> network stack perspective the bonding device ether-type is Ethernet >> and not InfiniBand and hence the wrong (ip_eth_mc_map instead of >> ip_ib_mc_map) function would be called to do the mapping from the IP >> multicast address to the HW multicast address. >> >> >>> Subsequently an ib-bond status does not reveal any slave as active as shown below: >>> ib-bond --status >>> bond0: 80:00:04:04:fe:80:00:00:00:00:00:00:00:05:ad:00:00:03:05:b9 >>> slave0: ib0 >>> slave1: ib1 >> >> As this script is non-standard and deprecated, I would recommend not >> to use it but rather the classic /proc/net/bonding/bond0 entry, along >> with ip addr show on bond0, ib0, ib1 > Thanks for alerting me to the fact that the ib-bond script was deprecated. Again this seemed > to all work about 6 weeks ago. Is that (ib-bond is deprecated) documented somewhere? > > Pradeep > From monis at Voltaire.COM Mon Apr 20 05:05:25 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Mon, 20 Apr 2009 15:05:25 +0300 Subject: [ofa-general] ***SPAM*** Re: IB Bonding errors with recent kernel In-Reply-To: <52436c7f0904200421s53d65a31vdcbb26babd9be196@mail.gmail.com> References: <52436c7f0904200421s53d65a31vdcbb26babd9be196@mail.gmail.com> Message-ID: <49EC6505.40406@Voltaire.COM> Dennis Portello wrote: > Hello, > > I seem to be experiencing the exact issue discussed below (back in > December). I'm using the 2.6.27 kernel and the bonding drivers available > in that kernel. Was there ever a solution or patch to solve this? I have Not really, a patch was sent a long time ago but it wasn't accepted. The claim was that it takes care of a phenomenon that exists only with Redhat 4 (I tend to agree). 
You can read the discussion here http://kerneltrap.org/mailarchive/linux-netdev/2008/4/10/1391984 > been using the ib-bond scripts as well, but using other approaches like > standard OS tools or adding the bond through sysfs all seem to have the > same results. > As I recall, this shouldn't happen when working with ib-bond. However, ib-bond is a deprecated tool so I wouldn't tell you to use it as a solution, but I do wonder why you still see the -22 status in dmesg. > Regular TCP/IP unicast works, though dmesg is full of warning about > multicast failing. Multicast does not work at all. Multicast that is not working is not related to the issue you describe. I suggest that you open a bug here https://bugs.openfabrics.org/ and describe what you do and what you get in detail. Thanks, MoniS From nicolas.morey-chaisemartin at ext.bull.net Mon Apr 20 05:05:25 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey-Chaisemartin) Date: Mon, 20 Apr 2009 14:05:25 +0200 Subject: [ofa-general] XRC / Libibverbs Message-ID: <49EC6505.1070203@ext.bull.net> Hi, I was wondering why in libibverbs XRC is implemented as patches and not directly in the code? Are there compatibility problems? The latest qperf can't be built even with the latest libibverbs as it requires XRC defines which do not exist (except if you manually apply the patches), which is not a good thing IMHO. Regards Nicolas From ogerlitz at Voltaire.com Mon Apr 20 05:17:46 2009 From: ogerlitz at Voltaire.com (Or Gerlitz) Date: Mon, 20 Apr 2009 15:17:46 +0300 Subject: [ofa-general] ***SPAM*** Re: IB Bonding errors with recent kernel In-Reply-To: <52436c7f0904200421s53d65a31vdcbb26babd9be196@mail.gmail.com> References: <52436c7f0904200421s53d65a31vdcbb26babd9be196@mail.gmail.com> Message-ID: <49EC67EA.2030403@Voltaire.com> Dennis Portello wrote: > Regular TCP/IP unicast works, though dmesg is full of warning about > multicast failing. Multicast does not work at all. 
Unicast IP relies on ARP, and IPoIB ARPs use the broadcast multicast group, so IB multicast does work on your setup. To see which IB multicast groups are being joined by your IPoIB devices, you can use the ipoib debugfs entries:

    $ mount -t debugfs none /sys/kernel/debug
    $ cat /sys/kernel/debug/ipoib/ibxxx_mcg

See Documentation/infiniband/ipoib.txt for more info. Or. From dorfman.eli at gmail.com Mon Apr 20 05:46:32 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Mon, 20 Apr 2009 15:46:32 +0300 Subject: [ofa-general] ***SPAM*** [PATCH] ib_types.h: fix commit 103891092f5f6f0b2cf56555e19fdf008f164c41 Message-ID: <49EC6EA8.8030403@gmail.com> fix wrong padding for SA portinfo record after addition of max_credit_hint and link_rt_latency to SM portinfo Signed-off-by: Eli Dorfman
---
 opensm/include/iba/ib_types.h | 1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/opensm/include/iba/ib_types.h b/opensm/include/iba/ib_types.h
index 1be2109..beb7492 100644
--- a/opensm/include/iba/ib_types.h
+++ b/opensm/include/iba/ib_types.h
@@ -5816,7 +5816,6 @@ typedef struct _ib_portinfo_record {
 	uint8_t port_num;
 	uint8_t resv;
 	ib_port_info_t port_info;
-	uint8_t pad[6];
 } PACK_SUFFIX ib_portinfo_record_t;
 
 #include
--
1.5.5
From jackm at dev.mellanox.co.il Mon Apr 20 06:02:33 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 20 Apr 2009 16:02:33 +0300 Subject: [ofa-general] XRC / Libibverbs In-Reply-To: <49EC6505.1070203@ext.bull.net> References: <49EC6505.1070203@ext.bull.net> Message-ID: <200904201602.33815.jackm@dev.mellanox.co.il> On Monday 20 April 2009 15:05, Nicolas Morey-Chaisemartin wrote: > Hi, > > I was wondering why in libibverbs XRC is implemented as patches and not directly in the code? > Are there compatibility problems? > The latest qperf can't be built even with the latest libibverbs as it requires XRC defines which do not exist (except if you manually apply the patches), which is not a good thing IMHO. 
> XRC has not yet been accepted into the kernel. It has been pending for a long time, but neither Roland nor I have had the time to polish it up. I need to work on this with Roland in the upcoming weeks. - Jack From dennis.portello at gmail.com Mon Apr 20 06:29:35 2009 From: dennis.portello at gmail.com (Dennis Portello) Date: Mon, 20 Apr 2009 09:29:35 -0400 Subject: [ofa-general] ***SPAM*** Re: IB Bonding errors with recent kernel In-Reply-To: <49EC6505.40406@Voltaire.COM> References: <52436c7f0904200421s53d65a31vdcbb26babd9be196@mail.gmail.com> <49EC6505.40406@Voltaire.COM> Message-ID: <52436c7f0904200629q447d969cwbe12ce8bb606584b@mail.gmail.com> I can confirm that this issue exists beyond Redhat 4; I'm using Ubuntu 8.10 (2.6.27). I'm using ib-bond and I've also tried adding the bonds directly with:

    echo +bond0 > /sys/class/net/bonding_masters
    echo 1 > /sys/class/net/bond0/bonding/mode
    echo 100 > /sys/class/net/bond0/bonding/miimon
    echo +ib0 > /sys/class/net/bond0/bonding/slaves
    echo +ib1 > /sys/class/net/bond0/bonding/slaves
    ifconfig bond0 192.168.47.102/24
    route add -net 224.0.0.0/3 gw 192.168.47.100

I will be happy to open a ticket on this issue. Thank you, Dennis P. On Mon, Apr 20, 2009 at 8:05 AM, Moni Shoua wrote: > Dennis Portello wrote: > > Hello, > > > > I seem to be experiencing the exact issue discussed below (back in > > December). I'm using the 2.6.27 kernel and the bonding drivers available > > in that kernel. Was there ever a solution or patch to solve this? I have > Not really, a patch was sent a long time ago but it wasn't accepted. The > claim was that it takes > care of a phenomenon that exists only with Redhat 4 (I tend to agree). > You can read the discussion here > http://kerneltrap.org/mailarchive/linux-netdev/2008/4/10/1391984 > > > been using the ib-bond scripts as well, but using other approaches like > > standard OS tools or adding the bond through sysfs all seem to have the > > same results. 
> > > As I recall, this shouldn't happen when working with ib-bond. However, > ib-bond is a deprecated tool > so I wouldn't tell you to use it as a solution, but I do wonder why you > still see the -22 status in dmesg. > > Regular TCP/IP unicast works, though dmesg is full of warning about > > multicast failing. Multicast does not work at all. > Multicast that is not working is not related to the issue you describe. I > suggest that > you open a bug here https://bugs.openfabrics.org/ and describe what you > do and what you get > in detail. > > Thanks, > MoniS > > From monis at Voltaire.COM Mon Apr 20 06:52:20 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Mon, 20 Apr 2009 16:52:20 +0300 Subject: [ofa-general] ***SPAM*** Re: IB Bonding errors with recent kernel In-Reply-To: <52436c7f0904200629q447d969cwbe12ce8bb606584b@mail.gmail.com> References: <52436c7f0904200421s53d65a31vdcbb26babd9be196@mail.gmail.com> <49EC6505.40406@Voltaire.COM> <52436c7f0904200629q447d969cwbe12ce8bb606584b@mail.gmail.com> Message-ID: <49EC7E14.2070707@Voltaire.COM> Dennis Portello wrote: > I can confirm that this issue exists beyond Redhat 4; I'm using Ubuntu > 8.10 (2.6.27). > > I'm using ib-bond and I've also tried adding the bonds directly with:
>
> echo +bond0 > /sys/class/net/bonding_masters
> echo 1 > /sys/class/net/bond0/bonding/mode
> echo 100 > /sys/class/net/bond0/bonding/miimon
> echo +ib0 > /sys/class/net/bond0/bonding/slaves
> echo +ib1 > /sys/class/net/bond0/bonding/slaves
> ifconfig bond0 192.168.47.102/24
> route add -net 224.0.0.0/3 gw 192.168.47.100
>
> I will be happy to open a ticket on this issue. > > Thank you, > Dennis P. > I can't answer for Ubuntu since I don't have it installed. However, the script you sent should behave the same on another OS. I'll try that. I'm more interested in the multicast that doesn't work for you. 
I would appreciate it if you opened a detailed bug for this. Thanks From tziporet at dev.mellanox.co.il Mon Apr 20 06:53:08 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Mon, 20 Apr 2009 16:53:08 +0300 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Will this brick my switch? In-Reply-To: <2f3bf9a60904192337v6c816036yfda4a708947ccc9e@mail.gmail.com> References: <2f3bf9a60904192337v6c816036yfda4a708947ccc9e@mail.gmail.com> Message-ID: <49EC7E44.4030001@mellanox.co.il> Dotan Barak wrote: > On Sun, Apr 19, 2009 at 9:45 PM, Chris Worley wrote: > >> I've got a QDR switch from Mellanox that's a few months old... and has >> no markings, but looks like the only QDR switch described on their web >> pages. >> >> I went to burn the firmware, but got the message that I shouldn't: >> >> # ./flint -d lid-6 -i ~/fw-IS4-rel-7_2_000-MTS3600Q_A1.bin b >> >> Current FW version on flash: 7.0.142 >> New FW version: 7.2.0 >> >> You are about to replace current PSID on flash - "MT_0C00110003" >> with a different PSID - "MT_0C20110003". >> Note: It is highly recommended not to change the PSID. >> >> Do you want to continue ? (y/n) [n] : >> >> Is this okay? >> > This means that the configuration file that the burning tool is using > is different than the one in your switch. > > It is highly advised not to do it; instead, find a firmware image that has your > configuration file > You have an old version of the IS4 switch (a0 and not a1). Please contact Mellanox support to get the correct FW or a replacement for the switch. 
Tziporet From dennis.portello at gmail.com Mon Apr 20 06:55:58 2009 From: dennis.portello at gmail.com (Dennis Portello) Date: Mon, 20 Apr 2009 09:55:58 -0400 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Re: IB Bonding errors with recent kernel In-Reply-To: <49EC67EA.2030403@Voltaire.com> References: <52436c7f0904200421s53d65a31vdcbb26babd9be196@mail.gmail.com> <49EC67EA.2030403@Voltaire.com> Message-ID: <52436c7f0904200655o21923554i88eec25d40e5f1c@mail.gmail.com> Hello Or, Thanks for the reply. I enabled debug, and the results of my test are below. First off, I ran this same test on bonded Ethernet and on a single IB interface with success.

    sudo route add -net 224.0.0.0/3 gw 192.168.47.102
    socat STDIO UDP4-DATAGRAM:224.1.0.1:6666,bind=:6666,range=192.168.47.0/24,ip-add-membership=224.1.0.1:192.168.47.102

    sudo route add -net 224.0.0.0/3 gw 192.168.47.100
    socat STDIO UDP4-DATAGRAM:224.1.0.1:6666,bind=:6666,range=192.168.47.0/24,ip-add-membership=224.1.0.1:192.168.47.100

socat sets up a peer-to-peer multicast communication; the expected results are echoed data on the sending end and data on the receiving end. When attempting this test with bonded IB interfaces, I only get the echoed data on the sending end and nothing on the receiving end. Here are the results from dmesg: [ 859.128720] bonding: bond3 is being created... [ 859.129468] bonding: bond3: setting mode to active-backup (1). [ 859.129501] bonding: bond3: Setting MII monitoring interval to 100. [ 859.141557] bonding: bond3: doing slave updates when interface is down. [ 859.141563] bonding: bond3: Adding slave ib0. [ 859.141566] bonding bond3: master_dev is not up in bond_enslave [ 859.141567] bonding: bond3: Warning: enslaved VLAN challenged slave ib0. Adding VLANs will be blocked as long as ib0 is part of bond bond3 [ 859.141570] bonding: bond3: Warning: The first slave device specified does not support setting the MAC address. 
Setting fail_over_mac to active.<7>ib0: bringing up interface [ 859.182437] ib0: starting multicast thread [ 859.182568] ib0: joining MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff [ 859.182580] ib0: restarting multicast task [ 859.182583] ib0: stopping multicast thread [ 859.182586] ib0: adding multicast entry for mgid ff12:401b:ffff:0000:0000:0000:0000:0001 [ 859.182589] ib0: starting multicast thread [ 859.182739] ib0: join completion for ff12:401b:ffff:0000:0000:0000:ffff:ffff (status 0) [ 859.182951] ib0: Created ah ffff8804379e8680 [ 859.182954] ib0: MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff AV ffff8804379e8680, LID 0xc000, SL 0 [ 859.183088] ib0: joining MGID ff12:401b:ffff:0000:0000:0000:0000:0001 [ 859.183222] ib0: join completion for ff12:401b:ffff:0000:0000:0000:0000:0001 (status 0) [ 859.183354] ib0: Created ah ffff8804389a9880 [ 859.183359] ib0: MGID ff12:401b:ffff:0000:0000:0000:0000:0001 AV ffff8804389a9880, LID 0xc001, SL 0 [ 859.184369] bonding: bond3: enslaving ib0 as a backup interface with a down link. [ 859.186365] ib0: successfully joined all multicast groups [ 859.186385] ib0: restarting multicast task [ 859.186386] ib0: stopping multicast thread [ 859.186389] ib0: starting multicast thread [ 859.186500] ib0: successfully joined all multicast groups [ 859.188608] bonding: bond3: doing slave updates when interface is down. [ 859.188613] bonding: bond3: Adding slave ib1. [ 859.188615] bonding bond3: master_dev is not up in bond_enslave [ 859.188617] bonding: bond3: Warning: enslaved VLAN challenged slave ib1. 
Adding VLANs will be blocked as long as ib1 is part of bond bond3 [ 859.221889] ib1: bringing up interface [ 859.222359] ib1: starting multicast thread [ 859.222483] ib1: joining MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff [ 859.222494] ib1: restarting multicast task [ 859.222498] ib1: stopping multicast thread [ 859.222500] ib1: adding multicast entry for mgid ff12:401b:ffff:0000:0000:0000:0000:0001 [ 859.222503] ib1: starting multicast thread [ 859.224240] bonding: bond3: enslaving ib1 as a backup interface with a down link. [ 859.224634] ib1: join completion for ff12:401b:ffff:0000:0000:0000:ffff:ffff (status 0) [ 859.224837] ib1: Created ah ffff880436cc8400 [ 859.224841] ib1: MGID ff12:401b:ffff:0000:0000:0000:ffff:ffff AV ffff880436cc8400, LID 0xc000, SL 0 [ 859.224968] ib1: joining MGID ff12:401b:ffff:0000:0000:0000:0000:0001 [ 859.225099] ib1: join completion for ff12:401b:ffff:0000:0000:0000:0000:0001 (status 0) [ 859.225223] ib1: Created ah ffff88043840fec0 [ 859.225228] ib1: MGID ff12:401b:ffff:0000:0000:0000:0000:0001 AV ffff88043840fec0, LID 0xc001, SL 0 [ 859.226956] ib1: successfully joined all multicast groups [ 859.226961] ib1: restarting multicast task [ 859.226962] ib1: stopping multicast thread [ 859.226964] ib1: starting multicast thread [ 859.227074] ib1: successfully joined all multicast groups [ 859.228034] ib0: mtu > 2044 will cause multicast packet drops. [ 859.229779] ib1: mtu > 2044 will cause multicast packet drops. [ 859.233134] ADDRCONF(NETDEV_UP): bond3: link is not ready [ 859.233153] bonding: bond3: link status definitely up for interface ib0. [ 859.233156] bonding: bond3: making interface ib0 the new active one. [ 859.233167] ib0: restarting multicast task [ 859.233170] ib0: stopping multicast thread [ 859.233172] ib0: adding multicast entry for mgid 0001:0000:0000:0000:0000:0000:0000:0000 [ 859.233175] ib0: starting multicast thread [ 859.233178] bonding: bond3: first active interface up! 
[ 859.233180] bonding: bond3: link status definitely up for interface ib1. [ 859.233289] ib0: joining MGID 0001:0000:0000:0000:0000:0000:0000:0000 [ 859.234904] ADDRCONF(NETDEV_CHANGE): bond3: link becomes ready [ 859.234944] ib0: restarting multicast task [ 859.234948] ib0: stopping multicast thread [ 859.234951] ib0: adding multicast entry for mgid ff12:601b:ffff:0000:0000:0001:ff00:f778 [ 859.234954] ib0: starting multicast thread [ 859.235069] ib0: joining MGID ff12:601b:ffff:0000:0000:0001:ff00:f778 [ 859.235090] ib0: join completion for 0001:0000:0000:0000:0000:0000:0000:0000 (status -22) [ 859.235095] ib0: multicast join failed for 0001:0000:0000:0000:0000:0000:0000:0000, status -22 [ 859.235162] ib0: restarting multicast task [ 859.235163] ib0: stopping multicast thread [ 859.235166] ib0: adding multicast entry for mgid ff12:401b:ffff:0000:0000:0000:0000:00fb [ 859.235168] ib0: starting multicast thread [ 859.235200] ib0: join completion for ff12:601b:ffff:0000:0000:0001:ff00:f778 (status 0) [ 859.235304] ib0: joining MGID 0001:0000:0000:0000:0000:0000:0000:0000 [ 859.235343] ib0: Created ah ffff88043a9b9440 [ 859.235347] ib0: MGID ff12:601b:ffff:0000:0000:0001:ff00:f778 AV ffff88043a9b9440, LID 0xc002, SL 0 [ 859.235408] ib0: join completion for 0001:0000:0000:0000:0000:0000:0000:0000 (status -22) [ 859.235412] ib0: multicast join failed for 0001:0000:0000:0000:0000:0000:0000:0000, status -22 [ 859.235481] ib0: joining MGID 0001:0000:0000:0000:0000:0000:0000:0000 [ 859.235592] ib0: join completion for 0001:0000:0000:0000:0000:0000:0000:0000 (status -22) [ 859.235596] ib0: multicast join failed for 0001:0000:0000:0000:0000:0000:0000:0000, status -22 [ 859.260028] ib0: setting up send only multicast group for ff12:601b:ffff:0000:0000:0000:0000:0016 [ 859.260042] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting join [ 859.260136] ib0: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22 [ 859.263792] ib0: 
setting up send only multicast group for ff12:401b:ffff:0000:0000:0000:0000:0016 [ 859.263806] ib0: no multicast record for ff12:401b:ffff:0000:0000:0000:0000:0016, starting join [ 859.263883] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:0016, status -22 [ 860.600025] ib0: setting up send only multicast group for ff12:601b:ffff:0000:0000:0000:0000:0002 [ 860.600035] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0002, starting join [ 860.600149] ib0: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0002, status -22 [ 863.230303] ib0: joining MGID 0001:0000:0000:0000:0000:0000:0000:0000 [ 863.230406] ib0: join completion for 0001:0000:0000:0000:0000:0000:0000:0000 (status -22) [ 863.230411] ib0: multicast join failed for 0001:0000:0000:0000:0000:0000:0000:0000, status -22 [ 864.600035] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0002, starting join [ 864.600124] ib0: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0002, status -22 [ 868.600034] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0002, starting join [ 868.600119] ib0: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0002, status -22 [ 868.620031] ib0: no multicast record for ff12:401b:ffff:0000:0000:0000:0000:0016, starting join [ 868.620112] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:0016, status -22 [ 869.100039] ib0: no multicast record for ff12:601b:ffff:0000:0000:0000:0000:0016, starting join [ 869.100124] ib0: multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22 [ 869.600029] bond3: no IPv6 routers present [ 879.230231] ib0: joining MGID 0001:0000:0000:0000:0000:0000:0000:0000 [ 879.230349] ib0: join completion for 0001:0000:0000:0000:0000:0000:0000:0000 (status -22) [ 879.230355] ib0: multicast join failed for 0001:0000:0000:0000:0000:0000:0000:0000, status -22 [ 886.919993] ib0: restarting multicast task [ 886.919997] ib0: stopping 
multicast thread [ 886.920002] ib0: adding multicast entry for mgid ff12:401b:ffff:0000:0000:0000:0001:0001 [ 886.920005] ib0: starting multicast thread [ 886.920140] ib0: joining MGID 0001:0000:0000:0000:0000:0000:0000:0000 [ 886.920244] ib0: join completion for 0001:0000:0000:0000:0000:0000:0000:0000 (status -22) [ 886.920248] ib0: multicast join failed for 0001:0000:0000:0000:0000:0000:0000:0000, status -22 [ 886.934421] ib0: no multicast record for ff12:401b:ffff:0000:0000:0000:0000:0016, starting join [ 886.934520] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:0016, status -22 [ 889.000014] ib0: no multicast record for ff12:401b:ffff:0000:0000:0000:0000:0016, starting join [ 889.000102] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:0016, status -22 [ 899.053269] ib0: restarting multicast task [ 899.053273] ib0: stopping multicast thread [ 899.053277] ib0: deleting multicast group ff12:401b:ffff:0000:0000:0000:0001:0001 [ 899.053280] ib0: deleting multicast group ff12:401b:ffff:0000:0000:0000:0001:0001 [ 899.053285] ib0: starting multicast thread [ 899.053430] ib0: joining MGID 0001:0000:0000:0000:0000:0000:0000:0000 [ 899.053540] ib0: join completion for 0001:0000:0000:0000:0000:0000:0000:0000 (status -22) [ 899.053544] ib0: multicast join failed for 0001:0000:0000:0000:0000:0000:0000:0000, status -22 [ 899.073152] ib0: no multicast record for ff12:401b:ffff:0000:0000:0000:0000:0016, starting join [ 899.073241] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:0016, status -22 [ 903.420017] ib0: no multicast record for ff12:401b:ffff:0000:0000:0000:0000:0016, starting join [ 903.420100] ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:0016, status -22 Thank you, Dennis P. On Mon, Apr 20, 2009 at 8:17 AM, Or Gerlitz wrote: > Dennis Portello wrote: > > Regular TCP/IP unicast works, though dmesg is full of warning about > > multicast failing. Multicast does not work at all. 
>
> Unicast IP relies on ARP and IPoIB ARPs use the broadcast multicast group,
> so
> IB multicast does work on your setup... to see what IB multicast groups are
> being
> joined by your IPoIB devices, you can use the ipoib debugfs entries
>
> $ mount -t debugfs none /sys/kernel/debug
> $ cat /sys/kernel/debug/ipoib/ibxxx_mcg
>
> see Documentation/infiniband/ipoib.txt for more info
>
> Or.
>

From sashak at voltaire.com Mon Apr 20 06:58:14 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 20 Apr 2009 16:58:14 +0300
Subject: ***SPAM*** Re: [ofa-general] osmtest fails with latest opensm
In-Reply-To: <49EC0B09.5070003@voltaire.com>
References: <49EC0B09.5070003@voltaire.com>
Message-ID: <20090420135814.GA25724@sk>

Hi Doron,

On 08:41 Mon 20 Apr , Doron Shoham wrote:
> Apr 19 19:06:31 694993 [A30D5910] 0x01 -> osmtest_validate_port_data:
> ERR 0039: Field mismatch port LID 0x0 Num:0x0
> Expected m_key 0x0800000000000000, received 0x0000000000000000 Apr 19
> 19:06:31 695002 [A30D5910] 0x01 ->
> osmtest_validate_single_port_rec_lid: ERR 0109:
> osmtest_validate_port_data failed (IB_ERROR) Apr 19 19:06:31 695011
> [A30D5910] 0x01 ->
> osmtest_validate_single_port_recs: ERR 011B:
> osmtest_validate_single_port_rec_lid (IB_ERROR) Apr 19 19:06:31 695019
> [A30D5910] 0x01 -> osmtest_run: ERR 0146: SA validation database failure
> (IB_ERROR)
> OSMTEST: TEST "All Validations" FAIL

Is it something new (triggered by latest changes)? Eli's fix could be
related to this, but I cannot reproduce the failure using the current
(unfixed) master.
Sasha From sashak at voltaire.com Mon Apr 20 07:04:01 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 20 Apr 2009 17:04:01 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] ib_types.h: fix commit 103891092f5f6f0b2cf56555e19fdf008f164c41 In-Reply-To: <49EC6EA8.8030403@gmail.com> References: <49EC6EA8.8030403@gmail.com> Message-ID: <20090420140401.GC25724@sk> On 15:46 Mon 20 Apr , Eli Dorfman (Voltaire) wrote: > fix wrong padding for SA portinfo record after addition > of max_credit_hint and link_rt_latency to SM portinfo > > Signed-off-by: Eli Dorfman Applied. Thanks. Sasha From hal.rosenstock at gmail.com Mon Apr 20 07:19:30 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Mon, 20 Apr 2009 10:19:30 -0400 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Re: [PATCHv2] opensm/PerfMgr: Better redirection support In-Reply-To: <20090417145717.GG17631@sk> References: <20090312202134.GC25024@comcast.net> <20090415125925.GF7353@sk> <20090416005422.GC10146@sk> <20090417145717.GG17631@sk> Message-ID: On Fri, Apr 17, 2009 at 10:57 AM, Sasha Khapyorsky wrote: > On 09:34 Thu 16 Apr     , Hal Rosenstock wrote: >> > >> > Yes, and you can use lid value as such flag - just simpler. >> >> When GID redirection is specified by client, LID must be 0 so I don't see this. > > 1. GID redirection is not implemented in this patch. > 2. In any case you will need to resolve LID value (using GID) in order > to send MAD. > So LID = 0 can be used as invalid redirection data flag. But this was > a minor comment. Yes, this is a minor tradeoff. One loses some information by overloading. One example for the GID redirection case, SA PR query in progress v. bad redirection. >> > My point was different - to separate redirection related data from main >> > flow. >> >> I'm still not sure what you mean by this. Encapsulate the redirection >> data better so it is obtained by some potentially common routine ? > > Yes. 
And also to not use "fake" redirection fields (specifically pkey_ix) > in non-redirected flow - this is why I think you need 'port' structure. The pkey index was needed before; it was just assumed to be 0 as redirection (nor any real pkey support) was supported. >> >> > PerfMgr is always running over discovered fabric so maybe local port >> >> > number should be detected later at start of PerfMgr process cycle just >> >> > using OpenSM DB. >> >> >> >> Why is that better than doing this at bind time of PerfMgr ? >> > >> > At least two reasons: faster and less code. >> >> Are you sure the OpenSM DB accesses will be faster than the vendor calls here ? > > Yes, it is direct memory read against opening and parsing many files > (+ memory allocations, etc.). Yes, it's orders of magnitude faster. >> Is bind performance sensitive anyhow ? > > Not at all, but all what you need here is just local port number - and 40 > (or so) lines of the code (which is 80% duplicated with pkey validation) > for doing this looks like overkill for me (not in sense of performance). > >> The performance comment is >> clearly relevant to the main flow though. > > Sure, but there you just need to read a value. > >> >> > Also what about letting "chance" for port to refresh redirection info? >> >> >> >> What do you mean ? >> > >> > When port has invalid redirection data, should you care about attempting >> > to refresh this? >> >> If the PMA gives bad redirection data (which BTW is noncompliant), it >> seems likely to do this again so I'm not sure about the value of this. >> Do you think that's a better thing to do ? > > I don't have a clear opinion (and so asked). Actually if I understood > your code correctly this means that if some port once gets bad > redirection data it will dropped from PerfMgr cycle forever, right? This is an implementation decision and I chose not to query. The invalid info can be cleared via the console which will allow this port to be retried. 
>> >> Redirection does not occur frequently. >> > >> > How could we know:) >> >> It's the current use case for PerfMgt. > > Let's suppose it happens just three times per one PerfMgr cycle - > 3 > 1 anyway. > Another important advantage is that in case when pkey tables are > prepared *before* actual PerfMgr cycle and will not slow down querying > itself. > > Another thought - could p_physp->pkeys be used for index > detection/validation? Yes, I was thinking that too when you said to switch the local port determination over to the OpenSM DB. >> > When OpenSM is in master mode it cannot change (PerfMgr is synchronized >> > with heavy sweep). >> > >> > It is possible with standby OpenSM, so what - this single request will >> > fail once. >> >> Some recovery for such failure would be needed. > > Not really - next PerfMgr cycle will fetch valid data. > >> Also, what about not active ? > > Same as standby (let's call it "non-master" modes). > >> >> > All above are not OpenSM errors, but wrong external data. I think it >> >> > should be logged as VERBOSE messages. >> >> >> >> I agree it's wrong external data but it seems serious enough to me to >> >> treat as an error. >> > >> > And some stupid port will be able to put OpenSM in endless error >> > printing. I don't think it is a good idea. >> >> It would be a non compliant PMA which I would think we'd want to know >> about sooner rather than later. > > If an admin want to care about this (and also about other such sort of > things) he/she will turn verbosity "on". But how does the admin even know that redirection is being used so is needed to be enabled ? That assumes the admin knows which devices require redirection. >> >> Seems like some sort of configuration error to me if this is disabled >> >> at the manager but the PMA wants to use it. >> > >> > PMA shouldn't dictate here. >> >> PMA does dictate redirection. Manager has no way to shut it off. > > But it should be able to ignore this (including "noisy" logging). 
That's the tradeoff. Your choice leads to silent failures. -- Hal >> If >> manager turns off it's handling of redirection, then it just doesn't >> work (that port is inaccessible by the manager). This argues for the >> default to be enabled. The current default is disabled since this code >> was deemed experimental. > > Right, and it should be consistent with this (now default) setting. > >> >> > BTW, why to bother with verifying redirection info when redirection >> >> > support is disabled anyway? >> >> >> >> I thought it was useful to know the redirection info was invalid >> >> rather than getting the disabled notification and then enabling and >> >> finding out. >> > >> > For PMAs debug purposes redirection support should be switched "on" >> > obviously. >> >> Why do you say debug purposes ? Isn't it any purpose ? > > I meant PMA support + PMA debug. > Sasha > From tziporet at mellanox.co.il Mon Apr 20 08:29:41 2009 From: tziporet at mellanox.co.il (Tziporet Koren) Date: Mon, 20 Apr 2009 18:29:41 +0300 Subject: [ofa-general] EWG/OFED meeting agenda for today - Apr 20, 09 Message-ID: <5D49E7A8952DC44FB38C38FA0D758EAD026677C5@mtlexch01.mtl.com> This is OFED meeting agenda for today (April 20): a. 
OFED 1.4.1 release status:

   Reminder for OFED 1.4.1 schedule:
   * RC1 & RC2 & RC3 - done
   * RC4 - planned for today - need to decide today
   * GA - Apr 23 - need to close date in the meeting today

   Critical bugs:

   bug_id  bug_severity  op_sys  assigned_to                   short_short_desc
   1595    critical      Other   Jeffrey.C.Becker at nasa.gov  failed to compile nfs sles11
   1589    critical      RHEL 5  jon at opengridcomputing.com   FRMR registration errors logged by cxgb3 during NFSRDMA iozone runs
   1591    critical      Other   jon at opengridcomputing.com   Can not compile OFED1.4.1 rc3 on ppc with SLES11
   1604    critical      Other   vlad at mellanox.co.il         Failed to build rnfs-utils on SLES11
   1571    critical      RHEL 5  vu at mellanox.com             nfsrdma server crash @test5 connectathon basic test,
   1528    major         RHEL 5  jackm at mellanox.co.il        IPoIB get stack when running Hadoop application.
   1529    major         RHEL 5  jackm at mellanox.co.il        Opensm cannot be stopped following openib failure.
   1545    major         Other   jackm at mellanox.co.il        Performance degradation in ofed 1.4.1 in TCP BW for some packets size
   1596    major         Other   Jeffrey.C.Becker at nasa.gov  openibd stop failed when nfs is loaded
   1579    major         RHEL 5  jsquyres at cisco.com          OpenMPI-1.3.1-1: segfault during close
   1581    major         Other   ogerlitz at voltaire.com       Unable to uninstall OFED1.4 due to dependecies on tgt and scsi-target-utils

b. OFED 1.5:

   Update:
   * Vlad opened kernel branch based on 2.6.30-rc2
   * Jack is working to have basic components compile (core & ipoib).
     Should have the first version this week
   * Once this will work we will change the user space (especially to
     the new management tree)

c. Update on MPI memory registration requirement - Jeff S.

d.
Open discussion

From sean.hefty at intel.com Mon Apr 20 09:09:26 2009
From: sean.hefty at intel.com (Sean Hefty)
Date: Mon, 20 Apr 2009 09:09:26 -0700
Subject: [ofa-general] [PATCH] rdma_cm: Add debugfs entries to monitor rdma_cm connections
In-Reply-To: <49EB11E5.5000407@Voltaire.COM>
References: <49E71FE1.90102@Voltaire.COM> <49EB11E5.5000407@Voltaire.COM>
Message-ID:

>rdma_id is a suffix that leaves room for more, or in other works - I just
>wanted to leave room for other
>debug information in the future (e.g. number of count of total incoming
>connection on device)

ok - makes sense

>TP=TyPe (Device type)
>PO=POrt (Port Number)
>PS=PortSpace
>ST=STate
>
>I tried to shorten the output line as much as possible to make the output looks
>as easy to
>read table (on most screen the output will be one line per rdma_id)
>The same thought made me print only the numeric value and not it's string
>value.

I was able to figure these out by looking at the code, but if I look at the
output of netstat, the headings and values are easy to interpret without
needing to refer to source code.

- Sean

From swise at opengridcomputing.com Mon Apr 20 11:05:55 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 20 Apr 2009 13:05:55 -0500
Subject: [ofa-general] ofed-1.4.1 dapl and intel mpi 3.2.1
Message-ID: <49ECB983.8050909@opengridcomputing.com>

Hey Arlin,

Have you seen this? I'm unable to get IMPI running over dapl.
dtest works with provider chelsio2, but IMPI is failing to load the libs:

[impi at r1 ~]$ mpiexec -ppn 1 -genv I_MPI_DEVICE rdma:chelsio2 -env I_MPI_DEBUG 2 -n 2 /opt/intel/impi/3.2.1/tests/IMB-3.1/IMB-MPI1 pingpong
[0] MPI startup(): cannot open dynamic library libdat.so
[0] MPI startup(): cannot open dynamic library libdat2.so
[1] MPI startup(): cannot open dynamic library libdat.so
[1] MPI startup(): cannot open dynamic library libdat2.so
[1] MPI startup(): DAPL provider on rank 1:r2-iw
[0] MPI startup(): socket data transfer mode
[1] MPI startup(): socket data transfer mode

Got any ideas?

Thanks,

Steve.

From swise at opengridcomputing.com Mon Apr 20 11:22:39 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 20 Apr 2009 13:22:39 -0500
Subject: [ofa-general] ofed-1.4.1 dapl and intel mpi 3.2.1
In-Reply-To: <49ECB983.8050909@opengridcomputing.com>
References: <49ECB983.8050909@opengridcomputing.com>
Message-ID: <49ECBD6F.8000206@opengridcomputing.com>

Never mind. I was compiling IMB as 32-bit and I didn't have the 32-bit
dapl libs installed. Sorry for the noise.

Steve.

Steve Wise wrote:
> Hey Arlin,
>
> Have you seen this? I'm unable to get IMPI running over dapl. dtest
> works with provider chelsio2, but IMPI is failing to load the libs:
>
>
> [impi at r1 ~]$ mpiexec -ppn 1 -genv I_MPI_DEVICE rdma:chelsio2 -env
> I_MPI_DEBUG 2 -n 2 /opt/intel/impi/3.2.1/tests/IMB-3.1/IMB-MPI1 pingpong
> [0] MPI startup(): cannot open dynamic library libdat.so
> [0] MPI startup(): cannot open dynamic library libdat2.so
> [1] MPI startup(): cannot open dynamic library libdat.so
> [1] MPI startup(): cannot open dynamic library libdat2.so
> [1] MPI startup(): DAPL provider on rank 1:r2-iw
> [0] MPI startup(): socket data transfer mode
> [1] MPI startup(): socket data transfer mode
>
>
> Got any ideas?
>
> Thanks,
>
> Steve.
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general

From arlin.r.davis at intel.com Mon Apr 20 12:28:41 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Mon, 20 Apr 2009 12:28:41 -0700
Subject: [ofa-general] RE: ofed-1.4.1 dapl and intel mpi 3.2.1
In-Reply-To: <49ECB983.8050909@opengridcomputing.com>
References: <49ECB983.8050909@opengridcomputing.com>
Message-ID:

Steve,

>Hey Arlin,
>
>Have you seen this? I'm unable to get IMPI running over dapl. dtest
>works with provider chelsio2, but IMPI is failing to load the libs:
>
>[impi at r1 ~]$ mpiexec -ppn 1 -genv I_MPI_DEVICE rdma:chelsio2 -env
>I_MPI_DEBUG 2 -n 2
>/opt/intel/impi/3.2.1/tests/IMB-3.1/IMB-MPI1 pingpong
>[0] MPI startup(): cannot open dynamic library libdat.so
>[0] MPI startup(): cannot open dynamic library libdat2.so
>[1] MPI startup(): cannot open dynamic library libdat.so
>[1] MPI startup(): cannot open dynamic library libdat2.so
>[1] MPI startup(): DAPL provider on rank 1:r2-iw
>[0] MPI startup(): socket data transfer mode
>[1] MPI startup(): socket data transfer mode

Hmmm, having problems finding libdat.so

What packages are installed?

# rpm -qa | grep dapl
dapl-2.0.17-1
compat-dapl-devel-1.2.14-1
dapl-utils-2.0.17-1
compat-dapl-1.2.14-1
dapl-devel-2.0.17-1
dapl-debuginfo-2.0.17-1
dapl-devel-static-2.0.14-1

can you verify libdat is in library path..

# ldconfig -p | grep libdat
libdat2.so.2 (libc6,x86-64) => /usr/lib64/libdat2.so.2
libdat2.so (libc6,x86-64) => /usr/lib64/libdat2.so
libdat.so.1 (libc6,x86-64) => /usr/lib64/libdat.so.1
libdat.so (libc6,x86-64) => /usr/lib64/libdat.so

-arlin

From swise at opengridcomputing.com Mon Apr 20 13:08:22 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 20 Apr 2009 15:08:22 -0500
Subject: [ofa-general] Re: ofed-1.4.1 dapl and intel mpi 3.2.1
In-Reply-To:
References: <49ECB983.8050909@opengridcomputing.com>
Message-ID: <49ECD636.3030300@opengridcomputing.com>

Davis, Arlin R wrote:
>
> Steve,
>
>
>> Hey Arlin,
>>
>> Have you seen this? I'm unable to get IMPI running over dapl. dtest
>> works with provider chelsio2, but IMPI is failing to load the libs:
>>
>> [impi at r1 ~]$ mpiexec -ppn 1 -genv I_MPI_DEVICE rdma:chelsio2 -env
>> I_MPI_DEBUG 2 -n 2
>> /opt/intel/impi/3.2.1/tests/IMB-3.1/IMB-MPI1 pingpong
>> [0] MPI startup(): cannot open dynamic library libdat.so
>> [0] MPI startup(): cannot open dynamic library libdat2.so
>> [1] MPI startup(): cannot open dynamic library libdat.so
>> [1] MPI startup(): cannot open dynamic library libdat2.so
>> [1] MPI startup(): DAPL provider
> on rank 1:r2-iw
>> [0] MPI startup(): socket data transfer mode
>> [1] MPI startup(): socket data transfer mode
>>
>
> Hmmm, having problems finding libdat.so
>
> What packages are installed?
> # rpm -qa | grep dapl
> dapl-2.0.17-1
> compat-dapl-devel-1.2.14-1
> dapl-utils-2.0.17-1
> compat-dapl-1.2.14-1
> dapl-devel-2.0.17-1
> dapl-debuginfo-2.0.17-1
> dapl-devel-static-2.0.14-1
>
> can you verify libdat is in library path..
>
> # ldconfig -p | grep libdat
> libdat2.so.2 (libc6,x86-64) => /usr/lib64/libdat2.so.2
> libdat2.so (libc6,x86-64) => /usr/lib64/libdat2.so
> libdat.so.1 (libc6,x86-64) => /usr/lib64/libdat.so.1
> libdat.so (libc6,x86-64) => /usr/lib64/libdat.so
>
> -arlin

I was building IMB 32bit and didn't have the 32b dapl installed...
I modified the mpitests Makefiles to point to the intel 64b
compiler/includes/libs and it all works now.

Thanks!

Steve.

From rdreier at cisco.com Mon Apr 20 13:53:30 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 20 Apr 2009 13:53:30 -0700
Subject: [ofa-general] Re: [PATCH 2.6.30] RDMA/cxgb3: Adjust ord/ird if needed for peer2peer connections
In-Reply-To: <20090409165218.17033.63125.stgit@build.ogc.int> (Steve Wise's message of "Thu, 09 Apr 2009 11:52:19 -0500")
References: <20090409165218.17033.63125.stgit@build.ogc.int>
Message-ID:

thanks, applied

From rdreier at cisco.com Mon Apr 20 13:58:12 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 20 Apr 2009 13:58:12 -0700
Subject: [ofa-general] Re: [PATCH] ipoib: disable napi while cq is being drained
In-Reply-To: <49DF6984.4090000@voltaire.com> (Yossi Etigin's message of "Fri, 10 Apr 2009 18:45:08 +0300")
References: <49DF6984.4090000@voltaire.com>
Message-ID:

nice debugging and a nice solution. applied, thanks.

> Fix bugzilla #1587.

This is useful information to include in the changelog -- I added

    This fixes .

when I merged the patch. In general I don't think we would ever want to
put links to bugs that are fixed into the part after "---" that gets
stripped.

From rdreier at cisco.com Mon Apr 20 13:59:56 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 20 Apr 2009 13:59:56 -0700
Subject: [ofa-general] [PATCH] RDMA/nes: application hang during large cluster test
In-Reply-To: <20090413160947.GA4260@flatif-MOBL> (Faisal Latif's message of "Mon, 13 Apr 2009 11:09:47 -0500")
References: <20090413160947.GA4260@flatif-MOBL>
Message-ID:

This patch seems to be conflating multiple fixes -- or am I
misunderstanding? For example:

> * Under heavy load, sometimes it takes longer to receive the response from
> application to the MPA request. The rexmit timeout value is too low.

so bumping the rexmit timeout would be one independent fix.

> * check_seq(), does not check for condition if the seq# is wrapped.

and this seems completely independent.

etc etc.

can you split this up into a series of patches that each fix one bug only?

From faisal.latif at intel.com Mon Apr 20 14:02:58 2009
From: faisal.latif at intel.com (Latif, Faisal)
Date: Mon, 20 Apr 2009 14:02:58 -0700
Subject: [ofa-general] [PATCH] RDMA/nes: application hang during large cluster test
In-Reply-To:
References: <20090413160947.GA4260@flatif-MOBL>
Message-ID: <588992150B702C48B3312184F1B810AD03EB85178A@azsmsx501.amr.corp.intel.com>

OK. I will split it into series of patches.

Thanks
Faisal

>-----Original Message-----
>From: Roland Dreier [mailto:rdreier at cisco.com]
>Sent: Monday, April 20, 2009 4:00 PM
>To: Latif, Faisal
>Cc: general at lists.openfabrics.org
>Subject: Re: [ofa-general] [PATCH] RDMA/nes: application hang during large
>cluster test
>
>This patch seems to be conflating multiple fixes -- or am I
>misunderstanding? For example:
>
> > * Under heavy load, sometimes it takes longer to receive the response
>from
> > application to the MPA request. The rexmit timeout value is too low.
>
>so bumping the rexmit timeout would be one independent fix.
>
> > * check_seq(), does not check for condition if the seq# is wrapped.
>
>and this seems completely independent.
>
>etc etc.
>
>can you split this up into a series of patches that each fix one bug only?
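
[Archive aside on the check_seq() item quoted above. A common way to keep a sequence-number check correct across wraparound is to compare through the signed difference instead of a plain "<". The sketch below only illustrates that idea, assuming 32-bit sequence numbers; the helper names seq_before, seq_after and seq_between are hypothetical and are not taken from the nes driver, whose actual check_seq() does not appear in this thread.]

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helpers, not the nes driver's code.
 *
 * A plain "a < b" breaks once the counter wraps: 0xfffffff0 is
 * logically before 0x00000010, yet it is numerically larger.
 * Taking the difference modulo 2^32 and reading it as signed gives
 * the right answer while the two values are less than 2^31 apart.
 */
static inline bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

static inline bool seq_after(uint32_t a, uint32_t b)
{
	return seq_before(b, a);
}

/* True if seq lies in the window [lo, hi], where hi may have wrapped. */
static inline bool seq_between(uint32_t seq, uint32_t lo, uint32_t hi)
{
	return (uint32_t)(hi - lo) >= (uint32_t)(seq - lo);
}
```

With helpers like these, a receive-window check can be written as seq_between(seq, rcv_nxt, rcv_nxt + window) without a special case for the wrapped condition (rcv_nxt and window are again illustrative names).
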

From rdreier at cisco.com Mon Apr 20 14:11:59 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 20 Apr 2009 14:11:59 -0700
Subject: [ofa-general] Re: Probable bug in mlx4 driver
In-Reply-To: <49E84B48.9050702@morey-chaisemartin.com> (Nicolas Morey-Chaisemartin's message of "Fri, 17 Apr 2009 11:26:32 +0200")
References: <49E84B48.9050702@morey-chaisemartin.com>
Message-ID:

> diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
> index 102bac9..ae692f1 100644
> --- a/drivers/net/mlx4/main.c
> +++ b/drivers/net/mlx4/main.c
> @@ -977,6 +977,7 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
>  			goto retry;
>  		}
>
> +		kfree(entries);
>  		goto no_msi;
>  	}

This part of the patch is correct I believe -- entries is leaked
otherwise if enabling MSI-X fails.

> @@ -993,7 +994,7 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
>  no_msi:
>  	dev->caps.num_comp_vectors = 1;
>
> -	for (i = 0; i < 2; ++i)
> +	for (i = 0; i < nreq; ++i)
>  		priv->eq_table.eq[i].irq = dev->pdev->irq;
>  }

This is incorrect -- if msi_x is not set, then the function will fall
through to here and nreq will not even be initialized. If we are not
using MSI-X, then only one completion event queue will ever be used, and
so only the first two EQs need IRQs assigned.

Care to resend the first half of the patch with a proper
subject/changelog/signed-off-by/etc?
(cf Documentation/SubmittingPatches)

Thanks,
  Roland

From rdreier at cisco.com Mon Apr 20 14:50:46 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 20 Apr 2009 14:50:46 -0700
Subject: [ofa-general] Re: [PATCH] RDMA/nes: remove compiler warning nes_verbs.c:1955
In-Reply-To: <20090410170940.GA2896@ctung-MOBL> (Chien Tung's message of "Fri, 10 Apr 2009 12:09:40 -0500")
References: <20090410170940.GA2896@ctung-MOBL>
Message-ID:

thanks, applied

From rdreier at cisco.com Mon Apr 20 14:53:30 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 20 Apr 2009 14:53:30 -0700
Subject: [ofa-general] Re: [PATCH V2] RDMA/nes: Physical memory registration is incorrect
In-Reply-To: <20090410213147.GA3736@dewood-MOBL> (Don Wood's message of "Fri, 10 Apr 2009 16:31:47 -0500")
References: <20090410213147.GA3736@dewood-MOBL>
Message-ID:

thanks, applied.

From devel-ofed at morey-chaisemartin.com Mon Apr 20 14:59:14 2009
From: devel-ofed at morey-chaisemartin.com (Nicolas Morey-Chaisemartin)
Date: Mon, 20 Apr 2009 23:59:14 +0200
Subject: [ofa-general] Re: Probable bug in mlx4 driver
In-Reply-To:
References: <49E84B48.9050702@morey-chaisemartin.com>
Message-ID: <49ECF032.5030705@morey-chaisemartin.com>

On 20/04/2009 23:11, Roland Dreier wrote:
> > diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
> > index 102bac9..ae692f1 100644
> > --- a/drivers/net/mlx4/main.c
> > +++ b/drivers/net/mlx4/main.c
> > @@ -977,6 +977,7 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
> >  			goto retry;
> >  		}
> >
> > +		kfree(entries);
> >  		goto no_msi;
> >  	}
>
> This part of the patch is correct I believe -- entries is leaked
> otherwise if enabling MSI-X fails.
>
> > @@ -993,7 +994,7 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
> >  no_msi:
> >  	dev->caps.num_comp_vectors = 1;
> >
> > -	for (i = 0; i< 2; ++i)
> > +	for (i = 0; i< nreq; ++i)
> >  		priv->eq_table.eq[i].irq = dev->pdev->irq;
> >  }
>
> This is incorrect -- if msi_x is not set, then the function will fall
> through to here and nreq will not even be initialized. If we are not
> using MSI-X, then only one completion event queue will ever be used, and
> so only the first two EQs need IRQs assigned.
>

OK, I got it wrong there, going by the previous implementation, which
set far more IRQs than this:

	for (i = 0; i < MLX4_NUM_EQ; ++i)

where MLX4_NUM_EQ was the same number used when using msi_x.

> Care to resend the first half of the patch with a proper
> subject/changelog/signed-off-by/etc? (cf Documentation/SubmittingPatches)

Sure. Who/What ML shall I send it to?

Nicolas

From rdreier at cisco.com Mon Apr 20 15:05:10 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 20 Apr 2009 15:05:10 -0700
Subject: [ofa-general] Re: Probable bug in mlx4 driver
In-Reply-To: <49ECF032.5030705@morey-chaisemartin.com> (Nicolas Morey-Chaisemartin's message of "Mon, 20 Apr 2009 23:59:14 +0200")
References: <49E84B48.9050702@morey-chaisemartin.com> <49ECF032.5030705@morey-chaisemartin.com>
Message-ID:

> > Care to resend the first half of the patch with a proper
> > subject/changelog/signed-off-by/etc? (cf Documentation/SubmittingPatches)
> Sure. Who/What ML shall I send it to?

Me and the general@ list.

 - R.

From arlin.r.davis at intel.com Mon Apr 20 16:04:00 2009
From: arlin.r.davis at intel.com (Davis, Arlin R)
Date: Mon, 20 Apr 2009 16:04:00 -0700
Subject: [ofa-general] [ANNOUNCE] uDAPL v2.0 - dapl-2.0.18 release
Message-ID:

New release for uDAPL 2.0 available on the OFA download page and in my git tree.

md5sum: 03908e3940ba4a908f38ec307cd48ad6 dapl-2.0.18.tar.gz

Summary of changes:

v2 - dapltest: reset server listen ports to avoid collisions during long runs
v2 - dapltest: avoid duplicating ports, increment based on ep/thread count
v2 - dapltest: fix assumptions that multiple EP's will connect in order
v2 - common: sync missing with when removing items off of EVD pending queue
v2 - scm: reduce open time with thread start up
v2 - scm: getsockopt optlen needs initialized to size of optval
v2 - scm: cr_thread cleanup
v2 - OFED and WinOF(pre 2.1) code sync,
     Interoperability testing - dapltest with scm provider

Vlad, please pull v2 package into OFED 1.4.1 RC4 and install the following:

compat-dapl-1.2.14-1
compat-dapl-devel-1.2.14-1
dapl-2.0.18-1
dapl-utils-2.0.18-1
dapl-devel-2.0.18-1
dapl-debuginfo-2.0.18-1

See http://www.openfabrics.org/downloads/dapl/ for more details.

-arlin

From rdreier at cisco.com Mon Apr 20 17:01:01 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 20 Apr 2009 17:01:01 -0700
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: Don't zero the qp attrs when moving to IDLE.
In-Reply-To: <20090414195342.16529.35283.stgit@build.ogc.int> (Steve Wise's message of "Tue, 14 Apr 2009 14:53:42 -0500")
References: <20090414195342.16529.35283.stgit@build.ogc.int>
Message-ID:

thanks, applied

From Zhen.Liang at Sun.COM Mon Apr 20 20:08:25 2009
From: Zhen.Liang at Sun.COM (Liang Zhen)
Date: Tue, 21 Apr 2009 11:08:25 +0800
Subject: [ofa-general] OFED1.3.1: soft lockup in completion handler
Message-ID: <49ED38A9.7040208@sun.com>

Hi there,

I am hitting a soft lockup in a completion handler. To be clear, I'm
running a Mellanox Technologies MT25418 with ofed1.3.1 on a 16-core AMD
machine, and I have never seen this problem with <= 8 cores.
The completion handler looks like this:

void
kiblnd_cq_completion (struct ib_cq *cq, void *arg)
{
        spin_lock_irqsave(&conn->ibc_sched->ibs_lock, flags);

        conn->ibc_ready = 1;

        if (!conn->ibc_scheduled &&
            (conn->ibc_nrx > 0 || conn->ibc_nsends_posted > 0)) {
                kiblnd_conn_addref(conn); /* +1 ref for sched_conns */
                conn->ibc_scheduled = 1;
                list_add_tail(&conn->ibc_sched_list,
                              &conn->ibc_sched->ibs_conns);
                wake_up(&conn->ibc_sched->ibs_waitq);
        }

        spin_unlock_irqrestore(&conn->ibc_sched->ibs_lock, flags);
}

As you can see, the completion handler does basically nothing except
wake_up(); ibs_waitq is a per-CPU waitq, and the thread on the queue is a
CPU-affinity thread.

Call Trace:
[] :ko2iblnd:kiblnd_cq_completion+0x43/0xb0
[] :mlx4_core:mlx4_eq_int+0x3b/0x26f
[] :mlx4_core:mlx4_msi_x_interrupt+0xf/0x17
[] handle_IRQ_event+0x29/0x58
[] __do_IRQ+0xa4/0x103
[] :mlx4_core:poll_catas+0x0/0x13c
[] :ib_cm:cm_work_handler+0x0/0xc52
[] do_IRQ+0xe7/0xf5
[] :ib_cm:cm_work_handler+0x0/0xc52
[] ret_from_intr+0x0/0xa
[] :mlx4_core:poll_catas+0x1a/0x13c
[] :mlx4_core:poll_catas+0x0/0x13c
[] :ib_cm:cm_work_handler+0x0/0xc52
[] run_timer_softirq+0x133/0x1af
[] __do_softirq+0x5e/0xd6
[] end_msi_irq_w_maskbit+0xf/0x1c
[] call_softirq+0x1c/0x28
[] do_softirq+0x2c/0x85
[] do_IRQ+0xec/0xf5
[] ret_from_intr+0x0/0xa

In our module, we have a per-CPU thread and a per-CPU waitq. Each thread
has its own connection list to poll on; the completion handler dispatches
a connection to its scheduler and wakes the scheduler. I'm very sure that
the scheduler doesn't do any heavy or slow operation while holding
ibs_lock, and I never got this problem on systems with <= 8 cores.

It looks like the completion handler always runs on the same core, so the
completion handler races with all the other cores on their own waitqs and
is very likely to hit a soft lockup... I've tried turning off irqbalancer
and setting /proc/irq/.../smp_affinity for more cores, but that changed
nothing and I still get the soft lockup.
After I installed ofed1.4.1 and created the CQ with ib_create_cq(....comp_vector), the problem is gone and I get really good performance. The problem now is that ofed1.4.1:mlx4 seems to be the only driver that can really support multiple completion vectors, but we can't expect all customers to have the same environment...
Is there any other possible way to resolve this?

Thanks
Liang

From rdreier at cisco.com Mon Apr 20 21:12:54 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 20 Apr 2009 21:12:54 -0700
Subject: [ofa-general] [PATCH] mthca: increase INIT_HCA timeout
In-Reply-To: <20090413184657.GE22355@sgi.com> (akepner@sgi.com's message of "Mon, 13 Apr 2009 11:46:57 -0700")
References: <20090413184657.GE22355@sgi.com>
Message-ID:

thanks, applied

From rdreier at cisco.com Mon Apr 20 21:14:18 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 20 Apr 2009 21:14:18 -0700
Subject: [ofa-general] [PATCH] mthca: increase INIT_HCA timeout
In-Reply-To: (Roland Dreier's message of "Mon, 20 Apr 2009 21:12:54 -0700")
References: <20090413184657.GE22355@sgi.com>
Message-ID:

err, replied to the wrong email... I meant I applied Jack's expanded patch.

From nicolas.morey-chaisemartin at ext.bull.net Mon Apr 20 23:05:11 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey-Chaisemartin)
Date: Tue, 21 Apr 2009 08:05:11 +0200
Subject: [ofa-general] [PATCH] Fixed memory leak in drivers/net/mlx4/main.c
Message-ID: <49ED6217.5000506@ext.bull.net>

When msi_x is enabled but not enough vectors are available, the vector array was not freed.
Signed-off-by: Nicolas Morey-Chaisemartin --- Written on HEAD of ofed_1_5/linux-2.6.git drivers/net/mlx4/main.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index 102bac9..30bea96 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -976,7 +976,7 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev) nreq = err; goto retry; } - + kfree(entries); goto no_msi; } -- 1.6.2.GIT From vlad at dev.mellanox.co.il Mon Apr 20 23:33:38 2009 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Tue, 21 Apr 2009 09:33:38 +0300 Subject: [ofa-general] Re: [ANNOUNCE] uDAPL v2.0 - dapl-2.0.18 release In-Reply-To: References: Message-ID: <49ED68C2.1050004@dev.mellanox.co.il> Davis, Arlin R wrote: > > New release for uDAPL 2.0 available on the OFA download page and in my git tree. > > md5sum: 03908e3940ba4a908f38ec307cd48ad6 dapl-2.0.18.tar.gz > > Summary of changes: > > v2 - dapltest: reset server listen ports to avoid collisions during long runs > v2 - dapltest: avoid duplicating ports, increment based on ep/thread count > v2 - dapltest: fix assumptions that multiple EP's will connect in order > v2 - common: sync missing with when removing items off of EVD pending queue > v2 - scm: reduce open time with thread start up > v2 - scm: getsockopt optlen needs initialized to size of optval > v2 - scm: cr_thread cleanup > v2 - OFED and WinOF(pre 2.1) code sync, > Interoperability testing - dapltest with scm provider > > Vlad, please pull v2 package into OFED 1.4.1 RC4 and install the following: > > compat-dapl-1.2.14-1 > compat-dapl-devel-1.2.14-1 > dapl-2.0.18-1 > dapl-utils-2.0.18-1 > dapl-devel-2.0.18-1 > dapl-debuginfo-2.0.18-1 > > See http://www.openfabrics.org/downloads/dapl/ more details. 
> > -arlin > Done, Regards, Vladimir From nicolas.morey-chaisemartin at ext.bull.net Tue Apr 21 00:50:06 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey-Chaisemartin) Date: Tue, 21 Apr 2009 09:50:06 +0200 Subject: [ofa-general] [PATCH] ibutils: git-log calls have been changed to git log as git-xxx syntax is not working with latest git releases Message-ID: <49ED7AAE.3010707@ext.bull.net> Signed-off-by: Nicolas Morey-Chaisemartin --- ibdiag/src/Makefile.am | 2 +- ibdm/ibdm/Makefile.am | 2 +- ibis/src/Makefile.am | 2 +- ibmgtsim/src/Makefile.am | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/ibdiag/src/Makefile.am b/ibdiag/src/Makefile.am index def8b0a..7158bbd 100644 --- a/ibdiag/src/Makefile.am +++ b/ibdiag/src/Makefile.am @@ -42,7 +42,7 @@ GIT=$(shell which git) git_version.tcl : @MAINTAINER_MODE_TRUE@ FORCE if test x$(GIT) != x ; then \ - gitver=`cd $(srcdir) ; git-log | head -1 | cut -f2 -d\ `; \ + gitver=`cd $(srcdir) ; git log | head -1 | cut -f2 -d\ `; \ changes=`cd $(srcdir) ; git diff . | grep ^diff | wc -l`; \ else \ gitver=undefined; changes=0; \ diff --git a/ibdm/ibdm/Makefile.am b/ibdm/ibdm/Makefile.am index b0958fc..83e06c6 100644 --- a/ibdm/ibdm/Makefile.am +++ b/ibdm/ibdm/Makefile.am @@ -96,7 +96,7 @@ GIT=$(shell which git) $(srcdir)/git_version.h: @MAINTAINER_MODE_TRUE@ FORCE if test x$(GIT) != x ; then \ - gitver=`cd $(srcdir) ; git-log | head -1 | cut -f2 -d\ `; \ + gitver=`cd $(srcdir) ; git log | head -1 | cut -f2 -d\ `; \ changes=`cd $(srcdir) ; git diff . 
| grep ^diff | wc -l`; \ else \ gitver=undefined; changes=0; \ diff --git a/ibis/src/Makefile.am b/ibis/src/Makefile.am index 7f415f0..ab2e119 100644 --- a/ibis/src/Makefile.am +++ b/ibis/src/Makefile.am @@ -98,7 +98,7 @@ GIT=$(shell which git) $(srcdir)/git_version.h: @MAINTAINER_MODE_TRUE@ FORCE if test x$(GIT) != x ; then \ - gitver=`cd $(srcdir) ; git-log | head -1 | cut -f2 -d\ `; \ + gitver=`cd $(srcdir) ; git log | head -1 | cut -f2 -d\ `; \ changes=`cd $(srcdir) ; git diff . | grep ^diff | wc -l`; \ else \ gitver=undefined; changes=0; \ diff --git a/ibmgtsim/src/Makefile.am b/ibmgtsim/src/Makefile.am index 6585a11..88bddf7 100644 --- a/ibmgtsim/src/Makefile.am +++ b/ibmgtsim/src/Makefile.am @@ -95,7 +95,7 @@ GIT=$(shell which git) $(srcdir)/git_version.h: @MAINTAINER_MODE_TRUE@ FORCE if test x$(GIT) != x ; then \ - gitver=`cd $(srcdir) ; git-log | head -1 | cut -f2 -d\ `; \ + gitver=`cd $(srcdir) ; git log | head -1 | cut -f2 -d\ `; \ changes=`cd $(srcdir) ; git diff . | grep ^diff | wc -l`; \ else \ gitver=undefined; changes=0; \ -- 1.6.2.GIT From kliteyn at dev.mellanox.co.il Tue Apr 21 01:01:43 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Tue, 21 Apr 2009 11:01:43 +0300 Subject: [ofa-general] Re: ibutils building errors In-Reply-To: References: Message-ID: <49ED7D67.2060002@dev.mellanox.co.il> Hi Hal, I'm having troubles reproducing this. Works fine on x86_64 with gcc 4.1.0. What CPU and gcc version are you using? 
-- Yevgeny Hal Rosenstock wrote: > Hi Yevgeny, > > With the latest ibutils, I get the following errors when building: > > ibdm_wrap.cpp: In function `int _wrap_IBPort_width_set(void*, Tcl_Interp*, int, > Tcl_Obj* const*)': > ibdm_wrap.cpp:5410: invalid conversion from `const char*' to `char*' > ibdm_wrap.cpp: In function `int _wrap_IBPort_width_get(void*, Tcl_Interp*, int, > Tcl_Obj* const*)': > ibdm_wrap.cpp:5506: invalid conversion from `const char*' to `char*' > ibdm_wrap.cpp: In function `int _wrap_IBPort_speed_set(void*, Tcl_Interp*, int, > Tcl_Obj* const*)': > ibdm_wrap.cpp:5608: invalid conversion from `const char*' to `char*' > ibdm_wrap.cpp: In function `int _wrap_IBPort_speed_get(void*, Tcl_Interp*, int, > Tcl_Obj* const*)': > ibdm_wrap.cpp:5704: invalid conversion from `const char*' to `char*' > > Thanks for looking into this. > > -- Hal > From vlad at lists.openfabrics.org Tue Apr 21 03:21:51 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 21 Apr 2009 03:21:51 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090421-0200 daily build status Message-ID: <20090421102151.A5A0CE6105E@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with 
linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From sashak at voltaire.com Tue Apr 21 03:22:35 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 21 Apr 2009 13:22:35 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] infiniband-diags/man/vendstat.8: Indicate IS4 config group config not persistent across IS4 reboot In-Reply-To: <20090419103241.GA25675@comcast.net> References: <20090419103241.GA25675@comcast.net> Message-ID: <20090421102235.GB5797@sk> On 06:32 Sun 19 Apr , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From sashak at voltaire.com Tue Apr 21 03:22:55 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Tue, 21 Apr 2009 13:22:55 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH][TRIVIAL] opensm/osm_perfmgr.c: Eliminate duplicated error number In-Reply-To: <20090419145306.GA30667@comcast.net> References: <20090419145306.GA30667@comcast.net> Message-ID: <20090421102255.GC5797@sk> On 10:53 Sun 19 Apr , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From hal.rosenstock at gmail.com Tue Apr 21 04:07:37 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 21 Apr 2009 07:07:37 -0400 Subject: [ofa-general] ***SPAM*** Re: ibutils building errors In-Reply-To: <49ED7D67.2060002@dev.mellanox.co.il> References: <49ED7D67.2060002@dev.mellanox.co.il> Message-ID: Hi Yevgeny, On Tue, Apr 21, 2009 at 4:01 AM, Yevgeny Kliteynik wrote: > Hi Hal, > > I'm having troubles reproducing this. > Works fine on x86_64 with gcc 4.1.0. > > What CPU and gcc version are you using? x86 and gcc 3.2.2 -- Hal > -- Yevgeny > > Hal Rosenstock wrote: >> >> Hi Yevgeny, >> >> With the latest ibutils, I get the following errors when building: >> >> ibdm_wrap.cpp: In function `int _wrap_IBPort_width_set(void*, Tcl_Interp*, >> int, >>   Tcl_Obj* const*)': >> ibdm_wrap.cpp:5410: invalid conversion from `const char*' to `char*' >> ibdm_wrap.cpp: In function `int _wrap_IBPort_width_get(void*, Tcl_Interp*, >> int, >>   Tcl_Obj* const*)': >> ibdm_wrap.cpp:5506: invalid conversion from `const char*' to `char*' >> ibdm_wrap.cpp: In function `int _wrap_IBPort_speed_set(void*, Tcl_Interp*, >> int, >>   Tcl_Obj* const*)': >> ibdm_wrap.cpp:5608: invalid conversion from `const char*' to `char*' >> ibdm_wrap.cpp: In function `int _wrap_IBPort_speed_get(void*, Tcl_Interp*, >> int, >>   Tcl_Obj* const*)': >> ibdm_wrap.cpp:5704: invalid conversion from `const char*' to `char*' >> >> Thanks for looking into this. 
>> >> -- Hal >> > > From hnrose at comcast.net Tue Apr 21 04:01:10 2009 From: hnrose at comcast.net (hnrose at comcast.net) Date: Tue, 21 Apr 2009 07:01:10 -0400 Subject: [ofa-general] ***SPAM*** [PATCH] opensm/osm_perfmgr.c: Add assert in sweep_hop_1 Message-ID: <20090421110110.GA363@comcast.net> as found in osm_state_mgr.c:state_mgr_sweep_hop_1 Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c index 8d5ed97..20ee57d 100644 --- a/opensm/opensm/osm_perfmgr.c +++ b/opensm/opensm/osm_perfmgr.c @@ -568,6 +568,8 @@ static int sweep_hop_1(osm_sm_t * sm) } p_node = p_port->p_node; + CL_ASSERT(p_node); + port_num = ib_node_info_get_local_port_num(&p_node->node_info); OSM_LOG(sm->p_log, OSM_LOG_DEBUG, From hnrose at comcast.net Tue Apr 21 04:03:05 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Tue, 21 Apr 2009 07:03:05 -0400 Subject: [ofa-general] ***SPAM*** [PATCH] opensm/osm_perfmgr.c: Add assert in sweep_hop_1 Message-ID: <20090421110305.GA409@comcast.net> as found in osm_state_mgr.c:state_mgr_sweep_hop_1 Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c index 8d5ed97..20ee57d 100644 --- a/opensm/opensm/osm_perfmgr.c +++ b/opensm/opensm/osm_perfmgr.c @@ -568,6 +568,8 @@ static int sweep_hop_1(osm_sm_t * sm) } p_node = p_port->p_node; + CL_ASSERT(p_node); + port_num = ib_node_info_get_local_port_num(&p_node->node_info); OSM_LOG(sm->p_log, OSM_LOG_DEBUG, From hnrose at comcast.net Tue Apr 21 04:13:00 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Tue, 21 Apr 2009 07:13:00 -0400 Subject: [ofa-general] ***SPAM*** [PATCH] opensm/doc/performance-manager-HOWTO.txt: Indicate (previously implied) master state Message-ID: <20090421111300.GA483@comcast.net> Also, fix some typos Signed-off-by: Hal Rosenstock --- diff --git a/opensm/doc/performance-manager-HOWTO.txt b/opensm/doc/performance-manager-HOWTO.txt index 0b35e5f..11f4185 100644 --- 
a/opensm/doc/performance-manager-HOWTO.txt +++ b/opensm/doc/performance-manager-HOWTO.txt @@ -10,7 +10,7 @@ the subnet and stores them internally in OpenSM. Some of the features of the performance manager are: 1) Collect port data and error counters per v1.2 spec and store in - 64bit internal counts. + 64 bit internal counts. 2) Automatic reset of counters when they reach approximatly 3/4 full. (While not guarenteeing that counts will not be missed this does keep counts incrementing as best as possible given the current @@ -19,12 +19,13 @@ Some of the features of the performance manager are: errors. 4) Automatically detects "outside" resets of counters and adjusts to continue collecting data. - 5) Can be run when OpenSM is in standby or inactive states. + 5) Can be run when OpenSM is in standby or inactive states in + addition to master state. Known issues are: 1) Data counters will be lost on high data rate links. Sweeping the - fabric fast enough for a DDR link is not practical. + fabric fast enough for even a DDR link is not practical. 2) Default partition support only. @@ -147,7 +148,7 @@ collected. You can then use that data as appropriate. An example plugin can be configured at compile time using the "--enable-default-event-plugin" option on the configure line. This plugin is -very simple. It logs "events" recieved from the performance manager to a log -file. I don't recomend using this directly but rather use it as a templat to +very simple. It logs "events" received from the performance manager to a log +file. I don't recommend using this directly but rather use it as a template to create your own plugin. 
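The HOWTO text touched by this patch describes three behaviours: keeping 64 bit internal counts, resetting hardware counters at roughly 3/4 full, and adjusting after "outside" resets. The following is only a minimal sketch of how such bookkeeping could look, assuming a 32-bit raw counter. The names counter_state, needs_reset, counter_update and counter_mark_reset are hypothetical and are not taken from the OpenSM sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-counter bookkeeping; not an OpenSM structure. */
struct counter_state {
	uint64_t total;    /* 64-bit running total kept by the collector */
	uint32_t last_raw; /* raw hardware value seen on the previous sweep */
};

/* True once an (assumed 32-bit) raw counter has reached 3/4 of its range. */
bool needs_reset(uint32_t raw)
{
	return raw >= (UINT32_MAX / 4u) * 3u;
}

/*
 * Fold one sweep's raw reading into the 64-bit total.
 * If the raw value went backwards, assume somebody else cleared the
 * counter (an "outside" reset) and count the whole raw value as new.
 */
void counter_update(struct counter_state *c, uint32_t raw)
{
	uint64_t delta;

	if (raw >= c->last_raw)
		delta = (uint64_t)(raw - c->last_raw);
	else
		delta = raw;

	c->total += delta;
	c->last_raw = raw;
}

/* Call after the collector itself has cleared the hardware counter. */
void counter_mark_reset(struct counter_state *c)
{
	c->last_raw = 0;
}
```

A sweep loop would call counter_update() with each reading and, when needs_reset() returns true, clear the hardware counter and then call counter_mark_reset(). Counts that arrive between the read and the clear can still be missed, which matches the HOWTO's caveat that the reset does not guarantee complete counts.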
From monis at Voltaire.COM Tue Apr 21 05:04:20 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Tue, 21 Apr 2009 15:04:20 +0300
Subject: [ofa-general] ***SPAM*** Re: IB Bonding errors with recent kernel
In-Reply-To: <52436c7f0904200629q447d969cwbe12ce8bb606584b@mail.gmail.com>
References: <52436c7f0904200421s53d65a31vdcbb26babd9be196@mail.gmail.com> <49EC6505.40406@Voltaire.COM> <52436c7f0904200629q447d969cwbe12ce8bb606584b@mail.gmail.com>
Message-ID: <49EDB644.8040604@Voltaire.COM>

Dennis Portello wrote:
> I can confirm that this issue exists beyond Redhat 4, I'm using Ubuntu
> 8.10 (2.6.27).
>
> I'm using ib-bond and I've also tried adding the bonds directly with
>
> echo +bond0 > /sys/class/net/bonding_masters
> echo 1 > /sys/class/net/bond0/bonding/mode
> echo 100 > /sys/class/net/bond0/bonding/miimon
> echo +ib0 > /sys/class/net/bond0/bonding/slaves
> echo +ib1 > /sys/class/net/bond0/bonding/slaves
> ifconfig bond0 192.168.47.102/24
> route add -net 224.0.0.0/3 gw 192.168.47.100
>

I guess that what you see is the result of two issues. First, a garbage multicast address is passed to ib0 by bond0. Second, a garbage multicast address in the list of mcast addresses of interface ib0 prevents other legal addresses from joining the mcast group.
To avoid this (at least as a workaround) you should make sure that interface bond0 is not brought up before it has IB slaves. In other words, bond0 must never be up between 'modprobe bonding' and 'echo +ib0 > /sys/class/net/bond0/bonding/slaves'.

Let me know if this helps.

From herbert at gondor.apana.org.au Tue Apr 21 05:49:19 2009
From: herbert at gondor.apana.org.au (Herbert Xu)
Date: Tue, 21 Apr 2009 20:49:19 +0800
Subject: [ofa-general] NetEffect, iw_nes and kernel warning
In-Reply-To: <20090421090918.GA6034@mail.wantstofly.org>
References: <20090130065721.GA4886@gondor.apana.org.au> <20090130.135107.116108989.davem@davemloft.net> <20090421090918.GA6034@mail.wantstofly.org>
Message-ID: <20090421124919.GA18444@gondor.apana.org.au>

On Tue, Apr 21, 2009 at 11:09:18AM +0200, Lennert Buytenhek wrote:
> On Fri, Jan 30, 2009 at 07:54:12PM -0800, Roland Dreier wrote:
>
> > > > I don't believe this is accurate. Calling skb_linearize() (on a kernel
> > > > with CONFIG_HIGHMEM set) can end up calling local_bh_enable() in
> > > > kunmap_skb_frag(), which can obviously cause problems if the initial
> > > > context relies on having BHs disabled (as hard_start_xmit does).
> > >
> > > local_bh_{enable,disable}() nests, so this is not a problem
> >
> > Duh. OK, then the only bugs seem to be that iw_nes does skb_linearize
> > with irqs off (due to being an LLTX driver), and mv643xx_eth leaks an
> > skb on its error path if skb_linearize fails.
>
> (Found this when deleting old netdev@ mail...) mv643xx_eth returns
> NETDEV_TX_BUSY if skb_linearize fails, so the qdisc will requeue the
> skb, and we shouldn't free it. Am I missing something?

I don't think the issue here is the leak. Calling skb_linearize is simply illegal if you support netpoll because netpoll will call the xmit routine with IRQs off.
Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From herbert at gondor.apana.org.au Tue Apr 21 05:50:08 2009 From: herbert at gondor.apana.org.au (Herbert Xu) Date: Tue, 21 Apr 2009 20:50:08 +0800 Subject: [ofa-general] NetEffect, iw_nes and kernel warning In-Reply-To: <20090421124919.GA18444@gondor.apana.org.au> References: <20090130065721.GA4886@gondor.apana.org.au> <20090130.135107.116108989.davem@davemloft.net> <20090421090918.GA6034@mail.wantstofly.org> <20090421124919.GA18444@gondor.apana.org.au> Message-ID: <20090421125008.GA18493@gondor.apana.org.au> On Tue, Apr 21, 2009 at 08:49:19PM +0800, Herbert Xu wrote: > > I don't think the issue here is the leak. Calling skb_linearize is > simply illegal if you support netpoll because netpoll will call the > xmit routine with IRQs off. On the other hand if netpoll never generates a packet that requires linearisation, maybe it will work :) -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From hal.rosenstock at gmail.com Tue Apr 21 06:20:08 2009 From: hal.rosenstock at gmail.com (Hal Rosenstock) Date: Tue, 21 Apr 2009 09:20:08 -0400 Subject: ***SPAM*** Re: [ofa-general] ***SPAM*** Re: [PATCHv2] opensm/PerfMgr: Better redirection support In-Reply-To: References: <20090312202134.GC25024@comcast.net> <20090415125925.GF7353@sk> <20090416005422.GC10146@sk> <20090417145717.GG17631@sk> Message-ID: On Mon, Apr 20, 2009 at 10:19 AM, Hal Rosenstock wrote: >> Another thought - could p_physp->pkeys be used for index >> detection/validation? > > Yes, I was thinking that too when you said to switch the local port > determination over to the OpenSM DB. The tradeoff here is the additional locking/sync point on the OpenSM DB. 
-- Hal From buytenh at wantstofly.org Tue Apr 21 02:09:18 2009 From: buytenh at wantstofly.org (Lennert Buytenhek) Date: Tue, 21 Apr 2009 11:09:18 +0200 Subject: [ofa-general] NetEffect, iw_nes and kernel warning In-Reply-To: References: <20090130065721.GA4886@gondor.apana.org.au> <20090130.135107.116108989.davem@davemloft.net> Message-ID: <20090421090918.GA6034@mail.wantstofly.org> On Fri, Jan 30, 2009 at 07:54:12PM -0800, Roland Dreier wrote: > > > I don't believe this is accurate. Calling skb_linearize() (on a kernel > > > with CONFIG_HIGHMEM set) can end up calling local_bh_enable() in > > > kunmap_skb_frag(), which can obviously cause problems if the initial > > > context relies on having BHs disabled (as hard_start_xmit does). > > > > local_bh_{enable,disable}() nests, so this is not a problem > > Duh. OK, then the only bugs seem to be that iw_nes does skb_linearize > with irqs off (due to being an LLTX driver), and mv643xx_eth leaks an > skb on its error path if skb_linearize fails. (Found this when deleting old netdev@ mail...) mv643xx_eth returns NETDEV_TX_BUSY if skb_linearize fails, so the qdisc will requeue the skb, and we shouldn't free it. Am I missing something? 
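The ownership rule described in the message above (a transmit routine that reports "busy" has not consumed the buffer, so the queueing layer requeues it and the driver must not free it) can be modelled in a few lines of userspace C. This is a toy sketch under stated assumptions, not kernel code: tx_status, packet, tx_queue, toy_xmit and toy_dequeue_and_send are hypothetical stand-ins, and the linearize_ok flag merely simulates whether skb_linearize() would succeed.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Stand-ins for NETDEV_TX_OK / NETDEV_TX_BUSY; values are illustrative. */
enum tx_status { TX_OK = 0, TX_BUSY = 1 };

/* Toy stand-in for an sk_buff. */
struct packet {
	bool fragmented;
};

/* Toy stand-in for a qdisc holding at most one pending packet. */
struct tx_queue {
	struct packet *pending;
	int requeues;
};

/*
 * Toy transmit routine. On TX_BUSY the caller still owns pkt, so it is
 * deliberately not freed here. On TX_OK the packet has been consumed.
 */
enum tx_status toy_xmit(struct packet *pkt, bool linearize_ok)
{
	if (pkt->fragmented && !linearize_ok)
		return TX_BUSY;

	free(pkt);
	return TX_OK;
}

/* Toy queueing layer: requeue the packet if the driver reports busy. */
void toy_dequeue_and_send(struct tx_queue *q, bool linearize_ok)
{
	struct packet *pkt = q->pending;

	if (pkt == NULL)
		return;

	q->pending = NULL;
	if (toy_xmit(pkt, linearize_ok) == TX_BUSY) {
		q->pending = pkt; /* same buffer goes back on the queue */
		q->requeues++;
	}
}
```

In this model, freeing pkt on the TX_BUSY path would leave the queue holding a dangling pointer, which is why a driver that returns busy should leave the buffer alone. The sketch says nothing about the separate question of calling skb_linearize() with IRQs off.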
From kraai at ftbfs.org Tue Apr 21 05:54:41 2009 From: kraai at ftbfs.org (Matt Kraai) Date: Tue, 21 Apr 2009 05:54:41 -0700 Subject: [ofa-general] [PATCH] RDMA/nes: Remove root_256's unused pbl_count_256 parameter Message-ID: <1240318481-29913-1-git-send-email-kraai@ftbfs.org> --- drivers/infiniband/hw/nes/nes_verbs.c | 5 ++--- 1 files changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c index 7e5b5ba..0b852f4 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.c +++ b/drivers/infiniband/hw/nes/nes_verbs.c @@ -1895,8 +1895,7 @@ static int nes_destroy_cq(struct ib_cq *ib_cq) static u32 root_256(struct nes_device *nesdev, struct nes_root_vpbl *root_vpbl, struct nes_root_vpbl *new_root, - u16 pbl_count_4k, - u16 pbl_count_256) + u16 pbl_count_4k) { u64 leaf_pbl; int i, j, k; @@ -2012,7 +2011,7 @@ static int nes_reg_mr(struct nes_device *nesdev, struct nes_pd *nespd, } if (use_256_pbls && use_two_level) { - if (root_256(nesdev, root_vpbl, &new_root, pbl_count_4k, pbl_count_256) == 1) { + if (root_256(nesdev, root_vpbl, &new_root, pbl_count_4k) == 1) { if (new_root.pbl_pbase != 0) root_vpbl = &new_root; } else { -- 1.6.2.3 From kraai at ftbfs.org Tue Apr 21 06:09:01 2009 From: kraai at ftbfs.org (Matt Kraai) Date: Tue, 21 Apr 2009 06:09:01 -0700 Subject: [ofa-general] [PATCH] RDMA/nes: Remove root_256's unused pbl_count_256 parameter In-Reply-To: <1240318481-29913-1-git-send-email-kraai@ftbfs.org> References: <1240318481-29913-1-git-send-email-kraai@ftbfs.org> Message-ID: <1240319341-22363-1-git-send-email-kraai@ftbfs.org> Signed-off-by: Matt Kraai --- drivers/infiniband/hw/nes/nes_verbs.c | 5 ++--- 1 files changed, 2 insertions(+), 3 deletions(-) Sorry, I forgot to include a Signed-off-by line in the previous submission. 
diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c index 7e5b5ba..0b852f4 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.c +++ b/drivers/infiniband/hw/nes/nes_verbs.c @@ -1895,8 +1895,7 @@ static int nes_destroy_cq(struct ib_cq *ib_cq) static u32 root_256(struct nes_device *nesdev, struct nes_root_vpbl *root_vpbl, struct nes_root_vpbl *new_root, - u16 pbl_count_4k, - u16 pbl_count_256) + u16 pbl_count_4k) { u64 leaf_pbl; int i, j, k; @@ -2012,7 +2011,7 @@ static int nes_reg_mr(struct nes_device *nesdev, struct nes_pd *nespd, } if (use_256_pbls && use_two_level) { - if (root_256(nesdev, root_vpbl, &new_root, pbl_count_4k, pbl_count_256) == 1) { + if (root_256(nesdev, root_vpbl, &new_root, pbl_count_4k) == 1) { if (new_root.pbl_pbase != 0) root_vpbl = &new_root; } else { -- 1.6.2.3 From kraai at ftbfs.org Tue Apr 21 06:09:24 2009 From: kraai at ftbfs.org (Matt Kraai) Date: Tue, 21 Apr 2009 06:09:24 -0700 Subject: [ofa-general] [PATCH] RDMA/nes: Guard tmp_addr definition with CONFIG_INFINIBAND_NES_DEBUG Message-ID: <1240319364-22416-1-git-send-email-kraai@ftbfs.org> If CONFIG_INFINIBAND_NES_DEBUG is not defined, nes_debug is defined away. Since the invocation of nes_debug in find_listener is the only use of tmp_addr, GCC complains that the latter is never used: drivers/infiniband/hw/nes/nes_cm.c: In function ‘find_listener’: drivers/infiniband/hw/nes/nes_cm.c:857: warning: unused variable ‘tmp_addr’ To avoid this, only define tmp_addr if it will be used. That is, if CONFIG_INFINIBAND_NES_DEBUG is defined. 
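The warning described above can be reproduced outside the kernel. The following is a minimal userspace sketch, assuming a debug macro that expands to nothing unless a build flag is set. TOY_DEBUG, toy_debug and find_slot are hypothetical names chosen for illustration and do not appear in the nes driver.

```c
#include <stdio.h>

/* Build with -DTOY_DEBUG to enable the debug output. */
#ifdef TOY_DEBUG
#define toy_debug(fmt, ...) fprintf(stderr, fmt, __VA_ARGS__)
#else
#define toy_debug(fmt, ...) do { } while (0)
#endif

/* Return the index of addr in table, or -1 if it is not present. */
int find_slot(const unsigned int *table, int n, unsigned int addr)
{
#ifdef TOY_DEBUG
	/* Only toy_debug() reads this, so define it only when that exists. */
	unsigned int low_byte = addr & 0xffu;
#endif

	for (int i = 0; i < n; i++) {
		if (table[i] == addr) {
			toy_debug("match, low byte %u\n", low_byte);
			return i;
		}
	}
	return -1;
}
```

Without the #ifdef around low_byte, a build without TOY_DEBUG would leave the variable defined but never referenced, which is the situation GCC reports as an unused variable.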
Signed-off-by: Matt Kraai --- drivers/infiniband/hw/nes/nes_cm.c | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c index dbd9a75..9765027 100644 --- a/drivers/infiniband/hw/nes/nes_cm.c +++ b/drivers/infiniband/hw/nes/nes_cm.c @@ -854,7 +854,9 @@ static struct nes_cm_listener *find_listener(struct nes_cm_core *cm_core, { unsigned long flags; struct nes_cm_listener *listen_node; +#ifdef CONFIG_INFINIBAND_NES_DEBUG __be32 tmp_addr = cpu_to_be32(dst_addr); +#endif /* walk list and find cm_node associated with this session ID */ spin_lock_irqsave(&cm_core->listen_list_lock, flags); -- 1.6.2.3 From chien.tin.tung at intel.com Tue Apr 21 07:01:52 2009 From: chien.tin.tung at intel.com (Tung, Chien Tin) Date: Tue, 21 Apr 2009 07:01:52 -0700 Subject: [ofa-general] RE: [PATCH] RDMA/nes: Remove root_256's unused pbl_count_256 parameter In-Reply-To: <1240319341-22363-1-git-send-email-kraai@ftbfs.org> References: <1240318481-29913-1-git-send-email-kraai@ftbfs.org> <1240319341-22363-1-git-send-email-kraai@ftbfs.org> Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA383033CEA5599@azsmsx501.amr.corp.intel.com> >Signed-off-by: Matt Kraai >--- > drivers/infiniband/hw/nes/nes_verbs.c | 5 ++--- > 1 files changed, 2 insertions(+), 3 deletions(-) > > Sorry, I forgot to include a Signed-off-by line in the previous > submission. 
>
>diff --git a/drivers/infiniband/hw/nes/nes_verbs.c
>b/drivers/infiniband/hw/nes/nes_verbs.c
>index 7e5b5ba..0b852f4 100644
>--- a/drivers/infiniband/hw/nes/nes_verbs.c
>+++ b/drivers/infiniband/hw/nes/nes_verbs.c
>@@ -1895,8 +1895,7 @@ static int nes_destroy_cq(struct ib_cq *ib_cq)
> static u32 root_256(struct nes_device *nesdev,
>                     struct nes_root_vpbl *root_vpbl,
>                     struct nes_root_vpbl *new_root,
>-                    u16 pbl_count_4k,
>-                    u16 pbl_count_256)
>+                    u16 pbl_count_4k)
> {
>         u64 leaf_pbl;
>         int i, j, k;
>@@ -2012,7 +2011,7 @@ static int nes_reg_mr(struct nes_device
>*nesdev, struct nes_pd *nespd,
>         }
>
>         if (use_256_pbls && use_two_level) {
>-                if (root_256(nesdev, root_vpbl, &new_root,
>pbl_count_4k, pbl_count_256) == 1) {
>+                if (root_256(nesdev, root_vpbl, &new_root,
>pbl_count_4k) == 1) {
>                         if (new_root.pbl_pbase != 0)
>                                 root_vpbl = &new_root;
>                 } else {
>--
>1.6.2.3

Thanks for the patch.

Acked-by: Chien Tung

From celine.bourde at ext.bull.net Tue Apr 21 06:59:47 2009
From: celine.bourde at ext.bull.net (Celine Bourde)
Date: Tue, 21 Apr 2009 15:59:47 +0200
Subject: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition
Message-ID: <49EDD153.6040001@ext.bull.net>

Hi,

I can't mount an NFS/RDMA partition. I have a Linux 2.6.27 kernel with ofa_kernel-1.4.1 (provided by OFED-1.4.1-rc3).

[root at host]# modinfo xprtrdma
filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/xprtrdma/xprtrdma.ko
[root at thost]# modinfo svcrdma
filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/xprtrdma/svcrdma.ko
[..]

I've followed the instructions in http://www.openfabrics.org//downloads/OFED/ofed-1.4/OFED-1.4-docs/nfs-rdma.release-notes.txt. Every step (loading modules, /etc/exports implementation, starting nfs daemon, etc..) seems to be ok, but when I run the last command (nfs-utils-1.1.4):

mount -o rdma,port=2050 192.168.0.215:/export /tmp/nfs_client/

the mount process blocks.
My output is the following : " rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16 rpcrdma: connection to 192.168.0.215:2050 closed (-103) rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16 rpcrdma: connection to 192.168.0.215:2050 closed (-103) rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16 rpcrdma: connection to 192.168.0.215:2050 closed (-103) rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16" [root at host]#strace mount -o rdma,port=2050 192.168.0.215:/vol0 /mnt [...] read(3, "# This file controls the state o"..., 4096) = 447 read(3, "", 4096) = 0 close(3) = 0 munmap(0x7f7268080000, 4096) = 0 open("/proc/mounts", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7268080000 read(3, "rootfs / rootfs rw 0 0\n/dev/root"..., 1024) = 693 read(3, "", 1024) = 0 close(3) = 0 munmap(0x7f7268080000, 4096) = 0 open("/usr/lib/locale/locale-archive", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=56405456, ...}) = 0 mmap(NULL, 56405456, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f7264a99000 close(3) = 0 umask(022) = 022 open("/dev/null", O_RDWR) = 3 close(3) = 0 getuid() = 0 geteuid() = 0 getgid() = 0 getegid() = 0 prctl(0x3, 0, 0, 0, 0) = 1 open("/etc/blkid/blkid.tab", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=643, ...}) = 0 fcntl(3, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE) fstat(3, {st_mode=S_IFREG|0644, st_size=643, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7268080000 lseek(3, 0, SEEK_CUR) = 0 read(3, " Hi Yevgeny, I was wondering about the following in osm_qos_parser_y: Both __parser_add_pkey_range_to_port_map and __parser_add_partition_list_to_port_map seem to add the port map to both the full and limited (partial) ports for that partition. Am I reading this right ? Should it be that way ? Thanks. 
-- Hal From hnrose at comcast.net Tue Apr 21 07:47:36 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Tue, 21 Apr 2009 10:47:36 -0400 Subject: [ofa-general] ***SPAM*** [PATCH] opensm/man/opensm.8.in: Add mention of backing documentation for QoS policy file and performance manager Message-ID: <20090421144736.GA14785@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/man/opensm.8.in b/opensm/man/opensm.8.in index 5548613..66d2fe6 100644 --- a/opensm/man/opensm.8.in +++ b/opensm/man/opensm.8.in @@ -1,4 +1,4 @@ -.TH OPENSM 8 "June 13, 2008" "OpenIB" "OpenIB Management" +.TH OPENSM 8 "April 22, 2009" "OpenIB" "OpenIB Management" .SH NAME opensm \- InfiniBand subnet manager and administration (SM/SA) @@ -289,7 +289,9 @@ This option enables QoS setup. It is disabled by default. .TP \fB\-Y\fR, \fB\-\-qos_policy_file\fR This option defines the optional QoS policy file. The default -name is \fB\%@OPENSM_CONFIG_DIR@/@QOS_POLICY_FILE@\fP. +name is \fB\%@OPENSM_CONFIG_DIR@/@QOS_POLICY_FILE@\fP. See +QoS_management_in_OpenSM.txt in opensm doc for more information on +configuring QoS policy via this file. .TP \fB\-N\fR, \fB\-\-no_part_enforce\fR This option disables partition enforcement on switch external ports. @@ -311,7 +313,8 @@ this is NOT currently implemented in the performance manager. .TP \fB\-perfmgr\fR Enable the perfmgr. Only takes effect if --enable-perfmgr was specified at -configure time. +configure time. See performance-manager-HOWTO.txt in opensm doc for +more information on running perfmgr. 
.TP \fB\-perfmgr_sweep_time_s\fR Specify the sweep time for the performance manager in seconds From hnrose at comcast.net Tue Apr 21 08:12:53 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Tue, 21 Apr 2009 11:12:53 -0400 Subject: [ofa-general] ***SPAM*** [PATCH] opensm/include/opensm/osm_pkey.h: Fix commentary typo Message-ID: <20090421151253.GA15735@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/include/opensm/osm_pkey.h b/opensm/include/opensm/osm_pkey.h index 94b7207..d20b495 100644 --- a/opensm/include/opensm/osm_pkey.h +++ b/opensm/include/opensm/osm_pkey.h @@ -259,7 +259,7 @@ static inline ib_pkey_table_t *osm_pkey_tbl_block_get(const osm_pkey_tbl_t * * [in] Pointer to osm_pkey_tbl_t object. * * block -* [in] The lock number to get +* [in] The block number to get * * RETURN VALUES * The IB pkey table of that pkey table element From eli at dev.mellanox.co.il Tue Apr 21 08:25:36 2009 From: eli at dev.mellanox.co.il (Eli Cohen) Date: Tue, 21 Apr 2009 18:25:36 +0300 Subject: [ofa-general] OFED1.3.1: soft lockup in completion handler In-Reply-To: <49ED38A9.7040208@sun.com> References: <49ED38A9.7040208@sun.com> Message-ID: <20090421152536.GA11622@mtls03> On Tue, Apr 21, 2009 at 11:08:25AM +0800, Liang Zhen wrote: > In our module, we have per-CPU thread and per-CPU waitq, each thread has > it's own connection list to poll on, completion handler will dispatch > connection to it's scheduler and wake the scheduler. > I'm very sure that scheduler doesn't have any heavy & slow operation > with holding ibs_lock, and I never got this problem on system with <= 8 > cores. > Looks like completion handler is always run on the same core, so > completion handler race with all other cores on their own waitq and very > likely to get soft lockup... Where is the other place that you acquire conn->ibc_sched->ibs_lock? Is it in the per CPU thread? Maybe you should try to decrease the time when the lock is acquired at the thread. 
Can you send all references to the code acquiring the lock? > I've tried to turn off irqbalancer and set /proc/irq/.../smp_affinity > for more cores, but changed nothing and still soft lockup. > > After I installed ofed1.4.1 and create CQ with > ib_create_cq(....comp_vector), the problem is gone and get really good > performance. The problem now is, seems ofed1.4.1:mlx4 is the only driver > can really support multiple completion vectors, but we can't expect all > customers to have the same environment... > Is there only other possible way to resolve this? > > Thanks > Liang > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From hnrose at comcast.net Tue Apr 21 08:46:12 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Tue, 21 Apr 2009 11:46:12 -0400 Subject: [ofa-general] ***SPAM*** [PATCH] opensm/osm_pkey_mgr.c: Fix pkey endian in log message Message-ID: <20090421154612.GA21292@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/opensm/osm_pkey_mgr.c b/opensm/opensm/osm_pkey_mgr.c index ae16eb6..9b76ed7 100644 --- a/opensm/opensm/osm_pkey_mgr.c +++ b/opensm/opensm/osm_pkey_mgr.c @@ -119,7 +119,7 @@ pkey_mgr_process_physical_port(IN osm_log_t * p_log, OSM_LOG(p_log, OSM_LOG_ERROR, "ERR 0503: " "Failed to obtain P_Key 0x%04x block and index for node " "0x%016" PRIx64 " port %u\n", - ib_pkey_get_base(pkey), + cl_ntoh16(ib_pkey_get_base(pkey)), cl_ntoh64(osm_node_get_node_guid(p_node)), osm_physp_get_port_num(p_physp)); return; From ido at uchicago.edu Tue Apr 21 09:42:48 2009 From: ido at uchicago.edu (Ido Rosen) Date: Tue, 21 Apr 2009 12:42:48 -0400 Subject: [ofa-general] ***SPAM*** hello and gitweb improvements Message-ID: <592164a70904210942h23a12657s212574041f2a9cdf@mail.gmail.com> Hi everyone, I'm going to be helping Jeff B. and Jeff S.
improve things at openfabrics.org in my spare time. The first improvement's gone through already: The git.OpenFabrics.org gitweb now caches properly. It should be significantly faster. Try it out. You'll also be getting some additional collaboration and bug tracking / project maintenance features after things have stabilized on the new server. I plan to fix the ***SPAM*** problem this weekend. Also, if anyone is using sofa.openfabrics.org (the new server, not the current openfabrics.org), please back up all of your work from there and don't rely on it existing beyond this weekend. We're probably going to wipe it out this or next weekend. If that's a problem, email me privately and we can delay the upgrade a few extra days. Cheers, Ido From Jeffrey.C.Becker at nasa.gov Tue Apr 21 09:43:57 2009 From: Jeffrey.C.Becker at nasa.gov (Jeff Becker) Date: Tue, 21 Apr 2009 09:43:57 -0700 Subject: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition In-Reply-To: <49EDD153.6040001@ext.bull.net> References: <49EDD153.6040001@ext.bull.net> Message-ID: <49EDF7CD.9040907@nasa.gov> Hi Celine Celine Bourde wrote: > Hi, > > I can't mount an NFS/RDMA partition. > > I've a linux 2.6.27 kernel with ofa_kernel-1.4.1 (provided by OFED-1.4.1-rc3). > [root at host]# modinfo xprtrdma > filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/xprtrdma/xprtrdma.ko > [root at thost]# modinfo svcrdma > filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/xprtrdma/svcrdma.ko > [..] > > I've applied > http://www.openfabrics.org//downloads/OFED/ofed-1.4/OFED-1.4-docs/nfs-rdma.release-notes.txt > instructions. > > Every steps (loading modules, /etc/exports implementation, starting nfs daemon, > etc..) seems to be ok, but when I do the last command (nfs-utils-1.1.4): > mount -o rdma,port=2050 192.168.0.215:/export /tmp/nfs_client/ > The mount processus blocks. 
My output is the following : > " > rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16 > rpcrdma: connection to 192.168.0.215:2050 closed (-103) > rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16 > rpcrdma: connection to 192.168.0.215:2050 closed (-103) > rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16 > rpcrdma: connection to 192.168.0.215:2050 closed (-103) > rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16" > On the server did you remember to: echo rdma 2050 > /proc/fs/nfsd/portlist ? Also, make sure that none of the stock 2.6.27 NFS modules are loaded - they should all be from OFED 1.4.1 -jeff > [root at host]#strace mount -o rdma,port=2050 192.168.0.215:/vol0 /mnt > [...] > read(3, "# This file controls the state o"..., 4096) = 447 > read(3, "", 4096) = 0 > close(3) = 0 > munmap(0x7f7268080000, 4096) = 0 > open("/proc/mounts", O_RDONLY) = 3 > fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7268080000 > read(3, "rootfs / rootfs rw 0 0\n/dev/root"..., 1024) = 693 > read(3, "", 1024) = 0 > close(3) = 0 > munmap(0x7f7268080000, 4096) = 0 > open("/usr/lib/locale/locale-archive", O_RDONLY) = 3 > fstat(3, {st_mode=S_IFREG|0644, st_size=56405456, ...}) = 0 > mmap(NULL, 56405456, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f7264a99000 > close(3) = 0 > umask(022) = 022 > open("/dev/null", O_RDWR) = 3 > close(3) = 0 > getuid() = 0 > geteuid() = 0 > getgid() = 0 > getegid() = 0 > prctl(0x3, 0, 0, 0, 0) = 1 > open("/etc/blkid/blkid.tab", O_RDONLY) = 3 > fstat(3, {st_mode=S_IFREG|0644, st_size=643, ...}) = 0 > fcntl(3, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE) > fstat(3, {st_mode=S_IFREG|0644, st_size=643, ...}) = 0 > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7268080000 > lseek(3, 0, SEEK_CUR) = 0 > read(3, " read(3, "", 4096) = 0 > close(3) = 0 
> munmap(0x7f7268080000, 4096) = 0 > getuid() = 0 > geteuid() = 0 > lstat("/etc/mtab", {st_mode=S_IFREG|0644, st_size=296, ...}) = 0 > stat("192.168.0.215:/vol0", 0x7fff700810c0) = -1 ENOENT (No such file or directory) > stat("/sbin/mount.nfs", {st_mode=S_IFREG|S_ISUID|0511, st_size=287532, ...}) = 0 > clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f72680647f0) = 4991 > wait4(-1, > > and it blocks. > > Any Idea ? > > Thanks for your help. > > Céline Bourde. > > > > From jsquyres at cisco.com Tue Apr 21 09:46:54 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Tue, 21 Apr 2009 12:46:54 -0400 Subject: [ofa-general] ***SPAM*** hello and gitweb improvements In-Reply-To: <592164a70904210942h23a12657s212574041f2a9cdf@mail.gmail.com> References: <592164a70904210942h23a12657s212574041f2a9cdf@mail.gmail.com> Message-ID: Re-sending to EWG as well, just in case anyone reads that more than the high-volume "general" list. *** YOU NEED TO READ THE MAIL BELOW IF YOU'VE ALREADY GOT A GIT TREE ON sofa.openfabrics.org!! On Apr 21, 2009, at 12:42 PM, Ido Rosen wrote: > Hi everyone, > > I'm going to be helping Jeff B. and Jeff S. improve things at > openfabrics.org in my spare time. The first improvement's gone > through already: The git.OpenFabrics.org gitweb now caches properly. > It should be significantly faster. Try it out. > > You'll also be getting some additional collaboration and bug tracking > / project maintenance features after things have stabilized on the new > server. > > I plan to fix the ***SPAM*** problem this weekend. > > Also, if anyone is using sofa.openfabrics.org (the new server, not the > current openfabrics.org), please back up all of your work from there > and don't rely on it existing beyond this weekend. We're probably > going to wipe it out this or next weekend. If that's a problem, email > me privately and we can delay the upgrade a few extra days. 
> > Cheers, > Ido > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Jeff Squyres Cisco Systems From rdreier at cisco.com Tue Apr 21 10:13:47 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 21 Apr 2009 10:13:47 -0700 Subject: [ofa-general] Re: [PATCH] Fixed memory leak in drivers/net/mlx4/main.c In-Reply-To: <49ED6217.5000506@ext.bull.net> (Nicolas Morey-Chaisemartin's message of "Tue, 21 Apr 2009 08:05:11 +0200") References: <49ED6217.5000506@ext.bull.net> Message-ID: Thanks, applied. however: > Content-Type: text/plain; charset=ISO-8859-1; format=flowed I think this showed why your MUA mangled the patch. I had to delete one of the two spaces at the beginning of the context lines to get it to apply properly. I fixed this up by hand but in the future please try to get your setup so that I can just apply the patch directly from your email (you can experiment with mailing patches to yourself and trying to apply them with "git apply --check") - R. From rdreier at cisco.com Tue Apr 21 10:43:41 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 21 Apr 2009 10:43:41 -0700 Subject: [ofa-general] [PATCH] RDMA/nes: Remove root_256's unused pbl_count_256 parameter In-Reply-To: <1240319341-22363-1-git-send-email-kraai@ftbfs.org> (Matt Kraai's message of "Tue, 21 Apr 2009 06:09:01 -0700") References: <1240318481-29913-1-git-send-email-kraai@ftbfs.org> <1240319341-22363-1-git-send-email-kraai@ftbfs.org> Message-ID: thanks, applied. 
From weiny2 at llnl.gov Tue Apr 21 11:27:17 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 21 Apr 2009 11:27:17 -0700 Subject: ***SPAM*** Re: [ofa-general] link width problem of Qlogic 9024 unmanaged switch In-Reply-To: References: <20090417104429.666be0d7.weiny2@llnl.gov> Message-ID: <20090421112717.79274bc3.weiny2@llnl.gov> Hi Yicheng, On Fri, 17 Apr 2009 15:07:54 -0500 Yicheng Jia wrote: > Hi Ira, > > >Yea that is going to be a problem. The problem is that effectively you > just > > disabled the connection to the switch. A reset disables then enables > the > > port. Once the port is disabled the command can't talk to the switch > any > > longer. You will have to either reset the switch (power cycle) or go to > > another node and enable the port. From the output you sent me it looks > like > > you don't have any other nodes on the switch, so I take it you are > resetting > > the switch to get the link to come back? > > Thanks your explanation now everything is clear. Can I do "reset" by > down/enable instead of down/disable/enable so that I can reset the peer > port on the switch? Not that I know of. > > > I thought there was a warning in the man page or in the help regarding > this > > situation but I don't see it now. > > There's a warning if I try to reset a port which is not on the switch. > > > BTW, What are you trying to achieve with this command? > > Sometime there's 1x link on the subnet after reboot our system, which > consists of several HCA nodes directly connected with the switch. By > reboot, I mean restart each node. I am trying to achieve 4x link width by > using this command on 1x link port. Do you have any better idea of > resolving this problem? The way we do it around here is to go to another node and issue the request. I have a perl script in my "pragmatic infiniband utilities" (ibbouncelinks.pl) which will skip the port it is running on. 
That tarball can be found here: https://computing.llnl.gov/linux/piu.html I did not code an option to look for 1X links but I think it would be simple to do so. Thanks, Ira > > Thanks! > Yicheng Jia > > > > > > Ira Weiny > 04/17/2009 12:45 PM > > To > Yicheng Jia > cc > general at lists.openfabrics.org, Hal Rosenstock > Subject > Re: ***SPAM*** Re: [ofa-general] link width problem of Qlogic 9024 > unmanaged switch > > > > > > > On Fri, 17 Apr 2009 12:06:43 -0500 > Yicheng Jia wrote: > > > Hi Ira, > > > > Here is the output of "iblinkinfo.pl -R": > > > > ++++++++++++++++++++++++++++++++++++++++++++++++ > > [root at ib_manager ~]# iblinkinfo.pl -R > > Switch 0x00066a00d90009c1 InfiniCon System InfinIO 9024 Lite: > > 7 1[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 6 1[ ] > "MT2520 4 InfiniHostLx Mellanox Technologies" ( ) > > 2[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] > "" ( ) > > 3[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] > "" ( ) > > 4[ ] ==( 4X 2.5 Gbps Down / Polling)==> [ ] > "" ( ) > > [snip] > > > > > And the "ibstat" output: > > +++++++++++++++++++++++++++++++++++ > > [root at ib_manager ~]# ibstat > > CA 'mthca0' > > CA type: MT25204 > > Number of ports: 1 > > Firmware version: 1.2.0 > > Hardware version: a0 > > Node GUID: 0x0002c90200230784 > > System image GUID: 0x0002c90200230787 > > Port 1: > > State: Active > > Physical state: LinkUp > > Rate: 10 > > Base lid: 6 > > LMC: 0 > > SM lid: 6 > > Capability mask: 0x02500a6a > > Port GUID: 0x0002c90200230785 > > ++++++++++++++++++++++++++++++++++++++++++++++ > > > > The reset command I am using is "ibportstate 7 1 reset", I also tried > > "ibportstate -D 0,1 1 reset", and it fails with the same result. > > > > Yea that is going to be a problem. The problem is that effectively you > just > disabled the connection to the switch. A reset disables then enables the > port. Once the port is disabled the command can't talk to the switch any > longer. 
You will have to either reset the switch (power cycle) or go to > another node and enable the port. From the output you sent me it looks > like > you don't have any other nodes on the switch, so I take it you are > resetting > the switch to get the link to come back? > > I thought there was a warning in the man page or in the help regarding > this > situation but I don't see it now. > > Also, this becomes worse if you disable the port the SM is on. (Which I > see > you are doing.) So you will have a noticeable delay while the SM rescans > the > network which it is now seeing "again" for the first time. > > BTW, What are you trying to achieve with this command? > > Hope this helps, > Ira > > > _____________________________________________________________________________ > Scanned by IBM Email Security Management Services powered by MessageLabs. > For more information please visit http:// www. ers.ibm.com > _____________________________________________________________________________ > > > > _____________________________________________________________________________ > Scanned by IBM Email Security Management Services powered by MessageLabs. For more information please visit http:// www. 
ers.ibm.com > _____________________________________________________________________________ -- Ira Weiny Math Programmer/Computer Scientist Lawrence Livermore National Lab weiny2 at llnl.gov From sean.hefty at intel.com Tue Apr 21 12:02:18 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 21 Apr 2009 12:02:18 -0700 Subject: [ofa-general] [PATCH 1/4] ib-mgmt/ibn3 branch: diags updated for continued windows support Message-ID: <5A8239DBACE14A9281405317269D03A3@amr.corp.intel.com> Signed-off-by: Sean Hefty --- This patch is based on the ibn3 branch infiniband-diags/src/ibaddr.c | 1 + infiniband-diags/src/iblinkinfo.c | 4 ++-- infiniband-diags/src/ibnetdiscover.c | 2 +- infiniband-diags/src/ibsendtrap.c | 4 ++-- infiniband-diags/src/vendstat.c | 4 ++-- 5 files changed, 8 insertions(+), 7 deletions(-) diff --git a/infiniband-diags/src/ibaddr.c b/infiniband-diags/src/ibaddr.c index bb22be9..7909a52 100644 --- a/infiniband-diags/src/ibaddr.c +++ b/infiniband-diags/src/ibaddr.c @@ -39,6 +39,7 @@ #include #include #include +#include #include #include diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c index 1e43788..c6ce81b 100644 --- a/infiniband-diags/src/iblinkinfo.c +++ b/infiniband-diags/src/iblinkinfo.c @@ -48,7 +48,7 @@ #include #include -#include +#include #include char *argv0 = "iblinkinfotest"; @@ -284,7 +284,7 @@ main(int argc, char **argv) { "compat", 0, 0, 3}, { "from", 1, 0, 'f'}, { "R", 0, 0, 'R'}, - { } + { 0 } }; f = stdout; diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index 99750f0..2ca696e 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -210,7 +210,7 @@ out_chassis(ibnd_fabric_t *fabric, int chassisnum) uint64_t guid; fprintf(f, "\nChassis %d", chassisnum); - guid = ibnd_get_chassis_guid(fabric, chassisnum); + guid = ibnd_get_chassis_guid(fabric, (unsigned char) chassisnum); if (guid) fprintf(f, " (guid 0x%" PRIx64 ")", 
guid); fprintf(f, "\n"); diff --git a/infiniband-diags/src/ibsendtrap.c b/infiniband-diags/src/ibsendtrap.c index d0afca0..13f125f 100644 --- a/infiniband-diags/src/ibsendtrap.c +++ b/infiniband-diags/src/ibsendtrap.c @@ -73,7 +73,7 @@ static void build_trap129(ib_mad_notice_attr_t * n, uint16_t lid) n->issuer_lid = cl_hton16(lid); n->data_details.ntc_129_131.lid = cl_hton16(lid); n->data_details.ntc_129_131.pad = 0; - n->data_details.ntc_129_131.port_num = error_port; + n->data_details.ntc_129_131.port_num = (uint8_t) error_port; } static int send_trap(const char *name, @@ -100,7 +100,7 @@ static int send_trap(const char *name, trap_rpc.dataoffs = IB_SMP_DATA_OFFS; memset(&notice, 0, sizeof(notice)); - build(&notice, selfportid.lid); + build(&notice, (uint16_t) selfportid.lid); return mad_send_via(&trap_rpc, &sm_port, NULL, &notice, srcport); } diff --git a/infiniband-diags/src/vendstat.c b/infiniband-diags/src/vendstat.c index 240c4cb..0bf9616 100644 --- a/infiniband-diags/src/vendstat.c +++ b/infiniband-diags/src/vendstat.c @@ -184,8 +184,8 @@ void config_counter_groups(ib_portid_t *portid, int port) cg_config = (is4_config_counter_groups_t *)&buf; printf("counter_groups_config: configuring group0 %d group1 %d\n", cg0, cg1); - cg_config->group_selects[0].group_select = cg0; - cg_config->group_selects[1].group_select = cg1; + cg_config->group_selects[0].group_select = (uint8_t) cg0; + cg_config->group_selects[1].group_select = (uint8_t) cg1; if (!ib_vendor_call_via(&buf, portid, &call, srcport)) IBERROR("config counter group set"); From sean.hefty at intel.com Tue Apr 21 12:04:08 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 21 Apr 2009 12:04:08 -0700 Subject: [ofa-general] [PATCH 2/4] ib-mgmt/ibn3 branch: libibmad update for windows support In-Reply-To: <5A8239DBACE14A9281405317269D03A3@amr.corp.intel.com> References: <5A8239DBACE14A9281405317269D03A3@amr.corp.intel.com> Message-ID: <66501A5EE779464F883074D515517735@amr.corp.intel.com> Signed-off-by: Sean Hefty
--- libibmad/src/portid.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/libibmad/src/portid.c b/libibmad/src/portid.c index de9e2d3..6f8fea2 100644 --- a/libibmad/src/portid.c +++ b/libibmad/src/portid.c @@ -38,6 +38,7 @@ #include #include #include +#include #include From sean.hefty at intel.com Tue Apr 21 12:05:21 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 21 Apr 2009 12:05:21 -0700 Subject: [ofa-general] [PATCH 3/4] ib-mgmt/ibn3 branch: libibmad: remove ib_resolve_guid function prototype In-Reply-To: <5A8239DBACE14A9281405317269D03A3@amr.corp.intel.com> References: <5A8239DBACE14A9281405317269D03A3@amr.corp.intel.com> Message-ID: <6DFC19ADE28143D88C4C62F571906C49@amr.corp.intel.com> This function isn't implemented. Signed-off-by: Sean Hefty --- libibmad/include/infiniband/mad.h | 3 --- libibmad/src/libibmad.map | 1 - 2 files changed, 0 insertions(+), 4 deletions(-) diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index b8290a7..188b66b 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -844,9 +844,6 @@ MAD_EXPORT int ib_path_query_via(const struct ibmad_port *srcport, /* resolve.c */ MAD_EXPORT int ib_resolve_smlid(ib_portid_t * sm_id, int timeout) DEPRECATED; -MAD_EXPORT int ib_resolve_guid(ib_portid_t * portid, uint64_t * guid, - ib_portid_t * sm_id, int timeout) - DEPRECATED; MAD_EXPORT int ib_resolve_portid_str(ib_portid_t * portid, char *addr_str, enum MAD_DEST dest, ib_portid_t * sm_id) DEPRECATED; diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map index 4306dbc..daa9319 100644 --- a/libibmad/src/libibmad.map +++ b/libibmad/src/libibmad.map @@ -58,7 +58,6 @@ IBMAD_1.3 { mad_register_server; mad_register_client_via; mad_register_server_via; - ib_resolve_guid; ib_resolve_portid_str; ib_resolve_self; ib_resolve_smlid; From sean.hefty at intel.com Tue Apr 21 12:06:42 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Tue, 21 Apr 2009 
12:06:42 -0700 Subject: [ofa-general] [PATCH 4/4] ib-mgmt/ibn3 branch: libibnetdisc add windows support In-Reply-To: <5A8239DBACE14A9281405317269D03A3@amr.corp.intel.com> References: <5A8239DBACE14A9281405317269D03A3@amr.corp.intel.com> Message-ID: Allow libibnetdisc to build and run on Windows as part of the WinOF distribution Signed-off-by: Sean Hefty --- .../libibnetdisc/include/infiniband/ibnetdisc.h | 48 ++++++++++++----------- infiniband-diags/libibnetdisc/src/chassis.c | 4 +- infiniband-diags/libibnetdisc/src/ibnetdisc.c | 18 ++++---- infiniband-diags/libibnetdisc/src/libibnetdisc.map | 8 --- 4 files changed, 39 insertions(+), 39 deletions(-) diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h index a882994..370ae31 100644 --- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h @@ -37,6 +37,7 @@ #include #include #include +#include struct ib_fabric; /* forward declare */ struct chassis; /* forward declare */ @@ -140,11 +141,12 @@ typedef struct ib_fabric { /** ========================================================================= * Initialization (fabric operations) */ -void ibnd_debug(int i); -void ibnd_show_progress(int i); +MAD_EXPORT void ibnd_debug(int i); +MAD_EXPORT void ibnd_show_progress(int i); -ibnd_fabric_t *ibnd_discover_fabric(char *dev_name, int dev_port, - int timeout_ms, ib_portid_t *from, int hops); +MAD_EXPORT ibnd_fabric_t *ibnd_discover_fabric(char *dev_name, int dev_port, + int timeout_ms, + ib_portid_t *from, int hops); /** * dev_name: (required) local device name to use to access the fabric * dev_port: (required) local device port to use to access the fabric @@ -156,33 +158,35 @@ ibnd_fabric_t *ibnd_discover_fabric(char *dev_name, int dev_port, * hops: (optional) Specify how much of the fabric to traverse. 
* negative value == scan entire fabric */ -void ibnd_destroy_fabric(ibnd_fabric_t *fabric); +MAD_EXPORT void ibnd_destroy_fabric(ibnd_fabric_t *fabric); /** ========================================================================= * Node operations */ -ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t *fabric, uint64_t guid); -ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t *fabric, char *dr_str); -ibnd_node_t *ibnd_update_node(ibnd_node_t *node); +MAD_EXPORT ibnd_node_t *ibnd_find_node_guid(ibnd_fabric_t *fabric, uint64_t guid); +MAD_EXPORT ibnd_node_t *ibnd_find_node_dr(ibnd_fabric_t *fabric, char *dr_str); +MAD_EXPORT ibnd_node_t *ibnd_update_node(ibnd_node_t *node); typedef void (*ibnd_iter_node_func_t)(ibnd_node_t *node, void *user_data); -void ibnd_iter_nodes(ibnd_fabric_t *fabric, - ibnd_iter_node_func_t func, - void *user_data); -void ibnd_iter_nodes_type(ibnd_fabric_t *fabric, - ibnd_iter_node_func_t func, - int node_type, - void *user_data); +MAD_EXPORT void ibnd_iter_nodes(ibnd_fabric_t *fabric, + ibnd_iter_node_func_t func, + void *user_data); +MAD_EXPORT void ibnd_iter_nodes_type(ibnd_fabric_t *fabric, + ibnd_iter_node_func_t func, + int node_type, + void *user_data); /** ========================================================================= * Chassis queries */ -uint64_t ibnd_get_chassis_guid(ibnd_fabric_t *fabric, unsigned char chassisnum); -char *ibnd_get_chassis_type(ibnd_node_t *node); -char *ibnd_get_chassis_slot_str(ibnd_node_t *node, char *str, size_t size); - -int ibnd_is_xsigo_guid(uint64_t guid); -int ibnd_is_xsigo_tca(uint64_t guid); -int ibnd_is_xsigo_hca(uint64_t guid); +MAD_EXPORT uint64_t ibnd_get_chassis_guid(ibnd_fabric_t *fabric, + unsigned char chassisnum); +MAD_EXPORT char *ibnd_get_chassis_type(ibnd_node_t *node); +MAD_EXPORT char *ibnd_get_chassis_slot_str(ibnd_node_t *node, + char *str, size_t size); + +MAD_EXPORT int ibnd_is_xsigo_guid(uint64_t guid); +MAD_EXPORT int ibnd_is_xsigo_tca(uint64_t guid); +MAD_EXPORT int 
ibnd_is_xsigo_hca(uint64_t guid); #endif /* _IBNETDISC_H_ */ diff --git a/infiniband-diags/libibnetdisc/src/chassis.c b/infiniband-diags/libibnetdisc/src/chassis.c index 6b4930e..dbb0abe 100644 --- a/infiniband-diags/libibnetdisc/src/chassis.c +++ b/infiniband-diags/libibnetdisc/src/chassis.c @@ -156,6 +156,8 @@ static int is_xsigo_switch(uint64_t guid) static uint64_t xsigo_chassisguid(ibnd_node_t *node) { uint64_t sysimgguid = mad_get_field64(node->info, 0, IB_NODE_SYSTEM_GUID_F); + uint64_t remote_sysimgguid; + if (!is_xsigo_ca(sysimgguid)) { /* Byte 3 is NodeType and byte 4 is PortType */ /* If NodeType is 1 (switch), PortType is masked */ @@ -172,7 +174,7 @@ static uint64_t xsigo_chassisguid(ibnd_node_t *node) return sysimgguid; /* If peer port is Leaf 1, use its chassis GUID */ - uint64_t remote_sysimgguid = mad_get_field64( + remote_sysimgguid = mad_get_field64( node->ports[1]->remoteport->node->info, 0, IB_NODE_SYSTEM_GUID_F); if (is_xsigo_leafone(remote_sysimgguid)) diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index 479bae7..77a92e0 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -59,12 +59,13 @@ static int timeout_ms = 2000; static int show_progress = 0; +int ibdebug; void decode_port_info(ibnd_port_t *port) { - port->base_lid = mad_get_field(port->info, 0, IB_PORT_LID_F); - port->lmc = mad_get_field(port->info, 0, IB_PORT_LMC_F); + port->base_lid = (uint16_t) mad_get_field(port->info, 0, IB_PORT_LID_F); + port->lmc = (uint8_t) mad_get_field(port->info, 0, IB_PORT_LMC_F); } static int @@ -72,11 +73,12 @@ get_port_info(struct ibnd_fabric *fabric, struct ibnd_port *port, int portnum, ib_portid_t *portid) { char width[64], speed[64]; + int iwidth; + int ispeed; + port->port.portnum = portnum; - int iwidth = mad_get_field(port->port.info, 0, - IB_PORT_LINK_WIDTH_ACTIVE_F); - int ispeed = mad_get_field(port->port.info, 0, - 
IB_PORT_LINK_SPEED_ACTIVE_F); + iwidth = mad_get_field(port->port.info, 0, IB_PORT_LINK_WIDTH_ACTIVE_F); + ispeed = mad_get_field(port->port.info, 0, IB_PORT_LINK_SPEED_ACTIVE_F); if (!smp_query_via(port->port.info, portid, IB_ATTR_PORT_INFO, portnum, timeout_ms, fabric->ibmad_port)) @@ -150,8 +152,8 @@ query_node(struct ibnd_fabric *fabric, struct ibnd_node *inode, return -1; decode_port_info(port); - port->base_lid = node->smalid; /* LID is still defined by port 0 */ - port->lmc = node->smalmc; + port->base_lid = (uint16_t) node->smalid; /* LID is still defined by port 0 */ + port->lmc = (uint8_t) node->smalmc; if (!smp_query_via(node->switchinfo, portid, IB_ATTR_SWITCH_INFO, 0, timeout_ms, fabric->ibmad_port)) diff --git a/infiniband-diags/libibnetdisc/src/libibnetdisc.map b/infiniband-diags/libibnetdisc/src/libibnetdisc.map index 5e8c315..bd108ab 100644 --- a/infiniband-diags/libibnetdisc/src/libibnetdisc.map +++ b/infiniband-diags/libibnetdisc/src/libibnetdisc.map @@ -3,24 +3,16 @@ IBNETDISC_1.0 { ibnd_debug; ibnd_show_progress; ibnd_discover_fabric; - ibnd_cache_fabric; - ibnd_read_fabric; ibnd_destroy_fabric; ibnd_find_node_guid; ibnd_update_node; ibnd_find_node_dr; - ibnd_linkwidth_str; - ibnd_linkspeed_str; - ibnd_node_type_str; - ibnd_node_type_str_short; ibnd_is_xsigo_guid; ibnd_is_xsigo_tca; ibnd_is_xsigo_hca; ibnd_get_chassis_guid; ibnd_get_chassis_type; ibnd_get_chassis_slot_str; - ibnd_linkstate_str; - ibnd_physstate_str; ibnd_iter_nodes; ibnd_iter_nodes_type; local: *; From chien.tin.tung at intel.com Tue Apr 21 12:24:56 2009 From: chien.tin.tung at intel.com (Chien Tung) Date: Tue, 21 Apr 2009 14:24:56 -0500 Subject: [ofa-general] [PATCH] RDMA/nes: Fix resource issues in nes_create_cq and nes_destroy_cq Message-ID: <20090421192456.GA4072@ctung-MOBL> From: Miroslaw Walukiewicz In error paths where a CQ is not created, pbl is not freeed properly. 
In nes_destroy_cq, add the corresponding check for nescq->mcrqf to not call nes_free_resource when it is already done in nes_create_cq. Signed-off-by: Miroslaw Walukiewicz Signed-off-by: Chien Tung --- drivers/infiniband/hw/nes/nes_verbs.c | 26 +++++++++++++++++++++++++- 1 files changed, 25 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c index f04bb1a..a613080 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.c +++ b/drivers/infiniband/hw/nes/nes_verbs.c @@ -1627,6 +1627,7 @@ static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries, nescq->hw_cq.cq_number = nes_ucontext->mcrqf & 0xffff; else nescq->hw_cq.cq_number = nesvnic->mcrq_qp_id + nes_ucontext->mcrqf-1; + nescq->mcrqf = nes_ucontext->mcrqf; nes_free_resource(nesadapter, nesadapter->allocated_cqs, cq_num); } nes_debug(NES_DBG_CQ, "CQ Virtual Address = %08lX, size = %u.\n", @@ -1682,6 +1683,12 @@ static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries, if (!context) pci_free_consistent(nesdev->pcidev, nescq->cq_mem_size, mem, nescq->hw_cq.cq_pbase); + else { + pci_free_consistent(nesdev->pcidev, nespbl->pbl_size, + nespbl->pbl_vbase, nespbl->pbl_pbase); + kfree(nespbl); + } + nes_free_resource(nesadapter, nesadapter->allocated_cqs, cq_num); kfree(nescq); return ERR_PTR(-ENOMEM); @@ -1705,6 +1712,11 @@ static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries, if (!context) pci_free_consistent(nesdev->pcidev, nescq->cq_mem_size, mem, nescq->hw_cq.cq_pbase); + else { + pci_free_consistent(nesdev->pcidev, nespbl->pbl_size, + nespbl->pbl_vbase, nespbl->pbl_pbase); + kfree(nespbl); + } nes_free_resource(nesadapter, nesadapter->allocated_cqs, cq_num); kfree(nescq); return ERR_PTR(-ENOMEM); @@ -1722,6 +1734,11 @@ static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries, if (!context) pci_free_consistent(nesdev->pcidev, nescq->cq_mem_size, mem, nescq->hw_cq.cq_pbase); + else { + 
pci_free_consistent(nesdev->pcidev, nespbl->pbl_size, + nespbl->pbl_vbase, nespbl->pbl_pbase); + kfree(nespbl); + } nes_free_resource(nesadapter, nesadapter->allocated_cqs, cq_num); kfree(nescq); return ERR_PTR(-ENOMEM); @@ -1774,6 +1791,11 @@ static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries, if (!context) pci_free_consistent(nesdev->pcidev, nescq->cq_mem_size, mem, nescq->hw_cq.cq_pbase); + else { + pci_free_consistent(nesdev->pcidev, nespbl->pbl_size, + nespbl->pbl_vbase, nespbl->pbl_pbase); + kfree(nespbl); + } nes_free_resource(nesadapter, nesadapter->allocated_cqs, cq_num); kfree(nescq); return ERR_PTR(-EIO); @@ -1855,7 +1877,9 @@ static int nes_destroy_cq(struct ib_cq *ib_cq) set_wqe_32bit_value(cqp_wqe->wqe_words, NES_CQP_WQE_OPCODE_IDX, opcode); set_wqe_32bit_value(cqp_wqe->wqe_words, NES_CQP_WQE_ID_IDX, (nescq->hw_cq.cq_number | ((u32)PCI_FUNC(nesdev->pcidev->devfn) << 16))); - nes_free_resource(nesadapter, nesadapter->allocated_cqs, nescq->hw_cq.cq_number); + if (!nescq->mcrqf) + nes_free_resource(nesadapter, nesadapter->allocated_cqs, nescq->hw_cq.cq_number); + atomic_set(&cqp_request->refcount, 2); nes_post_cqp_request(nesdev, cqp_request); -- 1.5.3.3 From rdreier at cisco.com Tue Apr 21 15:14:27 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 21 Apr 2009 15:14:27 -0700 Subject: [ofa-general] [PATCH] RDMA/nes: Fix resource issues in nes_create_cq and nes_destroy_cq In-Reply-To: <20090421192456.GA4072@ctung-MOBL> (Chien Tung's message of "Tue, 21 Apr 2009 14:24:56 -0500") References: <20090421192456.GA4072@ctung-MOBL> Message-ID: thanks, applied. Would be nice to try and clean up the code in the future so a single error path was there, so you didn't have to duplicate the cleanup code so many times. - R. 
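For illustration only, here is a minimal user-space sketch of the "single error path" cleanup Roland describes above, using the common goto-unwind idiom. It is not the nes driver code. struct demo_cq, struct demo_pbl, demo_alloc(), demo_free(), demo_create_cq(), demo_destroy_cq() and the fail_step parameter are all hypothetical names made up for this sketch, and malloc()/free() stand in for the DMA allocation, pci_free_consistent() and kfree() calls in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the driver's objects; none of these names
 * come from nes_verbs.c. */
struct demo_pbl { void *vbase; };
struct demo_cq  { void *mem; struct demo_pbl *pbl; };

/* Counts live allocations so a caller can check for leaks. */
static int demo_live_allocs;

/* Allocate size bytes, or simulate an allocation failure when fail != 0. */
static void *demo_alloc(size_t size, int fail)
{
	void *p = fail ? NULL : malloc(size);

	if (p)
		demo_live_allocs++;
	return p;
}

static void demo_free(void *p)
{
	if (p) {
		demo_live_allocs--;
		free(p);
	}
}

/*
 * Create a CQ-like object. fail_step selects which allocation fails
 * (0 = none). Every failure jumps to a label that unwinds only what
 * has already been acquired, so each cleanup call is written once.
 */
static int demo_create_cq(int fail_step, struct demo_cq **out)
{
	struct demo_cq *cq;
	int err = -ENOMEM;

	assert(out != NULL);
	*out = NULL;

	cq = demo_alloc(sizeof(*cq), fail_step == 1);
	if (!cq)
		goto err_out;

	cq->pbl = demo_alloc(sizeof(*cq->pbl), fail_step == 2);
	if (!cq->pbl)
		goto err_free_cq;

	cq->pbl->vbase = demo_alloc(64, fail_step == 3);
	if (!cq->pbl->vbase)
		goto err_free_pbl;

	cq->mem = demo_alloc(256, fail_step == 4);
	if (!cq->mem)
		goto err_free_vbase;

	*out = cq;
	return 0;

	/* Labels fall through, releasing in reverse order of acquisition. */
err_free_vbase:
	demo_free(cq->pbl->vbase);
err_free_pbl:
	demo_free(cq->pbl);
err_free_cq:
	demo_free(cq);
err_out:
	return err;
}

/* Release everything a successful demo_create_cq() acquired. */
static void demo_destroy_cq(struct demo_cq *cq)
{
	if (!cq)
		return;
	demo_free(cq->mem);
	demo_free(cq->pbl->vbase);
	demo_free(cq->pbl);
	demo_free(cq);
}
```

With this layout, adding a new resource means adding one allocation and one label, rather than touching every existing error branch as the patch above has to.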
From rdreier at cisco.com Tue Apr 21 15:47:15 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 21 Apr 2009 15:47:15 -0700 Subject: [ofa-general] [PATCH] RDMA/nes: Fix resource issues in nes_create_cq and nes_destroy_cq In-Reply-To: (Roland Dreier's message of "Tue, 21 Apr 2009 15:14:27 -0700") References: <20090421192456.GA4072@ctung-MOBL> Message-ID: err... drivers/infiniband/hw/nes/nes_verbs.c: In function 'nes_create_cq': drivers/infiniband/hw/nes/nes_verbs.c:1630: error: 'struct nes_cq' has no member named 'mcrqf' drivers/infiniband/hw/nes/nes_verbs.c: In function 'nes_destroy_cq': drivers/infiniband/hw/nes/nes_verbs.c:1880: error: 'struct nes_cq' has no member named 'mcrqf' Was there a header file change that you forgot to include with the patch? - R. From chien.tin.tung at intel.com Tue Apr 21 16:13:09 2009 From: chien.tin.tung at intel.com (Chien Tung) Date: Tue, 21 Apr 2009 18:13:09 -0500 Subject: [ofa-general] [PATCH v2] RDMA/nes: Fix resource issues in nes_create_cq and nes_destroy_cq Message-ID: <20090421231308.GA6064@ctung-MOBL> From: Miroslaw Walukiewicz In error paths where a CQ is not created, pbl is not freed properly. In nes_destroy_cq, add the corresponding check for nescq->mcrqf so that nes_free_resource is not called when nes_create_cq has already called it. Signed-off-by: Miroslaw Walukiewicz Signed-off-by: Chien Tung --- V2 change: include missing mcrqf in nes_cq structure.
drivers/infiniband/hw/nes/nes_verbs.c | 26 +++++++++++++++++++++++++- drivers/infiniband/hw/nes/nes_verbs.h | 1 + 2 files changed, 26 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c index 504e31d..8b460c2 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.c +++ b/drivers/infiniband/hw/nes/nes_verbs.c @@ -1627,6 +1627,7 @@ static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries, nescq->hw_cq.cq_number = nes_ucontext->mcrqf & 0xffff; else nescq->hw_cq.cq_number = nesvnic->mcrq_qp_id + nes_ucontext->mcrqf-1; + nescq->mcrqf = nes_ucontext->mcrqf; nes_free_resource(nesadapter, nesadapter->allocated_cqs, cq_num); } nes_debug(NES_DBG_CQ, "CQ Virtual Address = %08lX, size = %u.\n", @@ -1682,6 +1683,12 @@ static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries, if (!context) pci_free_consistent(nesdev->pcidev, nescq->cq_mem_size, mem, nescq->hw_cq.cq_pbase); + else { + pci_free_consistent(nesdev->pcidev, nespbl->pbl_size, + nespbl->pbl_vbase, nespbl->pbl_pbase); + kfree(nespbl); + } + nes_free_resource(nesadapter, nesadapter->allocated_cqs, cq_num); kfree(nescq); return ERR_PTR(-ENOMEM); @@ -1705,6 +1712,11 @@ static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries, if (!context) pci_free_consistent(nesdev->pcidev, nescq->cq_mem_size, mem, nescq->hw_cq.cq_pbase); + else { + pci_free_consistent(nesdev->pcidev, nespbl->pbl_size, + nespbl->pbl_vbase, nespbl->pbl_pbase); + kfree(nespbl); + } nes_free_resource(nesadapter, nesadapter->allocated_cqs, cq_num); kfree(nescq); return ERR_PTR(-ENOMEM); @@ -1722,6 +1734,11 @@ static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries, if (!context) pci_free_consistent(nesdev->pcidev, nescq->cq_mem_size, mem, nescq->hw_cq.cq_pbase); + else { + pci_free_consistent(nesdev->pcidev, nespbl->pbl_size, + nespbl->pbl_vbase, nespbl->pbl_pbase); + kfree(nespbl); + } nes_free_resource(nesadapter, 
nesadapter->allocated_cqs, cq_num); kfree(nescq); return ERR_PTR(-ENOMEM); @@ -1774,6 +1791,11 @@ static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries, if (!context) pci_free_consistent(nesdev->pcidev, nescq->cq_mem_size, mem, nescq->hw_cq.cq_pbase); + else { + pci_free_consistent(nesdev->pcidev, nespbl->pbl_size, + nespbl->pbl_vbase, nespbl->pbl_pbase); + kfree(nespbl); + } nes_free_resource(nesadapter, nesadapter->allocated_cqs, cq_num); kfree(nescq); return ERR_PTR(-EIO); @@ -1855,7 +1877,9 @@ static int nes_destroy_cq(struct ib_cq *ib_cq) set_wqe_32bit_value(cqp_wqe->wqe_words, NES_CQP_WQE_OPCODE_IDX, opcode); set_wqe_32bit_value(cqp_wqe->wqe_words, NES_CQP_WQE_ID_IDX, (nescq->hw_cq.cq_number | ((u32)PCI_FUNC(nesdev->pcidev->devfn) << 16))); - nes_free_resource(nesadapter, nesadapter->allocated_cqs, nescq->hw_cq.cq_number); + if (!nescq->mcrqf) + nes_free_resource(nesadapter, nesadapter->allocated_cqs, nescq->hw_cq.cq_number); + atomic_set(&cqp_request->refcount, 2); nes_post_cqp_request(nesdev, cqp_request); diff --git a/drivers/infiniband/hw/nes/nes_verbs.h b/drivers/infiniband/hw/nes/nes_verbs.h index 5e48f67..41c07f2 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.h +++ b/drivers/infiniband/hw/nes/nes_verbs.h @@ -112,6 +112,7 @@ struct nes_cq { spinlock_t lock; u8 virtual_cq; u8 pad[3]; + u32 mcrqf; }; struct nes_wq { -- 1.5.3.3 From chien.tin.tung at intel.com Tue Apr 21 16:15:05 2009 From: chien.tin.tung at intel.com (Tung, Chien Tin) Date: Tue, 21 Apr 2009 16:15:05 -0700 Subject: [ofa-general] [PATCH] RDMA/nes: Fix resource issues in nes_create_cq and nes_destroy_cq In-Reply-To: References: <20090421192456.GA4072@ctung-MOBL> Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA383034BDD2416@azsmsx501.amr.corp.intel.com> >Would be nice to try and clean up the code in the future so a single >error path was there, so you didn't have to duplicate the cleanup code >so many times. Agreed. Another item for cleanup. Sorry about the missing header. 
Same mistake was made in my ofed git and I didn't carry the subsequent commit with this one. Chien From rdreier at cisco.com Tue Apr 21 16:41:32 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 21 Apr 2009 16:41:32 -0700 Subject: [ofa-general] Re: [PATCH v2] RDMA/nes: Fix resource issues in nes_create_cq and nes_destroy_cq In-Reply-To: <20090421231308.GA6064@ctung-MOBL> (Chien Tung's message of "Tue, 21 Apr 2009 18:13:09 -0500") References: <20090421231308.GA6064@ctung-MOBL> Message-ID: OK, applied this fixed patch, thanks! From chien.tin.tung at intel.com Tue Apr 21 17:32:50 2009 From: chien.tin.tung at intel.com (Chien Tung) Date: Tue, 21 Apr 2009 19:32:50 -0500 Subject: [ofa-general] [PATCH 1/4] RDMA/nes: modify thermo mitigation to flip SerDes1 ref clk to internal Message-ID: <20090422003250.GA4036@ctung-MOBL> Change thermo mitigation code to flip the SerDes1 reference clock to internal to match the change in commit a4849fc157cdbe4fb68cfe37e7222697f003deb5 Signed-off-by: Chien Tung --- I _did_ test compile this patch series. 
:-) drivers/infiniband/hw/nes/nes_hw.c | 7 ++----- 1 files changed, 2 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c index d6fc9ae..7e20a7f 100644 --- a/drivers/infiniband/hw/nes/nes_hw.c +++ b/drivers/infiniband/hw/nes/nes_hw.c @@ -550,11 +550,8 @@ struct nes_adapter *nes_init_adapter(struct nes_device *nesdev, u8 hw_rev) { msleep(1); } if (int_cnt > 1) { - u32 sds; spin_lock_irqsave(&nesadapter->phy_lock, flags); - sds = nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL1); - sds |= 0x00000040; - nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL1, sds); + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL1, 0x0000F0C8); mh_detected++; reset_value = nes_read32(nesdev->regs+NES_SOFTWARE_RESET); reset_value |= 0x0000003d; @@ -579,7 +576,7 @@ struct nes_adapter *nes_init_adapter(struct nes_device *nesdev, u8 hw_rev) { if (++ext_cnt > int_cnt) { spin_lock_irqsave(&nesadapter->phy_lock, flags); nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL1, - 0x0000F0C8); + 0x0000F088); mh_detected++; reset_value = nes_read32(nesdev->regs+NES_SOFTWARE_RESET); reset_value |= 0x0000003d; -- 1.5.3.3 From chien.tin.tung at intel.com Tue Apr 21 17:32:53 2009 From: chien.tin.tung at intel.com (Chien Tung) Date: Tue, 21 Apr 2009 19:32:53 -0500 Subject: [ofa-general] [PATCH 2/4] RDMA/nes: correct CDR loop filter setting for port 1 Message-ID: <20090422003253.GA2160@ctung-MOBL> In commit 1b9493248cf5e9f1ecc045488100cbf3ccd91be1, there is a mistake in the clean up code that removed port 1 CDR loop filter settings for 10G cards other than CX4. Put the correct setting back for appropriate PHY types. 
Signed-off-by: Chien Tung --- drivers/infiniband/hw/nes/nes_hw.c | 14 ++++++++------ 1 files changed, 8 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c index 7e20a7f..b5d9c4b 100644 --- a/drivers/infiniband/hw/nes/nes_hw.c +++ b/drivers/infiniband/hw/nes/nes_hw.c @@ -761,6 +761,9 @@ static int nes_init_serdes(struct nes_device *nesdev, u8 hw_rev, u8 port_count, return 0; /* init serdes 1 */ + if (!(OneG_Mode && (nesadapter->phy_type[1] != NES_PHY_TYPE_PUMA_1G))) + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_CDR_CONTROL1, 0x000000FF); + switch (nesadapter->phy_type[1]) { case NES_PHY_TYPE_ARGUS: case NES_PHY_TYPE_SFP_D: @@ -768,21 +771,20 @@ static int nes_init_serdes(struct nes_device *nesdev, u8 hw_rev, u8 port_count, nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_EMP1, 0x00000000); break; case NES_PHY_TYPE_CX4: - sds = nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL1); - sds &= 0xFFFFFFBF; - nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL1, sds); if (wide_ppm_offset) nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_CDR_CONTROL1, 0x000FFFAA); - else - nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_CDR_CONTROL1, 0x000000FF); break; case NES_PHY_TYPE_PUMA_1G: sds = nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL1); sds |= 0x000000100; nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL1, sds); } - if (!OneG_Mode) + if (!OneG_Mode) { nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_TX_HIGHZ_LANE_MODE1, 0x11110000); + sds = nes_read_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL1); + sds &= 0xFFFFFFBF; + nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL1, sds); + } } else { /* init serdes 0 */ nes_write_indexed(nesdev, NES_IDX_ETH_SERDES_COMMON_CONTROL0, 0x00000008); -- 1.5.3.3 From chien.tin.tung at intel.com Tue Apr 21 17:32:57 2009 From: chien.tin.tung at intel.com (Chien Tung) Date: Tue, 21 Apr 2009 19:32:57 -0500 Subject: [ofa-general] [PATCH 3/4] 
RDMA/nes: Enable repause timer for port 1 Message-ID: <20090422003257.GA5872@ctung-MOBL> Enable the repause timer for port 1. Without this setting, the chip can misbehave under stress. Signed-off-by: Chien Tung --- drivers/infiniband/hw/nes/nes_hw.c | 6 ++++++ 1 files changed, 6 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c index b5d9c4b..2aa0216 100644 --- a/drivers/infiniband/hw/nes/nes_hw.c +++ b/drivers/infiniband/hw/nes/nes_hw.c @@ -912,6 +912,12 @@ static void nes_init_csr_ne020(struct nes_device *nesdev, u8 hw_rev, u8 port_cou u32temp &= 0x7fffffff; u32temp |= 0x7fff0010; nes_write_indexed(nesdev, 0x000021f8, u32temp); + if (port_count > 1) { + u32temp = nes_read_indexed(nesdev, 0x000023f8); + u32temp &= 0x7fffffff; + u32temp |= 0x7fff0010; + nes_write_indexed(nesdev, 0x000023f8, u32temp); + } } } -- 1.5.3.3 From chien.tin.tung at intel.com Tue Apr 21 17:33:00 2009 From: chien.tin.tung at intel.com (Chien Tung) Date: Tue, 21 Apr 2009 19:33:00 -0500 Subject: [ofa-general] [PATCH 4/4] RDMA/nes: set trace length to 1 inch for SFP_D Message-ID: <20090422003300.GA3884@ctung-MOBL> With updated PHY firmware for SFP_D, setting trace length to 1 inch for SFP_D provides a more stable link.
Signed-off-by: Chien Tung --- drivers/infiniband/hw/nes/nes_hw.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c index 2aa0216..b832a7b 100644 --- a/drivers/infiniband/hw/nes/nes_hw.c +++ b/drivers/infiniband/hw/nes/nes_hw.c @@ -1371,13 +1371,14 @@ int nes_init_phy(struct nes_device *nesdev) if (phy_type == NES_PHY_TYPE_ARGUS) { nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc302, 0x000C); nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc319, 0x0008); + nes_write_10G_phy_reg(nesdev, phy_index, 0x3, 0x0027, 0x0001); } else { nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc302, 0x0004); nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc319, 0x0038); + nes_write_10G_phy_reg(nesdev, phy_index, 0x3, 0x0027, 0x0013); } nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xc31a, 0x0098); nes_write_10G_phy_reg(nesdev, phy_index, 0x3, 0x0026, 0x0E00); - nes_write_10G_phy_reg(nesdev, phy_index, 0x3, 0x0027, 0x0001); /* setup LEDs */ nes_write_10G_phy_reg(nesdev, phy_index, 0x1, 0xd006, 0x0007); -- 1.5.3.3 From chien.tin.tung at intel.com Tue Apr 21 18:00:44 2009 From: chien.tin.tung at intel.com (Chien Tung) Date: Tue, 21 Apr 2009 20:00:44 -0500 Subject: [ofa-general] [PATCH] RDMA/nes: fix fw_ver in /sys Message-ID: <20090422010044.GA4412@ctung-MOBL> /sys/class/infiniband/nes?/fw_ver is not displaying firmware version properly (0.0.0). Fill in the correct firmware version number. 
Signed-off-by: Chien Tung --- drivers/infiniband/hw/nes/nes_verbs.c | 7 +++---- 1 files changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c index 8b460c2..64d5cfd 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.c +++ b/drivers/infiniband/hw/nes/nes_verbs.c @@ -2808,10 +2808,9 @@ static ssize_t show_fw_ver(struct device *dev, struct device_attribute *attr, struct nes_vnic *nesvnic = nesibdev->nesvnic; nes_debug(NES_DBG_INIT, "\n"); - return sprintf(buf, "%x.%x.%x\n", - (int)(nesvnic->nesdev->nesadapter->fw_ver >> 32), - (int)(nesvnic->nesdev->nesadapter->fw_ver >> 16) & 0xffff, - (int)(nesvnic->nesdev->nesadapter->fw_ver & 0xffff)); + return sprintf(buf, "%u.%u\n", + (nesvnic->nesdev->nesadapter->firmware_version >> 16), + (nesvnic->nesdev->nesadapter->firmware_version & 0x000000ff)); } -- 1.5.3.3 From chien.tin.tung at intel.com Tue Apr 21 18:17:09 2009 From: chien.tin.tung at intel.com (Chien Tung) Date: Tue, 21 Apr 2009 20:17:09 -0500 Subject: [ofa-general] [PATCH] RDMA/nes: remove compile warning nes_cm.c:857 without INFINIBAND_NES_DEBUG Message-ID: <20090422011709.GA4228@ctung-MOBL> Remove the NES_DEBUG that is causing the compile warning without INFINIBAND_NES_DEBUG defined. 
Signed-off-by: Chien Tung --- drivers/infiniband/hw/nes/nes_cm.c | 4 ---- 1 files changed, 0 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c index dbd9a75..7da5437 100644 --- a/drivers/infiniband/hw/nes/nes_cm.c +++ b/drivers/infiniband/hw/nes/nes_cm.c @@ -854,7 +854,6 @@ static struct nes_cm_listener *find_listener(struct nes_cm_core *cm_core, { unsigned long flags; struct nes_cm_listener *listen_node; - __be32 tmp_addr = cpu_to_be32(dst_addr); /* walk list and find cm_node associated with this session ID */ spin_lock_irqsave(&cm_core->listen_list_lock, flags); @@ -871,9 +870,6 @@ static struct nes_cm_listener *find_listener(struct nes_cm_core *cm_core, } spin_unlock_irqrestore(&cm_core->listen_list_lock, flags); - nes_debug(NES_DBG_CM, "Unable to find listener for %pI4:%x\n", - &tmp_addr, dst_port); - /* no listener */ return NULL; } -- 1.5.3.3 From jgunthorpe at obsidianresearch.com Tue Apr 21 22:28:45 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Tue, 21 Apr 2009 23:28:45 -0600 Subject: [ofa-general] [PATCH] ibutils: git-log calls have been changed to git log as git-xxx syntax is not working with latest git releases In-Reply-To: <49ED7AAE.3010707@ext.bull.net> References: <49ED7AAE.3010707@ext.bull.net> Message-ID: <20090422052845.GA13093@obsidianresearch.com> On Tue, Apr 21, 2009 at 09:50:06AM +0200, Nicolas Morey-Chaisemartin wrote: > Signed-off-by: Nicolas Morey-Chaisemartin > > ibdiag/src/Makefile.am | 2 +- > ibdm/ibdm/Makefile.am | 2 +- > ibis/src/Makefile.am | 2 +- > ibmgtsim/src/Makefile.am | 2 +- > 4 files changed, 4 insertions(+), 4 deletions(-) > > diff --git a/ibdiag/src/Makefile.am b/ibdiag/src/Makefile.am > index def8b0a..7158bbd 100644 > +++ b/ibdiag/src/Makefile.am > @@ -42,7 +42,7 @@ GIT=$(shell which git) > > git_version.tcl : @MAINTAINER_MODE_TRUE@ FORCE > if test x$(GIT) != x ; then \ > - gitver=`cd $(srcdir) ; git-log | head -1 | cut -f2 -d\ `; \ > + 
gitver=`cd $(srcdir) ; git log | head -1 | cut -f2 -d\ `; \ > changes=`cd $(srcdir) ; git diff . | grep ^diff | wc -l`; \ Gah, that is an awful choice of command for this purpose anyhow. All of those should just be: git rev-parse --verify HEAD Which gives the same output, dramatically faster. Jason From jgunthorpe at obsidianresearch.com Tue Apr 21 22:32:03 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Tue, 21 Apr 2009 23:32:03 -0600 Subject: [ofa-general] ***SPAM*** hello and gitweb improvements In-Reply-To: <592164a70904210942h23a12657s212574041f2a9cdf@mail.gmail.com> References: <592164a70904210942h23a12657s212574041f2a9cdf@mail.gmail.com> Message-ID: <20090422053203.GB13093@obsidianresearch.com> On Tue, Apr 21, 2009 at 12:42:48PM -0400, Ido Rosen wrote: > Hi everyone, > > I'm going to be helping Jeff B. and Jeff S. improve things at > openfabrics.org in my spare time. The first improvement's gone > through already: The git.OpenFabrics.org gitweb now caches properly. > It should be significantly faster. Try it out. Oh thank you that is so much better. One vaugely related wish - it sure would be nice to have some way to know which are the repositories used to build the source tar balls that the distributions take.. Jason From ido at uchicago.edu Tue Apr 21 23:37:03 2009 From: ido at uchicago.edu (Ido Rosen) Date: Wed, 22 Apr 2009 02:37:03 -0400 Subject: [ofa-general] ***SPAM*** hello and gitweb improvements In-Reply-To: <20090422053203.GB13093@obsidianresearch.com> References: <592164a70904210942h23a12657s212574041f2a9cdf@mail.gmail.com> <20090422053203.GB13093@obsidianresearch.com> Message-ID: <592164a70904212337n3f9896c8g679fec2c98964369@mail.gmail.com> Individual repository owners can edit their .git/description files. Just label your repository appropriately by editing that file and it'll show up on gitweb. 
On Wed, Apr 22, 2009 at 1:32 AM, Jason Gunthorpe wrote: > On Tue, Apr 21, 2009 at 12:42:48PM -0400, Ido Rosen wrote: >> Hi everyone, >> >> I'm going to be helping Jeff B. and Jeff S. improve things at >> openfabrics.org in my spare time.  The first improvement's gone >> through already: The git.OpenFabrics.org gitweb now caches properly. >> It should be significantly faster.  Try it out. > > Oh thank you that is so much better. > > One vaugely related wish - it sure would be nice to have some way to > know which are the repositories used to build the source tar balls > that the distributions take.. > > Jason > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From vlad at lists.openfabrics.org Wed Apr 22 03:23:42 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 22 Apr 2009 03:23:42 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090422-0200 daily build status Message-ID: <20090422102342.5B9D3E61500@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 
with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From chien.tin.tung at intel.com Wed Apr 22 06:41:16 2009 From: chien.tin.tung at intel.com (Tung, Chien Tin) Date: Wed, 22 Apr 2009 06:41:16 -0700 Subject: [ofa-general] ***SPAM*** hello and gitweb improvements In-Reply-To: <592164a70904212337n3f9896c8g679fec2c98964369@mail.gmail.com> References: <592164a70904210942h23a12657s212574041f2a9cdf@mail.gmail.com> <20090422053203.GB13093@obsidianresearch.com> <592164a70904212337n3f9896c8g679fec2c98964369@mail.gmail.com> Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA383034BDD2877@azsmsx501.amr.corp.intel.com> I want to take a moment and thank all our volunteer administrators: Vlad, Jeff B. and Jeff S. It is a thankless job. Ido, I hope you know what you are getting yourself into. :-) >Individual repository owners can edit their .git/description files. 
>Just label your repository appropriately by editing that file and >it'll show up on gitweb. On to the thankless portion, question on gitweb. My scm directory does not show up on gitweb (~ctung). What do I need to do to get my directory hooked in? Chien From paran at nsc.liu.se Wed Apr 22 07:11:34 2009 From: paran at nsc.liu.se (=?ISO-8859-1?Q?P=E4r_Andersson?=) Date: Wed, 22 Apr 2009 16:11:34 +0200 Subject: [ofa-general] OFED version for RHEL/CentOS 5.3? In-Reply-To: <49EB175F.9050901@mellanox.co.il> References: <49E917F9.9070208@nsc.liu.se> <49EB175F.9050901@mellanox.co.il> Message-ID: <49EF2596.3020507@nsc.liu.se> Tziporet Koren wrote: > You will need to take 1.4.1 since 1.4 does not supporting CentOS 3.5 Good to know. >> How compatible is 1.4 with 1.3, if we should install that? Will MPI >> libraries and other applications continue to work or need to be >> recompiled? > I think you best options is take libraries and kernel from CentOS 5.3. > If not I would move to OFED 1.4.1 (GA soon) Thanks for the input, I installed the CentOS 5.3 RPMs for everything. I also found a way to get an idea of what OFED version that is in CentOS: # rpm -qa openib openib-1.3.2-0.20080728.0355.3.el5 Definitely newer than our own 1.3.1 packages. Regards, Pär From celine.bourde at ext.bull.net Wed Apr 22 07:41:37 2009 From: celine.bourde at ext.bull.net (Celine Bourde) Date: Wed, 22 Apr 2009 16:41:37 +0200 Subject: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition] Message-ID: <49EF2CA1.50408@ext.bull.net> > On the server did you remember to: > echo rdma 2050 > /proc/fs/nfsd/portlist Yes. My nfs utils version : [root at my_host]#/sbin/mount.nfs -V mount.nfs (linux nfs-utils 1.1.4) If I try mount.nfs command : [root at my_host ~]# strace mount.nfs 192.168.0.215:/vol0 /mnt/ -o rdma,port=2050 [..] 
socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 3 bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 fcntl(3, F_GETFL) = 0x2 (flags O_RDWR) fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 connect(3, {sa_family=AF_INET, sin_port=htons(997), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 fcntl(3, F_SETFL, O_RDWR) = 0 connect(3, {sa_family=AF_UNSPEC, sa_data="\0o\177\0\0\1\0\0\0\0\0\0\0\0"}, 16) = 0 sendto(3, "uw\211\257\0\0\0\0\0\0\0\2\0\1\206\270\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0"..., 40, 0, {sa_family=AF_INET, sin_port=htons(997), sin_addr=inet_addr("127.0.0.1")}, 16) = 40 poll([{fd=3, events=POLLIN}], 1, 3000) = 1 ([{fd=3, revents=POLLIN}]) recvfrom(3, "uw\211\257\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 400, MSG_DONTWAIT, {sa_family=AF_INET, sin_port=htons(997), sin_addr=inet_addr("127.0.0.1")}, [16]) = 24 close(3) = 0 mount("192.168.0.215:/vol0", "/mnt", "nfs", 0, "rdma,port=2050,addr=192.168.0.21" and it blocks again. >> ? Also, make sure that none of the stock 2.6.27 NFS modules are loaded - >> they should all >> be from OFED 1.4.1 > All nfs modules (OFED 1.4.1) are loaded (and dependencies). 
Modules are listed below (the updates directory means they come from the OFED version and not the kernel version): [root at my_host ~]# modinfo ib_ipoib filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko [root at my_host ~]# modinfo ib_sa filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_sa.ko [root at my_host ~]# modinfo iw_cm filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/iw_cm.ko [root at my_host ~]# modinfo ib_addr filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_addr.ko [root at my_host ~]# modinfo mlx4_core filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/net/mlx4/mlx4_core.ko [root at my_host ~]# modinfo ib_core filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_core.ko [root at my_host ~]# modinfo ib_cm filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_cm.ko [root at my_host ~]# modinfo rdma_cm filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/rdma_cm.ko [root at my_host ~]# modinfo xprtrdma filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/xprtrdma/xprtrdma.ko [root at my_host ~]# modinfo svcrdma filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/xprtrdma/svcrdma.ko [root at my_host ofa_kernel-1.4.1]# modinfo nfsd filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/fs/nfsd/nfsd.ko [root at my_host ofa_kernel-1.4.1]# modinfo lockd filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/fs/lockd/lockd.ko [root at my_host ofa_kernel-1.4.1]# modinfo nfs_acl filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/fs/nfs_common/nfs_acl.ko [root at my_host ofa_kernel-1.4.1]# modinfo sunrpc filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/sunrpc.ko [root at my_host ofa_kernel-1.4.1]# modinfo exportfs filename:
/lib/modules/2.6.27_ofa_compil/updates/kernel/fs/exportfs/exportfs.ko [root at my_host ofa_kernel-1.4.1]# modinfo auth_rpcgss filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/auth_gss/auth_rpcgss.ko Any other Idea ? Thanks. Céline. From jon at opengridcomputing.com Wed Apr 22 07:54:34 2009 From: jon at opengridcomputing.com (Jon Mason) Date: Wed, 22 Apr 2009 09:54:34 -0500 Subject: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition] In-Reply-To: <49EF2CA1.50408@ext.bull.net> References: <49EF2CA1.50408@ext.bull.net> Message-ID: <20090422145434.GB18072@opengridcomputing.com> On Wed, Apr 22, 2009 at 04:41:37PM +0200, Celine Bourde wrote: >> On the server did you remember to: >> echo rdma 2050 > /proc/fs/nfsd/portlist > > Yes. > > My nfs utils version : > > [root at my_host]#/sbin/mount.nfs -V > mount.nfs (linux nfs-utils 1.1.4) > > If I try mount.nfs command : > > [root at my_host ~]# strace mount.nfs 192.168.0.215:/vol0 /mnt/ -o rdma,port=2050 Does it work without rdma? > [..] > socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 3 > bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 > fcntl(3, F_GETFL) = 0x2 (flags O_RDWR) > fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 > connect(3, {sa_family=AF_INET, sin_port=htons(997), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 > fcntl(3, F_SETFL, O_RDWR) = 0 > connect(3, {sa_family=AF_UNSPEC, sa_data="\0o\177\0\0\1\0\0\0\0\0\0\0\0"}, 16) = 0 > sendto(3, "uw\211\257\0\0\0\0\0\0\0\2\0\1\206\270\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0"..., 40, 0, {sa_family=AF_INET, sin_port=htons(997), sin_addr=inet_addr("127.0.0.1")}, 16) = 40 > poll([{fd=3, events=POLLIN}], 1, 3000) = 1 ([{fd=3, revents=POLLIN}]) > recvfrom(3, "uw\211\257\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 400, MSG_DONTWAIT, {sa_family=AF_INET, sin_port=htons(997), sin_addr=inet_addr("127.0.0.1")}, [16]) = 24 > close(3) = 0 > mount("192.168.0.215:/vol0", "/mnt", "nfs", 0, "rdma,port=2050,addr=192.168.0.21" > > and it blocks again. 
> >>> ? Also, make sure that none of the stock 2.6.27 NFS modules are loaded - >>> they should all >>> be from OFED 1.4.1 >> > > All nfs modules (OFED 1.4.1) are loaded (and dependencies). > Modules are listed bellow -> updates directory means it > comes from OFED version and no kernel version) > > [root at my_host ~]# modinfo ib_ipoib > filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko > > [root at my_host ~]# modinfo ib_sa > filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_sa.ko > > [root at my_host ~]# modinfo iw_cm > filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/iw_cm.ko > > [root at my_host ~]# modinfo ib_addr > filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_addr.ko > > [root at my_host ~]# modinfo mlx4_core > filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/net/mlx4/mlx4_core.ko > > [root at my_host ~]# modinfo ib_core > filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_core.ko > > [root at my_host ~]# modinfo ib_cm > filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_cm.ko > > [root at my_host ~]# modinfo rdma_cm > filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/rdma_cm.ko > > [root at my_host ~]# modinfo xprtrdma > filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/xprtrdma/xprtrdma.ko > > [root at my_host ~]# modinfo svcrdma > filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/xprtrdma/svcrdma.ko > > [root at my_host ofa_kernel-1.4.1]# modinfo nfsd > filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/fs/nfsd/nfsd.ko > > [root at my_host ofa_kernel-1.4.1]# modinfo lockd > filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/fs/lockd/lockd.ko > > [root at my_host ofa_kernel-1.4.1]# modinfo nfs_acl > filename: 
/lib/modules/2.6.27_ofa_compil/updates/kernel/fs/nfs_common/nfs_acl.ko > > [root at my_host ofa_kernel-1.4.1]# modinfo sunrpc > filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/sunrpc.ko > > root at my_host ofa_kernel-1.4.1]# modinfo exportfs > filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/fs/exportfs/exportfs.ko > > [root at my_host ofa_kernel-1.4.1]# modinfo auth_rpcgss > filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/auth_gss/auth_rpcgss.ko > > Any other Idea ? Does dmesg say anything interesting? > Thanks. > > Céline. > > > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From Jeffrey.C.Becker at nasa.gov Wed Apr 22 09:34:38 2009 From: Jeffrey.C.Becker at nasa.gov (Jeff Becker) Date: Wed, 22 Apr 2009 09:34:38 -0700 Subject: [ofa-general] ***SPAM*** hello and gitweb improvements In-Reply-To: <60BEFF3FBD4C6047B0F13F205CAFA383034BDD2877@azsmsx501.amr.corp.intel.com> References: <592164a70904210942h23a12657s212574041f2a9cdf@mail.gmail.com> <20090422053203.GB13093@obsidianresearch.com> <592164a70904212337n3f9896c8g679fec2c98964369@mail.gmail.com> <60BEFF3FBD4C6047B0F13F205CAFA383034BDD2877@azsmsx501.amr.corp.intel.com> Message-ID: <49EF471E.5000001@nasa.gov> Hi Tung, Chien Tin wrote: > I want to take a moment and thank all our volunteer administrators: > Vlad, Jeff B. and Jeff S. It is a thankless job. Ido, I hope you > know what you are getting yourself into. :-) > You're welcome! > > >> Individual repository owners can edit their .git/description files. >> Just label your repository appropriately by editing that file and >> it'll show up on gitweb. >> > > On to the thankless portion, question on gitweb. My scm directory > does not show up on gitweb (~ctung). What do I need to do to get > my directory hooked in? 
> I just checked on our server (not the new sofa.openfabrics.org), and /data/scm/~ctung is symbolically linked to your ~/scm directory as it should. If you're talking about the old server, gitweb should work for you. Note it does NOT on sofa.openfabrics.org (yet). -jeff > Chien_______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From Jeffrey.C.Becker at nasa.gov Wed Apr 22 10:02:40 2009 From: Jeffrey.C.Becker at nasa.gov (Jeff Becker) Date: Wed, 22 Apr 2009 10:02:40 -0700 Subject: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition] In-Reply-To: <20090422145434.GB18072@opengridcomputing.com> References: <49EF2CA1.50408@ext.bull.net> <20090422145434.GB18072@opengridcomputing.com> Message-ID: <49EF4DB0.6010505@nasa.gov> Jon Mason wrote: > On Wed, Apr 22, 2009 at 04:41:37PM +0200, Celine Bourde wrote: > >>> On the server did you remember to: >>> echo rdma 2050 > /proc/fs/nfsd/portlist >>> >> Yes. >> >> My nfs utils version : >> >> [root at my_host]#/sbin/mount.nfs -V >> mount.nfs (linux nfs-utils 1.1.4) >> >> If I try mount.nfs command : >> >> [root at my_host ~]# strace mount.nfs 192.168.0.215:/vol0 /mnt/ -o rdma,port=2050 >> > > Does it work without rdma? > Since I have 2.6.27.3 installed on my test machines, I'll reboot and check it out. -jeff > >> [..] 
>> socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 3 >> bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 >> fcntl(3, F_GETFL) = 0x2 (flags O_RDWR) >> fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 >> connect(3, {sa_family=AF_INET, sin_port=htons(997), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 >> fcntl(3, F_SETFL, O_RDWR) = 0 >> connect(3, {sa_family=AF_UNSPEC, sa_data="\0o\177\0\0\1\0\0\0\0\0\0\0\0"}, 16) = 0 >> sendto(3, "uw\211\257\0\0\0\0\0\0\0\2\0\1\206\270\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0"..., 40, 0, {sa_family=AF_INET, sin_port=htons(997), sin_addr=inet_addr("127.0.0.1")}, 16) = 40 >> poll([{fd=3, events=POLLIN}], 1, 3000) = 1 ([{fd=3, revents=POLLIN}]) >> recvfrom(3, "uw\211\257\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 400, MSG_DONTWAIT, {sa_family=AF_INET, sin_port=htons(997), sin_addr=inet_addr("127.0.0.1")}, [16]) = 24 >> close(3) = 0 >> mount("192.168.0.215:/vol0", "/mnt", "nfs", 0, "rdma,port=2050,addr=192.168.0.21" >> >> and it blocks again. >> >> >>>> ? Also, make sure that none of the stock 2.6.27 NFS modules are loaded - >>>> they should all >>>> be from OFED 1.4.1 >>>> >> All nfs modules (OFED 1.4.1) are loaded (and dependencies). 
>> Modules are listed bellow -> updates directory means it >> comes from OFED version and no kernel version) >> >> [root at my_host ~]# modinfo ib_ipoib >> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko >> >> [root at my_host ~]# modinfo ib_sa >> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_sa.ko >> >> [root at my_host ~]# modinfo iw_cm >> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/iw_cm.ko >> >> [root at my_host ~]# modinfo ib_addr >> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_addr.ko >> >> [root at my_host ~]# modinfo mlx4_core >> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/net/mlx4/mlx4_core.ko >> >> [root at my_host ~]# modinfo ib_core >> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_core.ko >> >> [root at my_host ~]# modinfo ib_cm >> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_cm.ko >> >> [root at my_host ~]# modinfo rdma_cm >> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/rdma_cm.ko >> >> [root at my_host ~]# modinfo xprtrdma >> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/xprtrdma/xprtrdma.ko >> >> [root at my_host ~]# modinfo svcrdma >> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/xprtrdma/svcrdma.ko >> >> [root at my_host ofa_kernel-1.4.1]# modinfo nfsd >> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/fs/nfsd/nfsd.ko >> >> [root at my_host ofa_kernel-1.4.1]# modinfo lockd >> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/fs/lockd/lockd.ko >> >> [root at my_host ofa_kernel-1.4.1]# modinfo nfs_acl >> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/fs/nfs_common/nfs_acl.ko >> >> [root at my_host ofa_kernel-1.4.1]# modinfo sunrpc >> filename: 
/lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/sunrpc.ko >> >> root at my_host ofa_kernel-1.4.1]# modinfo exportfs >> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/fs/exportfs/exportfs.ko >> >> [root at my_host ofa_kernel-1.4.1]# modinfo auth_rpcgss >> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/auth_gss/auth_rpcgss.ko >> >> Any other Idea ? >> > > Does dmesg say anything interesting? > > >> Thanks. >> >> Céline. >> >> >> >> >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> From Jeffrey.C.Becker at nasa.gov Wed Apr 22 10:13:30 2009 From: Jeffrey.C.Becker at nasa.gov (Jeff Becker) Date: Wed, 22 Apr 2009 10:13:30 -0700 Subject: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition] In-Reply-To: <49EF4DB0.6010505@nasa.gov> References: <49EF2CA1.50408@ext.bull.net> <20090422145434.GB18072@opengridcomputing.com> <49EF4DB0.6010505@nasa.gov> Message-ID: <49EF503A.2010103@nasa.gov> Hi Celine Jeff Becker wrote: > Jon Mason wrote: > >> On Wed, Apr 22, 2009 at 04:41:37PM +0200, Celine Bourde wrote: >> >> >>>> On the server did you remember to: >>>> echo rdma 2050 > /proc/fs/nfsd/portlist >>>> >>>> >>> Yes. >>> >>> My nfs utils version : >>> >>> [root at my_host]#/sbin/mount.nfs -V >>> mount.nfs (linux nfs-utils 1.1.4) >>> >>> If I try mount.nfs command : >>> >>> [root at my_host ~]# strace mount.nfs 192.168.0.215:/vol0 /mnt/ -o rdma,port=2050 >>> >>> >> Does it work without rdma? >> >> > > Since I have 2.6.27.3 installed on my test machines, I'll reboot and > check it out. > Is your system SLES11 or kernel.org 2.6.27? I just encountered a build problem on my kernel.org system. The install.pl script thinks that it's supposed to use the SLES11 backport which is wrong, and the build fails. 
I'll see if I can patch the install.pl script. -jeff > -jeff > > >> >> >>> [..] >>> socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 3 >>> bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 >>> fcntl(3, F_GETFL) = 0x2 (flags O_RDWR) >>> fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 >>> connect(3, {sa_family=AF_INET, sin_port=htons(997), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 >>> fcntl(3, F_SETFL, O_RDWR) = 0 >>> connect(3, {sa_family=AF_UNSPEC, sa_data="\0o\177\0\0\1\0\0\0\0\0\0\0\0"}, 16) = 0 >>> sendto(3, "uw\211\257\0\0\0\0\0\0\0\2\0\1\206\270\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0"..., 40, 0, {sa_family=AF_INET, sin_port=htons(997), sin_addr=inet_addr("127.0.0.1")}, 16) = 40 >>> poll([{fd=3, events=POLLIN}], 1, 3000) = 1 ([{fd=3, revents=POLLIN}]) >>> recvfrom(3, "uw\211\257\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 400, MSG_DONTWAIT, {sa_family=AF_INET, sin_port=htons(997), sin_addr=inet_addr("127.0.0.1")}, [16]) = 24 >>> close(3) = 0 >>> mount("192.168.0.215:/vol0", "/mnt", "nfs", 0, "rdma,port=2050,addr=192.168.0.21" >>> >>> and it blocks again. >>> >>> >>> >>>>> ? Also, make sure that none of the stock 2.6.27 NFS modules are loaded - >>>>> they should all >>>>> be from OFED 1.4.1 >>>>> >>>>> >>> All nfs modules (OFED 1.4.1) are loaded (and dependencies). 
>>> Modules are listed bellow -> updates directory means it >>> comes from OFED version and no kernel version) >>> >>> [root at my_host ~]# modinfo ib_ipoib >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko >>> >>> [root at my_host ~]# modinfo ib_sa >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_sa.ko >>> >>> [root at my_host ~]# modinfo iw_cm >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/iw_cm.ko >>> >>> [root at my_host ~]# modinfo ib_addr >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_addr.ko >>> >>> [root at my_host ~]# modinfo mlx4_core >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/net/mlx4/mlx4_core.ko >>> >>> [root at my_host ~]# modinfo ib_core >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_core.ko >>> >>> [root at my_host ~]# modinfo ib_cm >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_cm.ko >>> >>> [root at my_host ~]# modinfo rdma_cm >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/rdma_cm.ko >>> >>> [root at my_host ~]# modinfo xprtrdma >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/xprtrdma/xprtrdma.ko >>> >>> [root at my_host ~]# modinfo svcrdma >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/xprtrdma/svcrdma.ko >>> >>> [root at my_host ofa_kernel-1.4.1]# modinfo nfsd >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/fs/nfsd/nfsd.ko >>> >>> [root at my_host ofa_kernel-1.4.1]# modinfo lockd >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/fs/lockd/lockd.ko >>> >>> [root at my_host ofa_kernel-1.4.1]# modinfo nfs_acl >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/fs/nfs_common/nfs_acl.ko >>> >>> [root at my_host ofa_kernel-1.4.1]# modinfo sunrpc >>> filename: 
/lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/sunrpc.ko >>> >>> root at my_host ofa_kernel-1.4.1]# modinfo exportfs >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/fs/exportfs/exportfs.ko >>> >>> [root at my_host ofa_kernel-1.4.1]# modinfo auth_rpcgss >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/auth_gss/auth_rpcgss.ko >>> >>> Any other Idea ? >>> >>> >> Does dmesg say anything interesting? >> >> >> >>> Thanks. >>> >>> Céline. >>> >>> >>> >>> >>> >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>> >>> > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From leonida at tx.technion.ac.il Wed Apr 22 05:18:17 2009 From: leonida at tx.technion.ac.il (Leonid Azriel) Date: Wed, 22 Apr 2009 15:18:17 +0300 Subject: [ofa-general] Registering physical memory region Message-ID: <49EF0B09.2000804@tx.technion.ac.il> Hi, Is there a way to register physical memory with HCA from a user application. Tried mmap to map it to the virtual memory, but ibv_reg_mr fails with bad address. The memory region is physically located in the IO (PCI) space. Please advise. Thanks, Leonid. 
From Jeffrey.C.Becker at nasa.gov Wed Apr 22 10:22:50 2009 From: Jeffrey.C.Becker at nasa.gov (Jeff Becker) Date: Wed, 22 Apr 2009 10:22:50 -0700 Subject: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition] In-Reply-To: <49EF503A.2010103@nasa.gov> References: <49EF2CA1.50408@ext.bull.net> <20090422145434.GB18072@opengridcomputing.com> <49EF4DB0.6010505@nasa.gov> <49EF503A.2010103@nasa.gov> Message-ID: <49EF526A.4050206@nasa.gov> Jeff Becker wrote: > Hi Celine > > Jeff Becker wrote: > >> Jon Mason wrote: >> >> >>> On Wed, Apr 22, 2009 at 04:41:37PM +0200, Celine Bourde wrote: >>> >>> >>> >>>>> On the server did you remember to: >>>>> echo rdma 2050 > /proc/fs/nfsd/portlist >>>>> >>>>> >>>>> >>>> Yes. >>>> >>>> My nfs utils version : >>>> >>>> [root at my_host]#/sbin/mount.nfs -V >>>> mount.nfs (linux nfs-utils 1.1.4) >>>> >>>> If I try mount.nfs command : >>>> >>>> [root at my_host ~]# strace mount.nfs 192.168.0.215:/vol0 /mnt/ -o rdma,port=2050 >>>> >>>> >>>> >>> Does it work without rdma? >>> >>> >>> >> Since I have 2.6.27.3 installed on my test machines, I'll reboot and >> check it out. >> >> > > Is your system SLES11 or kernel.org 2.6.27? I just encountered a build > problem on my kernel.org > system. The install.pl script thinks that it's supposed to use the > SLES11 backport which is wrong, and the > build fails. I'll see if I can patch the install.pl script. > > -jeff > >> -jeff >> Sorry - my bad. Since my test machines are a SLES base (with alternate kernels), install.pl correctly detects this, and causes it to think SLES11. I'll build and install directly from the OFED source. -jeff >> >> >>> >>> >>> >>>> [..] 
>>>> socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 3 >>>> bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 >>>> fcntl(3, F_GETFL) = 0x2 (flags O_RDWR) >>>> fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 >>>> connect(3, {sa_family=AF_INET, sin_port=htons(997), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 >>>> fcntl(3, F_SETFL, O_RDWR) = 0 >>>> connect(3, {sa_family=AF_UNSPEC, sa_data="\0o\177\0\0\1\0\0\0\0\0\0\0\0"}, 16) = 0 >>>> sendto(3, "uw\211\257\0\0\0\0\0\0\0\2\0\1\206\270\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0"..., 40, 0, {sa_family=AF_INET, sin_port=htons(997), sin_addr=inet_addr("127.0.0.1")}, 16) = 40 >>>> poll([{fd=3, events=POLLIN}], 1, 3000) = 1 ([{fd=3, revents=POLLIN}]) >>>> recvfrom(3, "uw\211\257\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 400, MSG_DONTWAIT, {sa_family=AF_INET, sin_port=htons(997), sin_addr=inet_addr("127.0.0.1")}, [16]) = 24 >>>> close(3) = 0 >>>> mount("192.168.0.215:/vol0", "/mnt", "nfs", 0, "rdma,port=2050,addr=192.168.0.21" >>>> >>>> and it blocks again. >>>> >>>> >>>> >>>> >>>>>> ? Also, make sure that none of the stock 2.6.27 NFS modules are loaded - >>>>>> they should all >>>>>> be from OFED 1.4.1 >>>>>> >>>>>> >>>>>> >>>> All nfs modules (OFED 1.4.1) are loaded (and dependencies). 
>>>> Modules are listed bellow -> updates directory means it >>>> comes from OFED version and no kernel version) >>>> >>>> [root at my_host ~]# modinfo ib_ipoib >>>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko >>>> >>>> [root at my_host ~]# modinfo ib_sa >>>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_sa.ko >>>> >>>> [root at my_host ~]# modinfo iw_cm >>>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/iw_cm.ko >>>> >>>> [root at my_host ~]# modinfo ib_addr >>>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_addr.ko >>>> >>>> [root at my_host ~]# modinfo mlx4_core >>>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/net/mlx4/mlx4_core.ko >>>> >>>> [root at my_host ~]# modinfo ib_core >>>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_core.ko >>>> >>>> [root at my_host ~]# modinfo ib_cm >>>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_cm.ko >>>> >>>> [root at my_host ~]# modinfo rdma_cm >>>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/rdma_cm.ko >>>> >>>> [root at my_host ~]# modinfo xprtrdma >>>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/xprtrdma/xprtrdma.ko >>>> >>>> [root at my_host ~]# modinfo svcrdma >>>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/xprtrdma/svcrdma.ko >>>> >>>> [root at my_host ofa_kernel-1.4.1]# modinfo nfsd >>>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/fs/nfsd/nfsd.ko >>>> >>>> [root at my_host ofa_kernel-1.4.1]# modinfo lockd >>>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/fs/lockd/lockd.ko >>>> >>>> [root at my_host ofa_kernel-1.4.1]# modinfo nfs_acl >>>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/fs/nfs_common/nfs_acl.ko >>>> >>>> [root at my_host ofa_kernel-1.4.1]# 
modinfo sunrpc >>>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/sunrpc.ko >>>> >>>> root at my_host ofa_kernel-1.4.1]# modinfo exportfs >>>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/fs/exportfs/exportfs.ko >>>> >>>> [root at my_host ofa_kernel-1.4.1]# modinfo auth_rpcgss >>>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/auth_gss/auth_rpcgss.ko >>>> >>>> Any other Idea ? >>>> >>>> >>>> >>> Does dmesg say anything interesting? >>> >>> >>> >>> >>>> Thanks. >>>> >>>> Céline. >>>> >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> general mailing list >>>> general at lists.openfabrics.org >>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>> >>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>>> >>>> >>>> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> >> > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From sashak at voltaire.com Wed Apr 22 10:32:25 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 22 Apr 2009 20:32:25 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] opensm/man/opensm.8.in: Add mention of backing documentation for QoS policy file and performance manager In-Reply-To: <20090421144736.GA14785@comcast.net> References: <20090421144736.GA14785@comcast.net> Message-ID: <20090422173225.GA15862@sk> On 10:47 Tue 21 Apr , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From worleys at gmail.com Wed Apr 22 10:36:30 2009 From: worleys at gmail.com (Chris Worley) Date: Wed, 22 Apr 2009 11:36:30 -0600 Subject: [ofa-general] ***SPAM*** How to tell what OFED rev a distro derived IB modules? Message-ID: Using an Ubuntu 8.10 distro w/ a 2.6.27-11 kernel, I'm wondering: from what OFED version were their built-in IB modules derived? The only version message I see is from the MLX4 driver: mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April 4, 2008) Would that imply 1.3? Thanks, Chris From sashak at voltaire.com Wed Apr 22 10:33:13 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 22 Apr 2009 20:33:13 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] opensm/include/opensm/osm_pkey.h: Fix commentary typo In-Reply-To: <20090421151253.GA15735@comcast.net> References: <20090421151253.GA15735@comcast.net> Message-ID: <20090422173313.GB15862@sk> On 11:12 Tue 21 Apr , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. Sasha From sashak at voltaire.com Wed Apr 22 10:38:14 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Wed, 22 Apr 2009 20:38:14 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH] opensm/osm_pkey_mgr.c: Fix pkey endian in log message In-Reply-To: <20090421154612.GA21292@comcast.net> References: <20090421154612.GA21292@comcast.net> Message-ID: <20090422173814.GC15862@sk> On 11:46 Tue 21 Apr , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From chien.tin.tung at intel.com Wed Apr 22 11:24:59 2009 From: chien.tin.tung at intel.com (Tung, Chien Tin) Date: Wed, 22 Apr 2009 11:24:59 -0700 Subject: [ofa-general] ***SPAM*** hello and gitweb improvements In-Reply-To: <49EF471E.5000001@nasa.gov> References: <592164a70904210942h23a12657s212574041f2a9cdf@mail.gmail.com> <20090422053203.GB13093@obsidianresearch.com> <592164a70904212337n3f9896c8g679fec2c98964369@mail.gmail.com> <60BEFF3FBD4C6047B0F13F205CAFA383034BDD2877@azsmsx501.amr.corp.intel.com> <49EF471E.5000001@nasa.gov> Message-ID: <60BEFF3FBD4C6047B0F13F205CAFA383034C28BA92@azsmsx501.amr.corp.intel.com> >> On to the thankless portion, question on gitweb. My scm directory >> does not show up on gitweb (~ctung). What do I need to do to get >> my directory hooked in? >> > >I just checked on our server (not the new sofa.openfabrics.org), and >/data/scm/~ctung >is symbolically linked to your ~/scm directory as it should. If you're >talking about >the old server, gitweb should work for you. Note it does NOT on >sofa.openfabrics.org (yet). I see my links now. Thanks. Chien From faisal.latif at intel.com Wed Apr 22 11:59:58 2009 From: faisal.latif at intel.com (Faisal Latif) Date: Wed, 22 Apr 2009 13:59:58 -0500 Subject: [ofa-general] [PATCH 1/4] RDMA/nes: Do not set apbvt entry for loopback Message-ID: <20090422185958.GA22016@flatif-MOBL> When a connect request arrives, the apbvt entry is set only for non-loopback connections.
Signed-off-by: Faisal Latif --- drivers/infiniband/hw/nes/nes_cm.c | 10 ++++++---- 1 files changed, 6 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c index dbd9a75..aa3c631 100644 --- a/drivers/infiniband/hw/nes/nes_cm.c +++ b/drivers/infiniband/hw/nes/nes_cm.c @@ -2959,6 +2959,7 @@ int nes_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) struct nes_device *nesdev; struct nes_cm_node *cm_node; struct nes_cm_info cm_info; + int apbvt_set = 0; ibqp = nes_get_qp(cm_id->device, conn_param->qpn); if (!ibqp) @@ -2996,9 +2997,11 @@ int nes_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) conn_param->private_data_len); if (cm_id->local_addr.sin_addr.s_addr != - cm_id->remote_addr.sin_addr.s_addr) + cm_id->remote_addr.sin_addr.s_addr) { nes_manage_apbvt(nesvnic, ntohs(cm_id->local_addr.sin_port), PCI_FUNC(nesdev->pcidev->devfn), NES_MANAGE_APBVT_ADD); + apbvt_set = 1; + } /* set up the connection params for the node */ cm_info.loc_addr = htonl(cm_id->local_addr.sin_addr.s_addr); @@ -3015,8 +3018,7 @@ int nes_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) conn_param->private_data_len, (void *)conn_param->private_data, &cm_info); if (!cm_node) { - if (cm_id->local_addr.sin_addr.s_addr != - cm_id->remote_addr.sin_addr.s_addr) + if (apbvt_set) nes_manage_apbvt(nesvnic, ntohs(cm_id->local_addr.sin_port), PCI_FUNC(nesdev->pcidev->devfn), NES_MANAGE_APBVT_DEL); @@ -3025,7 +3027,7 @@ int nes_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) return -ENOMEM; } - cm_node->apbvt_set = 1; + cm_node->apbvt_set = apbvt_set; nesqp->cm_node = cm_node; cm_node->nesqp = nesqp; nes_add_ref(&nesqp->ibqp); -- 1.5.3.3 From faisal.latif at intel.com Wed Apr 22 12:05:14 2009 From: faisal.latif at intel.com (Faisal Latif) Date: Wed, 22 Apr 2009 14:05:14 -0500 Subject: [ofa-general] [PATCH 2/4] RDMA/nes: Check for seq# wrap around Message-ID: 
<20090422190514.GA20864@flatif-MOBL> check_seq() does not handle the case where the seq# has wrapped. Signed-off-by: Faisal Latif --- drivers/infiniband/hw/nes/nes_cm.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c index aa3c631..851d62d 100644 --- a/drivers/infiniband/hw/nes/nes_cm.c +++ b/drivers/infiniband/hw/nes/nes_cm.c @@ -56,6 +56,7 @@ #include #include #include +#include #include "nes.h" @@ -1518,7 +1519,7 @@ static int check_seq(struct nes_cm_node *cm_node, struct tcphdr *tcph, rcv_wnd = cm_node->tcp_cntxt.rcv_wnd; if (ack_seq != loc_seq_num) err = 1; - else if ((seq + rcv_wnd) < rcv_nxt) + else if (!between(seq, rcv_nxt, (rcv_nxt+rcv_wnd))) err = 1; if (err) { nes_debug(NES_DBG_CM, "%s[%u] create abort for cm_node=%p " -- 1.5.3.3 From faisal.latif at intel.com Wed Apr 22 12:07:08 2009 From: faisal.latif at intel.com (Faisal Latif) Date: Wed, 22 Apr 2009 14:07:08 -0500 Subject: [ofa-general] [PATCH 3/4] RDMA/nes: increase rexmit timeout interval Message-ID: <20090422190708.GA17456@flatif-MOBL> Under heavy cluster testing, it may take longer to receive a response to an MPA request. Change the rexmit timer to wait longer after each rexmit, up to a maximum time value.
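For readers following the rexmit timeout change described above, here is a minimal user-space sketch of the general idea (double the wait after each rexmit, clamp at a maximum). It is only an illustration under stated assumptions: the function name next_retry_delay() and the constants RETRY_TIMEOUT, DEFAULT_RETRANS and MAX_TIMEOUT are hypothetical stand-ins with made-up values, not the driver's definitions.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-ins for the driver's constants; values are illustrative only. */
#define RETRY_TIMEOUT   3000UL   /* base interval, in ticks */
#define DEFAULT_RETRANS 8U       /* total retransmit budget */
#define MAX_TIMEOUT     12000UL  /* upper bound on any single interval */

/*
 * Return the delay before the next retransmit, given how many
 * retransmits are still left in the budget. The delay doubles for
 * every retransmit already used and is clamped to MAX_TIMEOUT.
 */
static unsigned long next_retry_delay(unsigned int retrans_left)
{
    unsigned int used;
    unsigned long delay;

    assert(retrans_left <= DEFAULT_RETRANS);
    used = DEFAULT_RETRANS - retrans_left;

    /* If the shift would overflow or drop bits, the cap applies anyway. */
    if (used >= sizeof(unsigned long) * CHAR_BIT ||
        RETRY_TIMEOUT > (ULONG_MAX >> used))
        return MAX_TIMEOUT;

    delay = RETRY_TIMEOUT << used;
    return delay < MAX_TIMEOUT ? delay : MAX_TIMEOUT;
}
```

With these assumed values the delay goes 3000, 6000, 12000 ticks and then stays at the cap, so a slow peer gets progressively more time to answer without any single wait growing unbounded.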
Signed-off-by: Faisal Latif --- drivers/infiniband/hw/nes/nes_cm.c | 6 +++++- drivers/infiniband/hw/nes/nes_cm.h | 1 + 2 files changed, 6 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c index 851d62d..bcfd3c6 100644 --- a/drivers/infiniband/hw/nes/nes_cm.c +++ b/drivers/infiniband/hw/nes/nes_cm.c @@ -541,6 +541,7 @@ static void nes_cm_timer_tick(unsigned long pass) struct list_head *list_node; struct nes_cm_core *cm_core = g_cm_core; u32 settimer = 0; + unsigned long timetosend; int ret = NETDEV_TX_OK; struct list_head timer_list; @@ -645,8 +646,11 @@ static void nes_cm_timer_tick(unsigned long pass) send_entry->retrycount); if (send_entry->send_retrans) { send_entry->retranscount--; + timetosend = (NES_RETRY_TIMEOUT << + (NES_DEFAULT_RETRANS - send_entry->retranscount)); + send_entry->timetosend = jiffies + - NES_RETRY_TIMEOUT; + min(timetosend, NES_MAX_TIMEOUT); if (nexttimeout > send_entry->timetosend || !settimer) { nexttimeout = send_entry->timetosend; diff --git a/drivers/infiniband/hw/nes/nes_cm.h b/drivers/infiniband/hw/nes/nes_cm.h index 80bba18..8b7e7c0 100644 --- a/drivers/infiniband/hw/nes/nes_cm.h +++ b/drivers/infiniband/hw/nes/nes_cm.h @@ -149,6 +149,7 @@ struct nes_timer_entry { #endif #define NES_SHORT_TIME (10) #define NES_LONG_TIME (2000*HZ/1000) +#define NES_MAX_TIMEOUT ((unsigned long) (12*HZ)) #define NES_CM_HASHTABLE_SIZE 1024 #define NES_CM_TCP_TIMER_INTERVAL 3000 -- 1.5.3.3 From faisal.latif at intel.com Wed Apr 22 12:09:58 2009 From: faisal.latif at intel.com (Faisal Latif) Date: Wed, 22 Apr 2009 14:09:58 -0500 Subject: [ofa-general] [PATCH 4/4] RDMA/nes: Fix hang issues for large cluster dynamic connections Message-ID: <20090422190958.GA21652@flatif-MOBL> When running a large cluster setup, we hang after many hours of testing. The fix required going over the code and making sure the rexmit entry is properly removed based on the cm_node's state and the packet received.
Also, when receiving a FIN packet, make sure the seq# is valid and there were no errors before calling handle_fin_pkt(). Following are the changes done in nes_cm.c. * handle_ack_pkt() needs to return an error value so that, in case of error, handle_fin_pkt() is not called. Some cleanup was done while going over the code. * handle_rst_pkt(): handling of cm_node's NES_CM_STATE_LAST_ACK was missing. * process_packet(): when a FIN-only packet is received, call check_seq() before processing. * In handle_fin_pkt(), we were calling cleanup_retrans_entry() for all conditions, even if the packet needs to be dropped. Signed-off-by: Faisal Latif --- drivers/infiniband/hw/nes/nes_cm.c | 56 ++++++++++++++++-------------------- 1 files changed, 25 insertions(+), 31 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c index bcfd3c6..d1577a4 100644 --- a/drivers/infiniband/hw/nes/nes_cm.c +++ b/drivers/infiniband/hw/nes/nes_cm.c @@ -1330,18 +1330,20 @@ static void handle_fin_pkt(struct nes_cm_node *cm_node) nes_debug(NES_DBG_CM, "Received FIN, cm_node = %p, state = %u. " "refcnt=%d\n", cm_node, cm_node->state, atomic_read(&cm_node->ref_count)); - cm_node->tcp_cntxt.rcv_nxt++; - cleanup_retrans_entry(cm_node); switch (cm_node->state) { case NES_CM_STATE_SYN_RCVD: case NES_CM_STATE_SYN_SENT: case NES_CM_STATE_ESTABLISHED: case NES_CM_STATE_MPAREQ_SENT: case NES_CM_STATE_MPAREJ_RCVD: + cm_node->tcp_cntxt.rcv_nxt++; + cleanup_retrans_entry(cm_node); cm_node->state = NES_CM_STATE_LAST_ACK; send_fin(cm_node, NULL); break; case NES_CM_STATE_FIN_WAIT1: + cm_node->tcp_cntxt.rcv_nxt++; + cleanup_retrans_entry(cm_node); cm_node->state = NES_CM_STATE_CLOSING; send_ack(cm_node, NULL); /* Wait for ACK as this is simultanous close.. @@ -1349,11 +1351,15 @@ static void handle_fin_pkt(struct nes_cm_node *cm_node) * Just rm the node.. Done.. 
*/ break; case NES_CM_STATE_FIN_WAIT2: + cm_node->tcp_cntxt.rcv_nxt++; + cleanup_retrans_entry(cm_node); cm_node->state = NES_CM_STATE_TIME_WAIT; send_ack(cm_node, NULL); schedule_nes_timer(cm_node, NULL, NES_TIMER_TYPE_CLOSE, 1, 0); break; case NES_CM_STATE_TIME_WAIT: + cm_node->tcp_cntxt.rcv_nxt++; + cleanup_retrans_entry(cm_node); cm_node->state = NES_CM_STATE_CLOSED; rem_ref_cm_node(cm_node->cm_core, cm_node); break; @@ -1389,7 +1395,6 @@ static void handle_rst_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb, passive_state = atomic_add_return(1, &cm_node->passive_state); if (passive_state == NES_SEND_RESET_EVENT) create_event(cm_node, NES_CM_EVENT_RESET); - cleanup_retrans_entry(cm_node); cm_node->state = NES_CM_STATE_CLOSED; dev_kfree_skb_any(skb); break; @@ -1403,17 +1408,16 @@ static void handle_rst_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb, active_open_err(cm_node, skb, reset); break; case NES_CM_STATE_CLOSED: - cleanup_retrans_entry(cm_node); drop_packet(skb); break; + case NES_CM_STATE_LAST_ACK: + cm_node->cm_id->rem_ref(cm_node->cm_id); case NES_CM_STATE_TIME_WAIT: - cleanup_retrans_entry(cm_node); cm_node->state = NES_CM_STATE_CLOSED; rem_ref_cm_node(cm_node->cm_core, cm_node); drop_packet(skb); break; case NES_CM_STATE_FIN_WAIT1: - cleanup_retrans_entry(cm_node); nes_debug(NES_DBG_CM, "Bad state %s[%u]\n", __func__, __LINE__); default: drop_packet(skb); @@ -1460,6 +1464,7 @@ static void handle_rcv_mpa(struct nes_cm_node *cm_node, struct sk_buff *skb) NES_PASSIVE_STATE_INDICATED); break; case NES_CM_STATE_MPAREQ_SENT: + cleanup_retrans_entry(cm_node); if (res_type == NES_MPA_REQUEST_REJECT) { type = NES_CM_EVENT_MPA_REJECT; cm_node->state = NES_CM_STATE_MPAREJ_RCVD; @@ -1657,49 +1662,39 @@ static void handle_synack_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb, } } -static void handle_ack_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb, +static int handle_ack_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb, struct tcphdr 
*tcph) { int datasize = 0; u32 inc_sequence; u32 rem_seq_ack; u32 rem_seq; - int ret; + int ret = 0; int optionsize; optionsize = (tcph->doff << 2) - sizeof(struct tcphdr); if (check_seq(cm_node, tcph, skb)) - return; + return -EINVAL; skb_pull(skb, tcph->doff << 2); inc_sequence = ntohl(tcph->seq); rem_seq = ntohl(tcph->seq); rem_seq_ack = ntohl(tcph->ack_seq); datasize = skb->len; - cleanup_retrans_entry(cm_node); switch (cm_node->state) { case NES_CM_STATE_SYN_RCVD: /* Passive OPEN */ + cleanup_retrans_entry(cm_node); ret = handle_tcp_options(cm_node, tcph, skb, optionsize, 1); if (ret) break; cm_node->tcp_cntxt.rem_ack_num = ntohl(tcph->ack_seq); - if (cm_node->tcp_cntxt.rem_ack_num != - cm_node->tcp_cntxt.loc_seq_num) { - nes_debug(NES_DBG_CM, "rem_ack_num != loc_seq_num\n"); - cleanup_retrans_entry(cm_node); - send_reset(cm_node, skb); - return; - } cm_node->state = NES_CM_STATE_ESTABLISHED; - cleanup_retrans_entry(cm_node); if (datasize) { cm_node->tcp_cntxt.rcv_nxt = inc_sequence + datasize; handle_rcv_mpa(cm_node, skb); - } else { /* rcvd ACK only */ + } else /* rcvd ACK only */ dev_kfree_skb_any(skb); - cleanup_retrans_entry(cm_node); - } break; case NES_CM_STATE_ESTABLISHED: /* Passive OPEN */ @@ -1711,15 +1706,12 @@ static void handle_ack_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb, drop_packet(skb); break; case NES_CM_STATE_MPAREQ_SENT: - cleanup_retrans_entry(cm_node); cm_node->tcp_cntxt.rem_ack_num = ntohl(tcph->ack_seq); if (datasize) { cm_node->tcp_cntxt.rcv_nxt = inc_sequence + datasize; handle_rcv_mpa(cm_node, skb); - } else { /* Could be just an ack pkt.. */ - cleanup_retrans_entry(cm_node); + } else /* Could be just an ack pkt.. 
*/ dev_kfree_skb_any(skb); - } break; case NES_CM_STATE_LISTENING: case NES_CM_STATE_CLOSED: @@ -1727,11 +1719,10 @@ static void handle_ack_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb, send_reset(cm_node, skb); break; case NES_CM_STATE_LAST_ACK: + case NES_CM_STATE_CLOSING: cleanup_retrans_entry(cm_node); cm_node->state = NES_CM_STATE_CLOSED; cm_node->cm_id->rem_ref(cm_node->cm_id); - case NES_CM_STATE_CLOSING: - cleanup_retrans_entry(cm_node); rem_ref_cm_node(cm_node->cm_core, cm_node); drop_packet(skb); break; @@ -1746,9 +1737,11 @@ static void handle_ack_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb, case NES_CM_STATE_MPAREQ_RCVD: case NES_CM_STATE_UNKNOWN: default: + cleanup_retrans_entry(cm_node); drop_packet(skb); break; } + return ret; } @@ -1854,6 +1847,7 @@ static void process_packet(struct nes_cm_node *cm_node, struct sk_buff *skb, enum nes_tcpip_pkt_type pkt_type = NES_PKT_TYPE_UNKNOWN; struct tcphdr *tcph = tcp_hdr(skb); u32 fin_set = 0; + int ret = 0; skb_pull(skb, ip_hdr(skb)->ihl << 2); nes_debug(NES_DBG_CM, "process_packet: cm_node=%p state =%d syn=%d " @@ -1879,17 +1873,17 @@ static void process_packet(struct nes_cm_node *cm_node, struct sk_buff *skb, handle_synack_pkt(cm_node, skb, tcph); break; case NES_PKT_TYPE_ACK: - handle_ack_pkt(cm_node, skb, tcph); - if (fin_set) + ret = handle_ack_pkt(cm_node, skb, tcph); + if (fin_set && !ret) handle_fin_pkt(cm_node); break; case NES_PKT_TYPE_RST: handle_rst_pkt(cm_node, skb, tcph); break; default: - drop_packet(skb); - if (fin_set) + if ((fin_set) && (!check_seq(cm_node, tcph, skb))) handle_fin_pkt(cm_node); + drop_packet(skb); break; } } -- 1.5.3.3 From Jeffrey.C.Becker at nasa.gov Wed Apr 22 14:59:52 2009 From: Jeffrey.C.Becker at nasa.gov (Jeff Becker) Date: Wed, 22 Apr 2009 14:59:52 -0700 Subject: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition] In-Reply-To: <49EF4DB0.6010505@nasa.gov> References: <49EF2CA1.50408@ext.bull.net> <20090422145434.GB18072@opengridcomputing.com> 
<49EF4DB0.6010505@nasa.gov> Message-ID: <49EF9358.10400@nasa.gov> Salut Celine! Jeff Becker wrote: > Jon Mason wrote: > >> On Wed, Apr 22, 2009 at 04:41:37PM +0200, Celine Bourde wrote: >> >> >>>> On the server did you remember to: >>>> echo rdma 2050 > /proc/fs/nfsd/portlist >>>> >>>> >>> Yes. >>> >>> My nfs utils version : >>> >>> [root at my_host]#/sbin/mount.nfs -V >>> mount.nfs (linux nfs-utils 1.1.4) >>> >>> If I try mount.nfs command : >>> >>> [root at my_host ~]# strace mount.nfs 192.168.0.215:/vol0 /mnt/ -o rdma,port=2050 >>> >>> >> Does it work without rdma? >> >> > > Since I have 2.6.27.3 installed on my test machines, I'll reboot and > check it out. > > -jeff > > I was able to bring up an RDMA mount on 2.6.27.3, and run the basic connectathon tests successfully. One difference, is my nfs-utils is 1.1.6 but I doubt that should matter. -jeff >> >> >>> [..] >>> socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 3 >>> bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 >>> fcntl(3, F_GETFL) = 0x2 (flags O_RDWR) >>> fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 >>> connect(3, {sa_family=AF_INET, sin_port=htons(997), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 >>> fcntl(3, F_SETFL, O_RDWR) = 0 >>> connect(3, {sa_family=AF_UNSPEC, sa_data="\0o\177\0\0\1\0\0\0\0\0\0\0\0"}, 16) = 0 >>> sendto(3, "uw\211\257\0\0\0\0\0\0\0\2\0\1\206\270\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0"..., 40, 0, {sa_family=AF_INET, sin_port=htons(997), sin_addr=inet_addr("127.0.0.1")}, 16) = 40 >>> poll([{fd=3, events=POLLIN}], 1, 3000) = 1 ([{fd=3, revents=POLLIN}]) >>> recvfrom(3, "uw\211\257\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 400, MSG_DONTWAIT, {sa_family=AF_INET, sin_port=htons(997), sin_addr=inet_addr("127.0.0.1")}, [16]) = 24 >>> close(3) = 0 >>> mount("192.168.0.215:/vol0", "/mnt", "nfs", 0, "rdma,port=2050,addr=192.168.0.21" >>> >>> and it blocks again. >>> >>> >>> >>>>> ? 
Also, make sure that none of the stock 2.6.27 NFS modules are loaded - >>>>> they should all >>>>> be from OFED 1.4.1 >>>>> >>>>> >>> All nfs modules (OFED 1.4.1) are loaded (and dependencies). >>> Modules are listed below -> the updates directory means a module >>> comes from the OFED version and not the kernel version >>> >>> [root at my_host ~]# modinfo ib_ipoib >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko >>> >>> [root at my_host ~]# modinfo ib_sa >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_sa.ko >>> >>> [root at my_host ~]# modinfo iw_cm >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/iw_cm.ko >>> >>> [root at my_host ~]# modinfo ib_addr >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_addr.ko >>> >>> [root at my_host ~]# modinfo mlx4_core >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/net/mlx4/mlx4_core.ko >>> >>> [root at my_host ~]# modinfo ib_core >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_core.ko >>> >>> [root at my_host ~]# modinfo ib_cm >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/ib_cm.ko >>> >>> [root at my_host ~]# modinfo rdma_cm >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/drivers/infiniband/core/rdma_cm.ko >>> >>> [root at my_host ~]# modinfo xprtrdma >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/xprtrdma/xprtrdma.ko >>> >>> [root at my_host ~]# modinfo svcrdma >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/xprtrdma/svcrdma.ko >>> >>> [root at my_host ofa_kernel-1.4.1]# modinfo nfsd >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/fs/nfsd/nfsd.ko >>> >>> [root at my_host ofa_kernel-1.4.1]# modinfo lockd >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/fs/lockd/lockd.ko >>> >>> [root at my_host
ofa_kernel-1.4.1]# modinfo nfs_acl >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/fs/nfs_common/nfs_acl.ko >>> >>> [root at my_host ofa_kernel-1.4.1]# modinfo sunrpc >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/sunrpc.ko >>> >>> [root at my_host ofa_kernel-1.4.1]# modinfo exportfs >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/fs/exportfs/exportfs.ko >>> >>> [root at my_host ofa_kernel-1.4.1]# modinfo auth_rpcgss >>> filename: /lib/modules/2.6.27_ofa_compil/updates/kernel/net/sunrpc/auth_gss/auth_rpcgss.ko >>> >>> Any other ideas? >>> >>> >> Does dmesg say anything interesting? >> >> >> >>> Thanks. >>> >>> Céline. >>> >>> >>> >>> >>> >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >>> >>> > From swise at opengridcomputing.com Wed Apr 22 16:34:08 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 22 Apr 2009 18:34:08 -0500 Subject: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition] In-Reply-To: <49EF9358.10400@nasa.gov> References: <49EF2CA1.50408@ext.bull.net> <20090422145434.GB18072@opengridcomputing.com> <49EF4DB0.6010505@nasa.gov> <49EF9358.10400@nasa.gov> Message-ID: <49EFA970.8020601@opengridcomputing.com> > I was able to bring up an RDMA mount on 2.6.27.3, and run the basic > connectathon > tests successfully. One difference, is my nfs-utils is 1.1.6 but I doubt > that should > matter. > > -jeff > > Celine, can you establish regular rdma connections over your IB link? Like rping? Steve.
From weiny2 at llnl.gov Wed Apr 22 17:03:07 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 22 Apr 2009 17:03:07 -0700 Subject: [ofa-general] libibmad: ib_resolve_portid_str_via; Bug? Message-ID: <20090422170307.548e8a55.weiny2@llnl.gov> Sasha, Below is a patch which fixes an issue I had when using ib_resolve_portid_str_via. When resolving via IB_DEST_GUID the ib_resolve_guid_via function optionally uses the portid to attempt to set a different subnet prefix. IMO I don't think portid should be an in/out parameter in ib_resolve_portid_str_via. I happened to pass a portid object which was on the stack and had some garbage data in it. It took me a while to figure out that ib_resolve_portid_str_via was attempting to use that garbage data. To make this more clear I added ib_resolve_gid_via and another MAD_DEST type. What do you think? Right now the gid resolving is untested. Ira diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index b6f4b60..ec76d0f 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -677,6 +677,7 @@ enum MAD_DEST { IB_DEST_DRPATH, IB_DEST_GUID, IB_DEST_DRSLID, + IB_DEST_GID }; enum MAD_NODE_TYPE { @@ -861,6 +862,9 @@ MAD_EXPORT int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, MAD_EXPORT int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, ib_portid_t * sm_id, int timeout, const struct ibmad_port *srcport); +MAD_EXPORT int ib_resolve_gid_via(ib_portid_t * portid, ibmad_gid_t gid, + ib_portid_t * sm_id, int timeout, + const struct ibmad_port *srcport); MAD_EXPORT int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, enum MAD_DEST dest, ib_portid_t * sm_id, const struct ibmad_port *srcport); diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map index 6b77784..0ce3957 100644 --- a/libibmad/src/libibmad.map +++ b/libibmad/src/libibmad.map @@ -100,6 +100,7 @@ IBMAD_1.3 { ib_path_query_via; ib_resolve_smlid_via; ib_resolve_guid_via; + 
ib_resolve_gid_via; ib_resolve_portid_str_via; ib_resolve_self_via; mad_field_name; diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c index f34c247..4d40b2b 100644 --- a/libibmad/src/resolve.c +++ b/libibmad/src/resolve.c @@ -68,6 +68,26 @@ int ib_resolve_smlid(ib_portid_t * sm_id, int timeout) return ib_resolve_smlid_via(sm_id, timeout, ibmp); } +int ib_resolve_gid_via(ib_portid_t * portid, ibmad_gid_t gid, + ib_portid_t * sm_id, int timeout, + const struct ibmad_port *srcport) +{ + ib_portid_t sm_portid; + char buf[IB_SA_DATA_SIZE] = { 0 }; + + if (!sm_id) { + sm_id = &sm_portid; + if (ib_resolve_smlid_via(sm_id, timeout, srcport) < 0) + return -1; + } + + if ((portid->lid = + ib_path_query_via(srcport, gid, gid, sm_id, buf)) < 0) + return -1; + + return 0; +} + int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, ib_portid_t * sm_id, int timeout, const struct ibmad_port *srcport) @@ -80,11 +100,9 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, if (ib_resolve_smlid_via(sm_id, timeout, srcport) < 0) return -1; } - if (*(uint64_t *) & portid->gid == 0) - mad_set_field64(portid->gid, 0, IB_GID_PREFIX_F, - IB_DEFAULT_SUBN_PREFIX); - if (guid) - mad_set_field64(portid->gid, 0, IB_GID_GUID_F, *guid); + +
mad_set_field64(portid->gid, 0, IB_GID_PREFIX_F, IB_DEFAULT_SUBN_PREFIX); + mad_set_field64(portid->gid, 0, IB_GID_GUID_F, *guid); if ((portid->lid = ib_path_query_via(srcport, portid->gid, portid->gid, sm_id, @@ -98,12 +116,15 @@ int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, enum MAD_DEST dest_type, ib_portid_t * sm_id, const struct ibmad_port *srcport) { + ibmad_gid_t gid; uint64_t guid; int lid; char *routepath; ib_portid_t selfportid = { 0 }; int selfport = 0; + memset(portid, 0, sizeof *portid); + switch (dest_type) { case IB_DEST_LID: lid = strtol(addr_str, 0, 0); @@ -138,6 +159,10 @@ int ib_resolve_portid_str_via(ib_portid_t * portid, char *addr_str, return -1; return 0; + case IB_DEST_GID: + if (inet_pton(AF_INET6, addr_str, &gid) <= 0) + return -1; + return ib_resolve_gid_via(portid, gid, sm_id, 0, srcport); default: IBWARN("bad dest_type %d", dest_type); } From weiny2 at llnl.gov Wed Apr 22 18:54:41 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 22 Apr 2009 18:54:41 -0700 Subject: [ofa-general] [PATCH 0/5] Follow on patch series to libibnetdisc including converting ibqueryerrors.pl Message-ID: <20090422185441.6f8601dc.weiny2@llnl.gov> Sasha, Here are follow up patches to the first 3 libibnetdisc patches. These apply to the pq/ibn3 branch. The first 4 are changes needed to convert ibqueryerrors.pl to C. The 5th is a small patch which fixes a couple of bugs in ibnetdiscover and iblinkinfo I found along the way. When do you plan to merge pq/ibn3? I noticed you don't have Sean's patches in there yet. 
Thanks, Ira -- Ira Weiny Math Programmer/Computer Scientist Lawrence Livermore National Lab weiny2 at llnl.gov From weiny2 at llnl.gov Wed Apr 22 18:54:44 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 22 Apr 2009 18:54:44 -0700 Subject: [ofa-general] [PATCH 1/5] change ibnd_discover_fabric to receive ibmad_port Message-ID: <20090422185444.d22f1f84.weiny2@llnl.gov> >From 05d6ab1d016e0d1d79db47365bb897a387b68d46 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Wed, 22 Apr 2009 18:44:17 -0700 Subject: [PATCH] change ibnd_discover_fabric to receive ibmad_port In order to allow ibmad_port to be opened with additional classes libibnetdisc should accept an ibmad_port as a parameter. The library will error out if the classes it needs are not opened. Signed-off-by: Ira Weiny --- .../libibnetdisc/include/infiniband/ibnetdisc.h | 6 +- .../libibnetdisc/man/ibnd_discover_fabric.3 | 21 ++++++-- infiniband-diags/libibnetdisc/src/ibnetdisc.c | 54 +++++++++----------- infiniband-diags/libibnetdisc/src/internal.h | 1 - infiniband-diags/libibnetdisc/test/testleaks.c | 20 ++++++-- 5 files changed, 60 insertions(+), 42 deletions(-) diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h index a882994..7eaca24 100644 --- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h @@ -124,6 +124,7 @@ typedef struct chassis { * Main fabric object which is returned and represents the data discovered */ typedef struct ib_fabric { + struct ibmad_port *ibmad_port; /* the node the discover was initiated from * "from" parameter in ibnd_discover_fabric * or by default the node you ar running on @@ -143,11 +144,10 @@ typedef struct ib_fabric { void ibnd_debug(int i); void ibnd_show_progress(int i); -ibnd_fabric_t *ibnd_discover_fabric(char *dev_name, int dev_port, +ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port *ibmad_port, int 
timeout_ms, ib_portid_t *from, int hops); /** - * dev_name: (required) local device name to use to access the fabric - * dev_port: (required) local device port to use to access the fabric + * ibmad_port: (required) ibmad_port object from libibmad * timeout_ms: (required) gives the timeout for a _SINGLE_ query on * the fabric. So if there are multiple nodes not * responding this may result in a lengthy delay. diff --git a/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 b/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 index 44d8c65..c832c11 100644 --- a/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 +++ b/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 @@ -5,7 +5,7 @@ ibnd_discover_fabric, ibnd_destroy_fabric, ibnd_debug ibnd_show_progress \- init .nf .B #include .sp -.BI "ibnd_fabric_t *ibnd_discover_fabric(char *dev_name, int dev_port, int timeout_ms, ib_portid_t *from, int hops)" +.BI "ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms, ib_portid_t *from, int hops)" .BI "void ibnd_destroy_fabric(ibnd_fabric_t *fabric)" .BI "void ibnd_debug(int i)" .BI "void ibnd_show_progress(int i)" @@ -13,7 +13,10 @@ ibnd_discover_fabric, ibnd_destroy_fabric, ibnd_debug ibnd_show_progress \- init .SH "DESCRIPTION" .B ibnd_discover_fabric() -Discover the fabric connected to the port specified by dev_name and dev_port, using a timeout specified. The "from" and "hops" parameters are optional and allow one to scan part of a fabric by specifying a node "from" and a number of hops away from that node to scan, "hops". This gives the user a "sub-fabric" which is "centered" anywhere they chose. +Discover the fabric connected to the port specified by ibmad_port, using a timeout specified. The "from" and "hops" parameters are optional and allow one to scan part of a fabric by specifying a node "from" and a number of hops away from that node to scan, "hops".
This gives the user a "sub-fabric" which is "centered" anywhere they chose. + +ibmad_port must be opened with at least IB_SMI_CLASS and IB_SMI_DIRECT_CLASS +classes for ibnd_discover_fabric to work. .B ibnd_destroy_fabric() free all memory and resources associated with the fabric. @@ -36,13 +39,23 @@ NONE .B Discover the entire fabric connected to device "mthca0", port 1. - ibnd_discover_fabric("mthca0", 1, 100, NULL, 0); + int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; + struct ibmad_port *ibmad_port = mad_rpc_open_port(ca, ca_port, mgmt_classes, 2); + ibnd_fabric_t *fabric = ibnd_discover_fabric(ibmad_port, 100, NULL, 0); + ... + ibnd_destroy_fabric(fabric); + mad_rpc_close_port(ibmad_port); .B Discover only a single node and those nodes connected to it. + ... str2drpath(&(port_id.drpath), from, 0, 0); + ... + ibnd_discover_fabric(ibmad_port, 100, &port_id, 1); + ... - ibnd_discover_fabric("mthca0", 1, 100, &port_id, 1); +.SH "SEE ALSO" + libibmad, mad_rpc_open_port .SH "AUTHORS" .TP diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index 479bae7..f7e4ae2 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -79,7 +79,7 @@ get_port_info(struct ibnd_fabric *fabric, struct ibnd_port *port, IB_PORT_LINK_SPEED_ACTIVE_F); if (!smp_query_via(port->port.info, portid, IB_ATTR_PORT_INFO, portnum, timeout_ms, - fabric->ibmad_port)) + fabric->fabric.ibmad_port)) return -1; decode_port_info(&(port->port)); @@ -100,7 +100,7 @@ static int query_node_info(struct ibnd_fabric *fabric, struct ibnd_node *node, ib_portid_t *portid) { if (!smp_query_via(&(node->node.info), portid, IB_ATTR_NODE_INFO, 0, timeout_ms, - fabric->ibmad_port)) + fabric->fabric.ibmad_port)) return -1; /* decode just a couple of fields for quicker reference. 
*/ @@ -130,11 +130,11 @@ query_node(struct ibnd_fabric *fabric, struct ibnd_node *inode, port->guid = mad_get_field64(node->info, 0, IB_NODE_PORT_GUID_F); if (!smp_query_via(nd, portid, IB_ATTR_NODE_DESC, 0, timeout_ms, - fabric->ibmad_port)) + fabric->fabric.ibmad_port)) return -1; if (!smp_query_via(port->info, portid, IB_ATTR_PORT_INFO, 0, timeout_ms, - fabric->ibmad_port)) + fabric->fabric.ibmad_port)) return -1; decode_port_info(port); @@ -146,7 +146,7 @@ query_node(struct ibnd_fabric *fabric, struct ibnd_node *inode, /* after we have the sma information find out the real PortInfo for this port */ if (!smp_query_via(port->info, portid, IB_ATTR_PORT_INFO, port->portnum, timeout_ms, - fabric->ibmad_port)) + fabric->fabric.ibmad_port)) return -1; decode_port_info(port); @@ -154,7 +154,7 @@ query_node(struct ibnd_fabric *fabric, struct ibnd_node *inode, port->lmc = node->smalmc; if (!smp_query_via(node->switchinfo, portid, IB_ATTR_SWITCH_INFO, 0, timeout_ms, - fabric->ibmad_port)) + fabric->fabric.ibmad_port)) node->smaenhsp0 = 0; /* assume base SP0 */ else mad_decode_field(node->switchinfo, IB_SW_ENHANCED_PORT0_F, &node->smaenhsp0); @@ -241,7 +241,7 @@ ibnd_update_node(ibnd_node_t *node) return (NULL); if (!smp_query_via(nd, &(n->node.path_portid), IB_ATTR_NODE_DESC, 0, timeout_ms, - f->ibmad_port)) + f->fabric.ibmad_port)) return (NULL); /* update all the port info's */ @@ -253,14 +253,14 @@ ibnd_update_node(ibnd_node_t *node) goto done; if (!smp_query_via(portinfo_port0, &(n->node.path_portid), IB_ATTR_PORT_INFO, 0, timeout_ms, - f->ibmad_port)) + f->fabric.ibmad_port)) return (NULL); n->node.smalid = mad_get_field(portinfo_port0, 0, IB_PORT_LID_F); n->node.smalmc = mad_get_field(portinfo_port0, 0, IB_PORT_LMC_F); if (!smp_query_via(node->switchinfo, &(n->node.path_portid), IB_ATTR_SWITCH_INFO, 0, timeout_ms, - f->ibmad_port)) + f->fabric.ibmad_port)) node->smaenhsp0 = 0; /* assume base SP0 */ else mad_decode_field(node->switchinfo, IB_SW_ENHANCED_PORT0_F, 
&n->node.smaenhsp0); @@ -476,17 +476,8 @@ get_remote_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_ return 0; } -static void * -ibnd_init_port(char *dev_name, int dev_port) -{ - int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; - - /* Crank up the mad lib */ - return (mad_rpc_open_port(dev_name, dev_port, mgmt_classes, 2)); -} - ibnd_fabric_t * -ibnd_discover_fabric(char *dev_name, int dev_port, int timeout_ms, +ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms, ib_portid_t *from, int hops) { struct ibnd_fabric *fabric = NULL; @@ -500,15 +491,27 @@ ibnd_discover_fabric(char *dev_name, int dev_port, int timeout_ms, ib_portid_t *path; int max_hops = MAXHOPS-1; /* default find everything */ + if (!ibmad_port) { + IBPANIC("ibmad_port must be specified to " + "ibnd_discover_fabric\n"); + return (NULL); + } + if (mad_rpc_class_agent(ibmad_port, IB_SMI_CLASS) == -1 + || + mad_rpc_class_agent(ibmad_port, IB_SMI_DIRECT_CLASS) == -1) { + IBPANIC("ibmad_port must be opened with " + "IB_SMI_CLASS && IB_SMI_DIRECT_CLASS\n"); + return (NULL); + } + /* if not everything how much? 
*/ if (hops >= 0) { max_hops = hops; } /* If not specified start from "my" port */ - if (!from) { + if (!from) from = &my_portid; - } fabric = malloc(sizeof(*fabric)); @@ -519,12 +522,7 @@ ibnd_discover_fabric(char *dev_name, int dev_port, int timeout_ms, memset(fabric, 0, sizeof(*fabric)); - fabric->ibmad_port = ibnd_init_port(dev_name, dev_port); - if (!fabric->ibmad_port) { - IBPANIC("OOM: failed to open \"%s\" port %d\n", - dev_name, dev_port); - goto error; - } + fabric->fabric.ibmad_port = ibmad_port; IBND_DEBUG("from %s\n", portid2str(from)); @@ -633,8 +631,6 @@ ibnd_destroy_fabric(ibnd_fabric_t *fabric) node = next; } } - if (f->ibmad_port) - mad_rpc_close_port(f->ibmad_port); free(f); } diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h index afed25e..4e6bb18 100644 --- a/infiniband-diags/libibnetdisc/src/internal.h +++ b/infiniband-diags/libibnetdisc/src/internal.h @@ -79,7 +79,6 @@ struct ibnd_fabric { ibnd_fabric_t fabric; /* internal use only */ - void *ibmad_port; struct ibnd_node *nodestbl[HTSZ]; struct ibnd_port *portstbl[HTSZ]; struct ibnd_node *nodesdist[MAXHOPS+1]; diff --git a/infiniband-diags/libibnetdisc/test/testleaks.c b/infiniband-diags/libibnetdisc/test/testleaks.c index 1fabaac..0d009c3 100644 --- a/infiniband-diags/libibnetdisc/test/testleaks.c +++ b/infiniband-diags/libibnetdisc/test/testleaks.c @@ -84,6 +84,7 @@ usage(void) int main(int argc, char **argv) { + int rc = 0; char *ca = 0; int ca_port = 0; ibnd_fabric_t *fabric = NULL; @@ -94,6 +95,9 @@ main(int argc, char **argv) ib_portid_t port_id; int iters = -1; + struct ibmad_port *ibmad_port; + int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; + static char const str_opts[] = "S:D:n:C:P:t:shuf:i:"; static const struct option long_opts[] = { { "S", 1, 0, 'S'}, @@ -155,25 +159,31 @@ main(int argc, char **argv) argc -= optind; argv += optind; + ibmad_port = mad_rpc_open_port(ca, ca_port, mgmt_classes, 2); + while (iters == 
-1 || iters-- > 0) { if (from) { /* only scan part of the fabric */ str2drpath(&(port_id.drpath), from, 0, 0); - if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, + if ((fabric = ibnd_discover_fabric(ibmad_port, timeout_ms, &port_id, hops)) == NULL) { fprintf(stderr, "discover failed\n"); - exit(1); + rc = 1; + goto close_port; } guid = 0; } else { - if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, NULL, -1)) == NULL) { + if ((fabric = ibnd_discover_fabric(ibmad_port, timeout_ms, NULL, -1)) == NULL) { fprintf(stderr, "discover failed\n"); - exit(1); + rc = 1; + goto close_port; } } ibnd_destroy_fabric(fabric); } - exit(0); +close_port: + mad_rpc_close_port(ibmad_port); + exit(rc); } -- 1.5.4.5 From weiny2 at llnl.gov Wed Apr 22 18:54:46 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 22 Apr 2009 18:54:46 -0700 Subject: [ofa-general] [PATCH 2/5] Convert ibnetdiscover and iblinkinfo to use the new interface to libibnetdisc Message-ID: <20090422185446.056b7355.weiny2@llnl.gov> >From 3b4e97a345e8a0758cb9ef6de6517de61922831c Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Wed, 22 Apr 2009 18:44:17 -0700 Subject: [PATCH] Convert ibnetdiscover and iblinkinfo to use the new interface to libibnetdisc Signed-off-by: Ira Weiny --- infiniband-diags/src/iblinkinfo.c | 24 +++++++++++++++++++----- infiniband-diags/src/ibnetdiscover.c | 14 ++++++++++---- 2 files changed, 29 insertions(+), 9 deletions(-) diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c index 1e43788..16728cb 100644 --- a/infiniband-diags/src/iblinkinfo.c +++ b/infiniband-diags/src/iblinkinfo.c @@ -255,6 +255,7 @@ usage(void) int main(int argc, char **argv) { + int rc = 0; char *ca = 0; int ca_port = 0; ibnd_fabric_t *fabric = NULL; @@ -264,6 +265,9 @@ main(int argc, char **argv) int hops = 0; ib_portid_t port_id; + struct ibmad_port *ibmad_port; + int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; + static char const str_opts[] = 
"S:D:n:C:P:t:sldgphuf:R"; static const struct option long_opts[] = { { "S", 1, 0, 'S'}, @@ -352,20 +356,28 @@ main(int argc, char **argv) if (argc && !(f = fopen(argv[0], "w"))) fprintf(stderr, "can't open file %s for writing", argv[0]); + ibmad_port = mad_rpc_open_port(ca, ca_port, mgmt_classes, 2); + if (!ibmad_port) { + fprintf(stderr, "Failed to open %s port %d", ca, ca_port); + exit(1); + } + node_name_map = open_node_name_map(node_name_map_file); if (from) { /* only scan part of the fabric */ str2drpath(&(port_id.drpath), from, 0, 0); - if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, &port_id, hops)) == NULL) { + if ((fabric = ibnd_discover_fabric(ibmad_port, timeout_ms, &port_id, hops)) == NULL) { fprintf(stderr, "discover failed\n"); - exit(1); + rc = 1; + goto close_port; } guid = 0; } else { - if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, NULL, -1)) == NULL) { + if ((fabric = ibnd_discover_fabric(ibmad_port, timeout_ms, NULL, -1)) == NULL) { fprintf(stderr, "discover failed\n"); - exit(1); + rc = 1; + goto close_port; } } @@ -381,6 +393,8 @@ main(int argc, char **argv) ibnd_destroy_fabric(fabric); +close_port: close_node_name_map(node_name_map); - exit(0); + mad_rpc_close_port(ibmad_port); + exit(rc); } diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index 99750f0..4cd0b37 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -650,6 +650,9 @@ int main(int argc, char **argv) { ibnd_fabric_t *fabric = NULL; + struct ibmad_port *ibmad_port; + int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; + const struct ibdiag_opt opts[] = { { "show", 's', 0, NULL, "show more information" }, { "list", 'l', 0, NULL, "list of connected nodes" }, @@ -677,15 +680,17 @@ int main(int argc, char **argv) if (ibverbose) ibnd_debug(1); + ibmad_port = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2); + if (!ibmad_port) + IBERROR("Failed to open %s port %d", 
ibd_ca, ibd_ca_port); + if (argc && !(f = fopen(argv[0], "w"))) IBERROR("can't open file %s for writing", argv[0]); node_name_map = open_node_name_map(node_name_map_file); - if ((fabric = ibnd_discover_fabric(ibd_ca, ibd_ca_port, ibd_timeout, NULL, -1)) == NULL) { - fprintf(stderr, "discover failed\n"); - exit(1); - } + if ((fabric = ibnd_discover_fabric(ibmad_port, ibd_timeout, NULL, -1)) == NULL) + IBERROR("discover failed\n"); if (ports_report) ibnd_iter_nodes(fabric, @@ -698,5 +703,6 @@ int main(int argc, char **argv) ibnd_destroy_fabric(fabric); close_node_name_map(node_name_map); + mad_rpc_close_port(ibmad_port); exit(0); } -- 1.5.4.5 From weiny2 at llnl.gov Wed Apr 22 18:54:49 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 22 Apr 2009 18:54:49 -0700 Subject: [ofa-general] [PATCH 3/5] Add mad_field_name function Message-ID: <20090422185449.0ee00e7b.weiny2@llnl.gov> >From e812376dd0fd1368f536f1032b6035d5e01fa4ac Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Wed, 22 Apr 2009 18:44:17 -0700 Subject: [PATCH] Add mad_field_name function returns the "name" of the field specified Signed-off-by: Ira Weiny --- libibmad/include/infiniband/mad.h | 1 + libibmad/src/fields.c | 5 +++++ libibmad/src/libibmad.map | 1 + 3 files changed, 7 insertions(+), 0 deletions(-) diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index b8290a7..b6f4b60 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -722,6 +722,7 @@ MAD_EXPORT void mad_encode_field(uint8_t * buf, enum MAD_FIELDS field, void *val MAD_EXPORT int mad_print_field(enum MAD_FIELDS field, const char *name, void *val); MAD_EXPORT char *mad_dump_field(enum MAD_FIELDS field, char *buf, int bufsz, void *val); MAD_EXPORT char *mad_dump_val(enum MAD_FIELDS field, char *buf, int bufsz, void *val); +MAD_EXPORT const char *mad_field_name(enum MAD_FIELDS field); /* mad.c */ MAD_EXPORT void *mad_encode(void *buf, ib_rpc_t * rpc, ib_dr_path_t * drpath, 
diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c index df43ceb..02f2e75 100644 --- a/libibmad/src/fields.c +++ b/libibmad/src/fields.c @@ -686,3 +686,8 @@ char *mad_dump_val(enum MAD_FIELDS field, char *buf, int bufsz, void *val) return 0; return _mad_dump_val(ib_mad_f + field, buf, bufsz, val); } + +const char *mad_field_name(enum MAD_FIELDS field) +{ + return (ib_mad_f[field].name); +} diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map index 4306dbc..6b77784 100644 --- a/libibmad/src/libibmad.map +++ b/libibmad/src/libibmad.map @@ -102,5 +102,6 @@ IBMAD_1.3 { ib_resolve_guid_via; ib_resolve_portid_str_via; ib_resolve_self_via; + mad_field_name; local: *; }; -- 1.5.4.5 From weiny2 at llnl.gov Wed Apr 22 18:54:52 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 22 Apr 2009 18:54:52 -0700 Subject: [ofa-general] [PATCH 4/5] Convert ibqueryerrors.pl to C and use new ibnetdisc library. Message-ID: <20090422185452.56e6bfa0.weiny2@llnl.gov> >From f46430ec41406db8a6d0c3799d442a16db9ed8c0 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Wed, 22 Apr 2009 18:44:17 -0700 Subject: [PATCH] Convert ibqueryerrors.pl to C and use new ibnetdisc library. 
Signed-off-by: Ira Weiny --- infiniband-diags/Makefile.am | 5 +- infiniband-diags/configure.in | 1 + infiniband-diags/scripts/ibqueryerrors.pl | 230 ------------- infiniband-diags/scripts/ibqueryerrors.pl.in | 40 +++ infiniband-diags/src/ibqueryerrors.c | 469 ++++++++++++++++++++++++++ 5 files changed, 514 insertions(+), 231 deletions(-) delete mode 100755 infiniband-diags/scripts/ibqueryerrors.pl create mode 100755 infiniband-diags/scripts/ibqueryerrors.pl.in create mode 100644 infiniband-diags/src/ibqueryerrors.c diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am index 19b992c..503d573 100644 --- a/infiniband-diags/Makefile.am +++ b/infiniband-diags/Makefile.am @@ -12,7 +12,8 @@ endif sbin_PROGRAMS = src/ibaddr src/ibnetdiscover src/ibping src/ibportstate \ src/ibroute src/ibstat src/ibsysstat src/ibtracert \ src/perfquery src/sminfo src/smpdump src/smpquery \ - src/saquery src/vendstat src/iblinkinfo + src/saquery src/vendstat src/iblinkinfo \ + src/ibqueryerrors if ENABLE_TEST_UTILS sbin_PROGRAMS += src/ibsendtrap src/mcm_rereg_test @@ -59,6 +60,8 @@ src_vendstat_SOURCES = src/vendstat.c src_mcm_rereg_test_SOURCES = src/mcm_rereg_test.c src_iblinkinfo_SOURCES = src/iblinkinfo.c src_iblinkinfo_LDFLAGS = -L$(top_srcdir)/libibnetdisc -libnetdisc +src_ibqueryerrors_SOURCES = src/ibqueryerrors.c +src_ibqueryerrors_LDFLAGS = -L$(top_srcdir)/libibnetdisc -libnetdisc man_MANS = man/ibaddr.8 man/ibcheckerrors.8 man/ibcheckerrs.8 \ man/ibchecknet.8 man/ibchecknode.8 man/ibcheckport.8 \ diff --git a/infiniband-diags/configure.in b/infiniband-diags/configure.in index 4516dfa..ae492b8 100644 --- a/infiniband-diags/configure.in +++ b/infiniband-diags/configure.in @@ -167,6 +167,7 @@ AC_CONFIG_FILES([\ scripts/ibswitches \ scripts/ibrouters \ scripts/iblinkinfo.pl \ + scripts/ibqueryerrors.pl \ libibnetdisc/Makefile ]) AC_OUTPUT diff --git a/infiniband-diags/scripts/ibqueryerrors.pl b/infiniband-diags/scripts/ibqueryerrors.pl deleted file mode 100755 
index 99adac7..0000000 --- a/infiniband-diags/scripts/ibqueryerrors.pl +++ /dev/null @@ -1,230 +0,0 @@ -#!/usr/bin/perl -# -# Copyright (c) 2008 Voltaire, Inc. All rights reserved. -# Copyright (c) 2006 The Regents of the University of California. -# -# Produced at Lawrence Livermore National Laboratory. -# Written by Ira Weiny . -# -# This software is available to you under a choice of one of two -# licenses. You may choose to be licensed under the terms of the GNU -# General Public License (GPL) Version 2, available from the file -# COPYING in the main directory of this source tree, or the -# OpenIB.org BSD license below: -# -# Redistribution and use in source and binary forms, with or -# without modification, are permitted provided that the following -# conditions are met: -# -# - Redistributions of source code must retain the above -# copyright notice, this list of conditions and the following -# disclaimer. -# -# - Redistributions in binary form must reproduce the above -# copyright notice, this list of conditions and the following -# disclaimer in the documentation and/or other materials -# provided with the distribution. -# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, -# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF -# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND -# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS -# BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN -# ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN -# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. 
-# - -use strict; - -use Getopt::Std; -use IBswcountlimits; - -my $print_action = "no"; -my $report_port_info = undef; -my $single_switch = undef; -my $include_data_counters = undef; -my $cache_file = ""; -my $switch_found = "no"; - -# ========================================================================= -# -sub report_counts -{ - my $addr = $_[0]; - my $port = $_[1]; - my $ca_name = $_[2]; - my $ca_port = $_[3]; - my $extra_params = get_ca_name_port_param_string($ca_name, $ca_port); - - if (any_counts()) { - print(" GUID $addr port $port:"); - check_counters($print_action); - if ($include_data_counters) { - check_data_counters($print_action); - } - print("\n"); - - if ($report_port_info) { - my $lid = ""; - my $speed = ""; - my $width = ""; - my $data = `smpquery $extra_params -G portinfo $addr $port`; - my @lines = split("\n", $data); - foreach my $line (@lines) { - if ($line =~ /^# Port info: Lid (\w+) port.*/) { $lid = $1; } - if ($line =~ /^LinkSpeedActive:\.+(.*)/) { $speed = $1; } - if ($line =~ /^LinkWidthActive:\.+(.*)/) { $width = $1; } - } - my $hr = $IBswcountlimits::link_ends{"$addr"}{$port}; - if ($hr) { - printf( -" Link info: %6s %4s[%2s] ==(%3s %s)==> %18s %4s[%2s] \"%s\"\n", - $lid, $port, - $hr->{loc_ext_port}, $width, - $speed, $hr->{rem_guid}, - $hr->{rem_port}, $hr->{rem_ext_port}, - $hr->{rem_desc} - ); - } else { - printf( -" Link info: %6s %4s[ ] ==(%3s %s)==> (Disconnected)\n", - $lid, $port, $width, $speed); - } - } - } -} - -# ========================================================================= -# use perfquery to get the counters. 
-sub get_counts -{ - my $addr = $_[0]; - my $port = $_[1]; - my $ca_name = $_[2]; - my $ca_port = $_[3]; - my $extra_params = get_ca_name_port_param_string($ca_name, $ca_port); - - my $data = `perfquery $extra_params -G $addr $port` || - die "'perfquery $extra_params -G $addr $port' FAILED.\n"; - my @lines = split("\n", $data); - foreach my $line (@lines) { - foreach my $count (@IBswcountlimits::counters) { - if ($line =~ /^$count:\.+(\d+)/) { - $IBswcountlimits::cur_counts{$count} = $1; - } - } - } -} - -# ========================================================================= -# -my %switches = (); - -sub get_switches -{ - my $data = `ibswitches $cache_file` || - die "'ibswitches $cache_file' failed.\n"; - my @lines = split("\n", $data); - foreach my $line (@lines) { - if ($line =~ /^Switch\s+:\s+(\w+)\s+ports\s+(\d+)\s+.*/) { - $switches{$1} = $2; - } - } -} - -# ========================================================================= -# -sub usage_and_exit -{ - my $prog = $_[0]; - print -"Usage: $prog [-a -c -r -R -s -S -D -d -C -P ]\n"; - print " Report counters on all switches in subnet\n"; - print " -a Report an action to take\n"; - print " -c suppress some of the common counters\n"; - print " -r report port configuration information\n"; - print " -R Recalculate ibnetdiscover information\n"; - print " -s suppress errors listed\n"; - print -" -D output only the switch specified by direct route path\n"; - print " -S query only (hex format)\n"; - print " -d include the data counters in the output\n"; - print " -C use selected Channel Adaptor name for queries\n"; - print " -P use selected channel adaptor port for queries\n"; - exit 2; -} - -my $argv0 = `basename $0`; -my $regenerate_map = undef; -my $single_switch = undef; -my $direct_route = undef; -my $ca_name = ""; -my $ca_port = ""; - -chomp $argv0; -if (!getopts("has:crRS:D:dC:P:")) { usage_and_exit $argv0; } -if (defined $Getopt::Std::opt_h) { usage_and_exit $argv0; } -if (defined $Getopt::Std::opt_a) { 
$print_action = "yes"; } -if (defined $Getopt::Std::opt_s) { - @IBswcountlimits::suppress_errors = split(",", $Getopt::Std::opt_s); -} -if (defined $Getopt::Std::opt_c) { - @IBswcountlimits::suppress_errors = split(",", "RcvSwRelayErrors"); -} -if (defined $Getopt::Std::opt_r) { $report_port_info = $Getopt::Std::opt_r; } -if (defined $Getopt::Std::opt_R) { $regenerate_map = $Getopt::Std::opt_R; } -if (defined $Getopt::Std::opt_D) { $direct_route = $Getopt::Std::opt_D; } -if (defined $Getopt::Std::opt_S) { - $single_switch = format_guid($Getopt::Std::opt_S); -} -if (defined $Getopt::Std::opt_d) { - $include_data_counters = $Getopt::Std::opt_d; -} -if (defined $Getopt::Std::opt_C) { $ca_name = $Getopt::Std::opt_C; } -if (defined $Getopt::Std::opt_P) { $ca_port = $Getopt::Std::opt_P; } - -$cache_file = get_cache_file($ca_name, $ca_port); - -sub main -{ - if (@IBswcountlimits::suppress_errors) { - my $msg = join(",", @IBswcountlimits::suppress_errors); - print "Suppressing: $msg\n"; - } - get_link_ends($regenerate_map, $ca_name, $ca_port); - get_switches; - if (defined($direct_route)) { - # convert DR to guid, then use original single_switch option - $single_switch = convert_dr_to_guid($direct_route); - if (!defined($single_switch) || !is_switch($single_switch)) { - printf("The direct route (%s) does not map to a switch.\n", - $direct_route); - return; - } - } - foreach my $sw_addr (keys %switches) { - if ($single_switch && $sw_addr ne "$single_switch") { - next; - } else { - $switch_found = "yes"; - } - - my $switch_prompt = "no"; - foreach my $sw_port (1 .. 
$switches{$sw_addr}) { - clear_counters; - get_counts($sw_addr, $sw_port, $ca_name, $ca_port); - if (any_counts() && $switch_prompt eq "no") { - my $hr = $IBswcountlimits::link_ends{"$sw_addr"}{$sw_port}; - printf("Errors for %18s \"%s\"\n", $sw_addr, $hr->{loc_desc}); - $switch_prompt = "yes"; - } - report_counts($sw_addr, $sw_port); - } - } - if ($single_switch && $switch_found ne "yes") { - printf("Switch \"%s\" not found.\n", $single_switch); - } -} -main; - diff --git a/infiniband-diags/scripts/ibqueryerrors.pl.in b/infiniband-diags/scripts/ibqueryerrors.pl.in new file mode 100755 index 0000000..30e610c --- /dev/null +++ b/infiniband-diags/scripts/ibqueryerrors.pl.in @@ -0,0 +1,40 @@ +#!/usr/bin/perl +# +# Copyright (c) 2009 Lawrence Livermore National Security +# +# Produced at Lawrence Livermore National Laboratory. +# Written by Ira Weiny . +# +# This software is available to you under a choice of one of two +# licenses. You may choose to be licensed under the terms of the GNU +# General Public License (GPL) Version 2, available from the file +# COPYING in the main directory of this source tree, or the +# OpenIB.org BSD license below: +# +# Redistribution and use in source and binary forms, with or +# without modification, are permitted provided that the following +# conditions are met: +# +# - Redistributions of source code must retain the above +# copyright notice, this list of conditions and the following +# disclaimer. +# +# - Redistributions in binary form must reproduce the above +# copyright notice, this list of conditions and the following +# disclaimer in the documentation and/or other materials +# provided with the distribution. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +# NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS +# BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN +# ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN +# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. +# + + +# this is now just a wrapper for the C based utility +$str = join " ", @ARGV; +exec "@IBSCRIPTPATH@/ibqueryerrors $str"; diff --git a/infiniband-diags/src/ibqueryerrors.c b/infiniband-diags/src/ibqueryerrors.c new file mode 100644 index 0000000..9d96190 --- /dev/null +++ b/infiniband-diags/src/ibqueryerrors.c @@ -0,0 +1,469 @@ +/* + * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. + * Copyright (c) 2008 Lawrence Livermore National Lab. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include "ibdiag_common.h" + +char *argv0 = "ibqueryerrors"; +static FILE *f; + +struct ibmad_port *ibmad_port; +static char *node_name_map_file = NULL; +static nn_map_t *node_name_map = NULL; +int data_counters = 0; +int port_config = 0; +uint64_t switch_guid = 0; +char *switch_guid_str = NULL; +int sup_total = 0; +enum MAD_FIELDS *suppressed_fields = NULL; +char *dr_path = NULL; +int all_nodes = 0; + +static unsigned int +get_max(unsigned int num) +{ + unsigned int v = num; // 32-bit word to find the log base 2 of + unsigned r = 0; // r will be lg(v) + + while (v >>= 1) // unroll for more speed... + { + r++; + } + + return (1 << r); +} + +static void +get_msg(char *width_msg, char *speed_msg, int msg_size, ibnd_port_t *port) +{ + char buf[64]; + uint32_t max_speed = 0; + + uint32_t max_width = get_max(mad_get_field(port->info, 0, + IB_PORT_LINK_WIDTH_SUPPORTED_F) + & mad_get_field(port->remoteport->info, 0, + IB_PORT_LINK_WIDTH_SUPPORTED_F)); + if ((max_width & mad_get_field(port->info, 0, + IB_PORT_LINK_WIDTH_ACTIVE_F)) == 0) { + // we are not at the max supported width + // print what we could be at. 
+ snprintf(width_msg, msg_size, "Could be %s", + mad_dump_val(IB_PORT_LINK_WIDTH_ACTIVE_F, + buf, 64, &max_width)); + } + + max_speed = get_max(mad_get_field(port->info, 0, + IB_PORT_LINK_SPEED_SUPPORTED_F) + & mad_get_field(port->remoteport->info, 0, + IB_PORT_LINK_SPEED_SUPPORTED_F)); + if ((max_speed & mad_get_field(port->info, 0, + IB_PORT_LINK_SPEED_ACTIVE_F)) == 0) { + // we are not at the max supported speed + // print what we could be at. + snprintf(speed_msg, msg_size, "Could be %s", + mad_dump_val(IB_PORT_LINK_SPEED_ACTIVE_F, + buf, 64, &max_speed)); + } +} + +static void +print_port_config(ibnd_node_t *node, int portnum) +{ + char width[64], speed[64], state[64], physstate[64]; + char remote_str[256]; + char link_str[256]; + char width_msg[256]; + char speed_msg[256]; + char ext_port_str[256]; + int iwidth, ispeed, istate, iphystate; + int n = 0; + + ibnd_port_t *port = node->ports[portnum]; + + if (!port) + return; + + iwidth = mad_get_field(port->info, 0, IB_PORT_LINK_WIDTH_ACTIVE_F); + ispeed = mad_get_field(port->info, 0, IB_PORT_LINK_SPEED_ACTIVE_F); + istate = mad_get_field(port->info, 0, IB_PORT_STATE_F); + iphystate = mad_get_field(port->info, 0, IB_PORT_PHYS_STATE_F); + + remote_str[0] = '\0'; + link_str[0] = '\0'; + width_msg[0] = '\0'; + speed_msg[0] = '\0'; + + n = snprintf(link_str, 256, "(%3s %s %6s/%8s)", + mad_dump_val(IB_PORT_LINK_WIDTH_ACTIVE_F, width, 64, &iwidth), + mad_dump_val(IB_PORT_LINK_SPEED_ACTIVE_F, speed, 64, &ispeed), + mad_dump_val(IB_PORT_STATE_F, state, 64, &istate), + mad_dump_val(IB_PORT_PHYS_STATE_F, physstate, 64, &iphystate)); + + if (port->remoteport) { + char *remap = remap_node_name(node_name_map, port->remoteport->node->guid, + port->remoteport->node->nodedesc); + + if (port->remoteport->ext_portnum) + snprintf(ext_port_str, 256, "%d", port->remoteport->ext_portnum); + else + ext_port_str[0] = '\0'; + + get_msg(width_msg, speed_msg, 256, port); + + snprintf(remote_str, 256, + "0x%016"PRIx64" %6d %4d[%2s] \"%s\" 
(%s %s)\n", + port->remoteport->node->guid, + port->remoteport->base_lid ? port->remoteport->base_lid : + port->remoteport->node->smalid, + port->remoteport->portnum, + ext_port_str, + remap, + width_msg, + speed_msg); + free(remap); + } else + snprintf(remote_str, 256, " [ ] \"\" ( )\n"); + + if (port->ext_portnum) + snprintf(ext_port_str, 256, "%d", port->ext_portnum); + else + ext_port_str[0] = '\0'; + + if (node->type == IB_NODE_SWITCH) + printf(" %6d", node->smalid); + else + printf(" %6d", port->base_lid); + + printf("%4d[%2s] ==%s==> %s", + port->portnum, ext_port_str, link_str, remote_str); +} + +static int +suppress(enum MAD_FIELDS field) +{ + int i = 0; + if (suppressed_fields) + for (i = 0; i < sup_total; i++) { + if (field == suppressed_fields[i]) + return (1); + } + return (0); +} + +static void +report_suppressed(void) +{ + int i = 0; + if (suppressed_fields) { + printf("Suppressing:"); + for (i = 0; i < sup_total; i++) { + printf(" %s", mad_field_name(suppressed_fields[i])); + } + printf("\n"); + } +} + +static void +print_results(ibnd_node_t *node, uint8_t *pc, int portnum) +{ + char buf[1024]; + char *str = buf; + uint32_t val = 0; + int n = 0; + int i = 0; + + for (n = 0, i = IB_PC_ERR_SYM_F; i <= IB_PC_VL15_DROPPED_F; i++) { + if (suppress(i)) + continue; + + mad_decode_field(pc, i, (void *)&val); + if (val) + n += snprintf(str+n, 1024-n, " [%s == %d]", + mad_field_name(i), val); + } + + if (!suppress(IB_PC_XMT_WAIT_F)) { + mad_decode_field(pc, IB_PC_XMT_WAIT_F, (void *)&val); + if (val) + n += snprintf(str+n, 1024-n, " [%s == %d]", mad_field_name(i), val); + } + + /* if we found errors. 
*/ + if (n != 0) { + char *nodename = remap_node_name(node_name_map, node->guid, node->nodedesc); + if (data_counters) + for (i = IB_PC_XMT_BYTES_F; i <= IB_PC_RCV_PKTS_F; i++) { + uint64_t val64 = 0; + mad_decode_field(pc, i, (void *)&val64); + if (val64) + n += snprintf(str+n, 1024-n, " [%s == %"PRId64"]", + mad_field_name(i), val64); + } + + printf("Errors for 0x%" PRIx64 " \"%s\"\n", node->guid, nodename); + printf(" GUID 0x%" PRIx64 " port %d:%s\n", + node->guid, portnum, str); + if (port_config) + print_port_config(node, portnum); + free(nodename); + } +} + +static void +print_port(ibnd_node_t *node, int portnum) +{ + uint8_t pc[1024]; + uint16_t cap_mask; + ib_portid_t portid = {0}; + char *nodename = remap_node_name(node_name_map, node->guid, node->nodedesc); + + if (node->type == IB_NODE_SWITCH) + ib_portid_set(&portid, node->smalid, 0, 0); + else + ib_portid_set(&portid, node->ports[portnum]->base_lid, 0, 0); + + /* PerfMgt ClassPortInfo is a required attribute */ + if (!pma_query_via(pc, &portid, portnum, ibd_timeout, CLASS_PORT_INFO, + ibmad_port)) { + IBWARN("classportinfo query failed on %s, %s port %d", + nodename, portid2str(&portid), portnum); + goto cleanup; + } + /* ClassPortInfo should be supported as part of libibmad */ + memcpy(&cap_mask, pc + 2, sizeof(cap_mask)); /* CapabilityMask */ + + if (!pma_query_via(pc, &portid, portnum, ibd_timeout, + IB_GSI_PORT_COUNTERS, + ibmad_port)) { + IBWARN("IB_GSI_PORT_COUNTERS query failed on %s, %s port %d\n", + nodename, portid2str(&portid), portnum); + goto cleanup; + } + if (!(cap_mask & 0x1000)) { + /* if PortCounters:PortXmitWait not suppported clear this counter */ + uint32_t foo = 0; + mad_encode_field(pc, IB_PC_XMT_WAIT_F, &foo); + } + print_results(node, pc, portnum); + +cleanup: + free(nodename); +} + +void +print_node(ibnd_node_t *node, void *user_data) +{ + int p = 0; + int startport = 1; + + if (!all_nodes && node->type != IB_NODE_SWITCH) + return; + + if (node->type == IB_NODE_SWITCH && 
node->smaenhsp0) + startport = 0; + + for (p = startport; p <= node->numports; p++) { + if (node->ports[p]) { + print_port(node, p); + } + } +} + +static void +add_suppressed(enum MAD_FIELDS field) +{ + suppressed_fields = realloc(suppressed_fields, sizeof(enum MAD_FIELDS)); + suppressed_fields[sup_total] = field; + sup_total++; +} + +static void +calculate_suppressed_fields(char *str) +{ + enum MAD_FIELDS f = 0; + char *tmp = strdup(str); + char *lasts, *val; + + val = strtok_r(tmp, ",", &lasts); + while (val) { + for (f = IB_PC_FIRST_F; f <= IB_PC_LAST_F; f++) { + if (strcmp(val, mad_field_name(f)) == 0) { + add_suppressed(f); + } + } + val = strtok_r(NULL, ",", &lasts); + } + + free(tmp); +} + +static int process_opt(void *context, int ch, char *optarg) +{ + switch (ch) { + case 's': + calculate_suppressed_fields(optarg); + break; + case 'c': + /* Right now this is the only "common" error */ + add_suppressed(IB_PC_ERR_SWITCH_REL_F); + break; + case 1: + node_name_map_file = strdup(optarg); + break; + case 2: + data_counters++; + break; + case 3: + all_nodes++; + break; + case 'S': + switch_guid_str = strdup(optarg); + switch_guid = (uint64_t)strtoull(switch_guid_str, 0, 0); + break; + case 'D': + dr_path = strdup(optarg); + break; + case 'r': + port_config++; + break; + case 'R': /* nop */ + break; + default: + return -1; + } + + return 0; +} + +int +main(int argc, char **argv) +{ + int rc = 0; + ibnd_fabric_t *fabric = NULL; + + int mgmt_classes[4] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS, IB_PERFORMANCE_CLASS}; + + const struct ibdiag_opt opts[] = { + { "suppress", 's', 1, "", "suppress errors listed" }, + { "suppress-common", 'c', 0, NULL, "suppress some of the common counters" }, + { "node-name-map", 1, 1, "", "node name map file" }, + { "switch", 'S', 1, "", "query only (hex format)"}, + { "Direct", 'D', 1, "", "query only switch specified by "}, + { "report-port", 'r', 0, NULL, "report port configuration information"}, + { "GNDN", 'R', 0, NULL, 
"(This option is obsolete and does nothing)"}, + { "data", 2, 0, NULL, "include the data counters in the output"}, + { "all", 3, 0, NULL, "output all nodes (not just switches)"}, + { 0 } + }; + char usage_args[] = ""; + + ibdiag_process_opts(argc, argv, "sDLG", "snSrR", opts, process_opt, + usage_args, NULL); + + f = stdout; + + argc -= optind; + argv += optind; + + if (ibverbose) + ibnd_debug(1); + + ibmad_port = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 4); + if (!ibmad_port) + IBERROR("Failed to open port; %s:%d\n", ibd_ca, ibd_ca_port); + + node_name_map = open_node_name_map(node_name_map_file); + + if (switch_guid) { + /* limit the scan the fabric around the target */ + ib_portid_t portid = {0}; + + if (ib_resolve_portid_str_via(&portid, switch_guid_str, IB_DEST_GUID, + ibd_sm_id, ibmad_port) < 0) { + fprintf(stderr, "can't resolve destination port %s %p\n", + switch_guid_str, ibd_sm_id); + rc = 1; + goto close_port; + } + + if ((fabric = ibnd_discover_fabric(ibmad_port, ibd_timeout, &portid, 1)) == NULL) { + fprintf(stderr, "discover failed\n"); + rc = 1; + goto close_port; + } + } else { + if ((fabric = ibnd_discover_fabric(ibmad_port, ibd_timeout, NULL, -1)) == NULL) { + fprintf(stderr, "discover failed\n"); + rc = 1; + goto close_port; + } + } + + report_suppressed(); + + if (switch_guid) { + ibnd_node_t *node = ibnd_find_node_guid(fabric, switch_guid); + print_node(node, NULL); + } else if (dr_path) { + ibnd_node_t *node = ibnd_find_node_dr(fabric, dr_path); + print_node(node, NULL); + } else + ibnd_iter_nodes(fabric, print_node, NULL); + + ibnd_destroy_fabric(fabric); + +close_port: + mad_rpc_close_port(ibmad_port); + close_node_name_map(node_name_map); + exit(rc); +} -- 1.5.4.5 From weiny2 at llnl.gov Wed Apr 22 18:54:55 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 22 Apr 2009 18:54:55 -0700 Subject: [ofa-general] [PATCH 5/5] Various bug fixes to the tools I have found Message-ID: <20090422185455.47c804dc.weiny2@llnl.gov> >From 
808598c4cc2ed2b4b3271d623c8d564448391e8d Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Wed, 22 Apr 2009 18:44:17 -0700 Subject: [PATCH] Various bug fixes to the tools I have found Signed-off-by: Ira Weiny --- infiniband-diags/scripts/iblinkinfo.pl.in | 2 +- infiniband-diags/src/iblinkinfo.c | 10 ++++++---- infiniband-diags/src/ibnetdiscover.c | 2 +- 3 files changed, 8 insertions(+), 6 deletions(-) diff --git a/infiniband-diags/scripts/iblinkinfo.pl.in b/infiniband-diags/scripts/iblinkinfo.pl.in index c81570d..0ce33ab 100755 --- a/infiniband-diags/scripts/iblinkinfo.pl.in +++ b/infiniband-diags/scripts/iblinkinfo.pl.in @@ -35,6 +35,6 @@ # -# this is not just a wrapper for the C based utility +# this is now just a wrapper for the C based utility $str = join " ", @ARGV; exec "@IBSCRIPTPATH@/iblinkinfo $str"; diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c index 16728cb..82c2ce8 100644 --- a/infiniband-diags/src/iblinkinfo.c +++ b/infiniband-diags/src/iblinkinfo.c @@ -121,15 +121,17 @@ print_port(ibnd_node_t *node, ibnd_port_t *port) char width_msg[256]; char speed_msg[256]; char ext_port_str[256]; - int iwidth = mad_get_field(port->info, 0, IB_PORT_LINK_WIDTH_ACTIVE_F); - int ispeed = mad_get_field(port->info, 0, IB_PORT_LINK_SPEED_ACTIVE_F); - int istate = mad_get_field(port->info, 0, IB_PORT_STATE_F); - int iphystate = mad_get_field(port->info, 0, IB_PORT_PHYS_STATE_F); + int iwidth, ispeed, istate, iphystate; int n = 0; if (!port) return; + iwidth = mad_get_field(port->info, 0, IB_PORT_LINK_WIDTH_ACTIVE_F); + ispeed = mad_get_field(port->info, 0, IB_PORT_LINK_SPEED_ACTIVE_F); + istate = mad_get_field(port->info, 0, IB_PORT_STATE_F); + iphystate = mad_get_field(port->info, 0, IB_PORT_PHYS_STATE_F); + remote_guid_str[0] = '\0'; remote_str[0] = '\0'; link_str[0] = '\0'; diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index 4cd0b37..69fc5fb 100644 --- a/infiniband-diags/src/ibnetdiscover.c
+++ b/infiniband-diags/src/ibnetdiscover.c @@ -690,7 +690,7 @@ int main(int argc, char **argv) node_name_map = open_node_name_map(node_name_map_file); if ((fabric = ibnd_discover_fabric(ibmad_port, ibd_timeout, NULL, -1)) == NULL) - IBERROR(stderr, "discover failed\n"); + IBERROR("discover failed\n"); if (ports_report) ibnd_iter_nodes(fabric, -- 1.5.4.5 From weiny2 at llnl.gov Wed Apr 22 19:22:52 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 22 Apr 2009 19:22:52 -0700 Subject: [ofa-general] Re: [PATCH 1/5 v2] change ibnd_discover_fabric to receive ibmad_port In-Reply-To: <20090422185444.d22f1f84.weiny2@llnl.gov> References: <20090422185444.d22f1f84.weiny2@llnl.gov> Message-ID: <20090422192252.f1e0de38.weiny2@llnl.gov> I already found a bug in this patch... You should check for ibmad_port being NULL before you use it. ;-) v2 is below. From: Ira Weiny Date: Wed, 22 Apr 2009 19:20:03 -0700 Subject: [PATCH] change ibnd_discover_fabric to receive ibmad_port In order to allow ibmad_port to be opened with additional classes libibnetdisc should accept an ibmad_port as a parameter. The library will error out if the classes it needs are not opened. 
Signed-off-by: Ira Weiny --- .../libibnetdisc/include/infiniband/ibnetdisc.h | 6 +- .../libibnetdisc/man/ibnd_discover_fabric.3 | 21 ++++++-- infiniband-diags/libibnetdisc/src/ibnetdisc.c | 54 +++++++++----------- infiniband-diags/libibnetdisc/src/internal.h | 1 - infiniband-diags/libibnetdisc/test/testleaks.c | 20 ++++++-- 5 files changed, 60 insertions(+), 42 deletions(-) diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h index a882994..7eaca24 100644 --- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h @@ -124,6 +124,7 @@ typedef struct chassis { * Main fabric object which is returned and represents the data discovered */ typedef struct ib_fabric { + struct ibmad_port *ibmad_port; /* the node the discover was initiated from * "from" parameter in ibnd_discover_fabric * or by default the node you are running on @@ -143,11 +144,10 @@ typedef struct ib_fabric { void ibnd_debug(int i); void ibnd_show_progress(int i); -ibnd_fabric_t *ibnd_discover_fabric(char *dev_name, int dev_port, +ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms, ib_portid_t *from, int hops); /** - * dev_name: (required) local device name to use to access the fabric - * dev_port: (required) local device port to use to access the fabric + * ibmad_port: (required) ibmad_port object from libibmad * timeout_ms: (required) gives the timeout for a _SINGLE_ query on * the fabric. So if there are multiple nodes not * responding this may result in a lengthy delay.
diff --git a/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 b/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 index 44d8c65..c832c11 100644 --- a/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 +++ b/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 @@ -5,7 +5,7 @@ ibnd_discover_fabric, ibnd_destroy_fabric, ibnd_debug ibnd_show_progress \- init .nf .B #include .sp -.BI "ibnd_fabric_t *ibnd_discover_fabric(char *dev_name, int dev_port, int timeout_ms, ib_portid_t *from, int hops)" +.BI "ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms, ib_portid_t *from, int hops)" .BI "void ibnd_destroy_fabric(ibnd_fabric_t *fabric)" .BI "void ibnd_debug(int i)" .BI "void ibnd_show_progress(int i)" @@ -13,7 +13,10 @@ ibnd_discover_fabric, ibnd_destroy_fabric, ibnd_debug ibnd_show_progress \- init .SH "DESCRIPTION" .B ibnd_discover_fabric() -Discover the fabric connected to the port specified by dev_name and dev_port, using a timeout specified. The "from" and "hops" parameters are optional and allow one to scan part of a fabric by specifying a node "from" and a number of hops away from that node to scan, "hops". This gives the user a "sub-fabric" which is "centered" anywhere they chose. +Discover the fabric connected to the port specified by ibmad_port, using a timeout specified. The "from" and "hops" parameters are optional and allow one to scan part of a fabric by specifying a node "from" and a number of hops away from that node to scan, "hops". This gives the user a "sub-fabric" which is "centered" anywhere they choose. + +ibmad_port must be opened with at least IB_SMI_CLASS and IB_SMI_DIRECT_CLASS +classes for ibnd_discover_fabric to work. .B ibnd_destroy_fabric() free all memory and resources associated with the fabric. @@ -36,13 +39,23 @@ NONE .B Discover the entire fabric connected to device "mthca0", port 1.
- ibnd_discover_fabric("mthca0", 1, 100, NULL, 0); + int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; + struct ibmad_port *ibmad_port = mad_rpc_open_port(ca, ca_port, mgmt_classes, 2); + ibnd_fabric_t *fabric = ibnd_discover_fabric(ibmad_port, 100, NULL, 0); + ... + ibnd_destroy_fabric(fabric); + mad_rpc_close_port(ibmad_port); .B Discover only a single node and those nodes connected to it. + ... str2drpath(&(port_id.drpath), from, 0, 0); + ... + ibnd_discover_fabric(ibmad_port, 100, &port_id, 1); + ... - ibnd_discover_fabric("mthca0", 1, 100, &port_id, 1); +.SH "SEE ALSO" + libibmad, mad_rpc_open_port .SH "AUTHORS" .TP diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index 479bae7..410e2dd 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -79,7 +79,7 @@ get_port_info(struct ibnd_fabric *fabric, struct ibnd_port *port, IB_PORT_LINK_SPEED_ACTIVE_F); if (!smp_query_via(port->port.info, portid, IB_ATTR_PORT_INFO, portnum, timeout_ms, - fabric->ibmad_port)) + fabric->fabric.ibmad_port)) return -1; decode_port_info(&(port->port)); @@ -100,7 +100,7 @@ static int query_node_info(struct ibnd_fabric *fabric, struct ibnd_node *node, ib_portid_t *portid) { if (!smp_query_via(&(node->node.info), portid, IB_ATTR_NODE_INFO, 0, timeout_ms, - fabric->ibmad_port)) + fabric->fabric.ibmad_port)) return -1; /* decode just a couple of fields for quicker reference. 
*/ @@ -130,11 +130,11 @@ query_node(struct ibnd_fabric *fabric, struct ibnd_node *inode, port->guid = mad_get_field64(node->info, 0, IB_NODE_PORT_GUID_F); if (!smp_query_via(nd, portid, IB_ATTR_NODE_DESC, 0, timeout_ms, - fabric->ibmad_port)) + fabric->fabric.ibmad_port)) return -1; if (!smp_query_via(port->info, portid, IB_ATTR_PORT_INFO, 0, timeout_ms, - fabric->ibmad_port)) + fabric->fabric.ibmad_port)) return -1; decode_port_info(port); @@ -146,7 +146,7 @@ query_node(struct ibnd_fabric *fabric, struct ibnd_node *inode, /* after we have the sma information find out the real PortInfo for this port */ if (!smp_query_via(port->info, portid, IB_ATTR_PORT_INFO, port->portnum, timeout_ms, - fabric->ibmad_port)) + fabric->fabric.ibmad_port)) return -1; decode_port_info(port); @@ -154,7 +154,7 @@ query_node(struct ibnd_fabric *fabric, struct ibnd_node *inode, port->lmc = node->smalmc; if (!smp_query_via(node->switchinfo, portid, IB_ATTR_SWITCH_INFO, 0, timeout_ms, - fabric->ibmad_port)) + fabric->fabric.ibmad_port)) node->smaenhsp0 = 0; /* assume base SP0 */ else mad_decode_field(node->switchinfo, IB_SW_ENHANCED_PORT0_F, &node->smaenhsp0); @@ -241,7 +241,7 @@ ibnd_update_node(ibnd_node_t *node) return (NULL); if (!smp_query_via(nd, &(n->node.path_portid), IB_ATTR_NODE_DESC, 0, timeout_ms, - f->ibmad_port)) + f->fabric.ibmad_port)) return (NULL); /* update all the port info's */ @@ -253,14 +253,14 @@ ibnd_update_node(ibnd_node_t *node) goto done; if (!smp_query_via(portinfo_port0, &(n->node.path_portid), IB_ATTR_PORT_INFO, 0, timeout_ms, - f->ibmad_port)) + f->fabric.ibmad_port)) return (NULL); n->node.smalid = mad_get_field(portinfo_port0, 0, IB_PORT_LID_F); n->node.smalmc = mad_get_field(portinfo_port0, 0, IB_PORT_LMC_F); if (!smp_query_via(node->switchinfo, &(n->node.path_portid), IB_ATTR_SWITCH_INFO, 0, timeout_ms, - f->ibmad_port)) + f->fabric.ibmad_port)) node->smaenhsp0 = 0; /* assume base SP0 */ else mad_decode_field(node->switchinfo, IB_SW_ENHANCED_PORT0_F, 
&n->node.smaenhsp0); @@ -476,17 +476,8 @@ get_remote_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_ return 0; } -static void * -ibnd_init_port(char *dev_name, int dev_port) -{ - int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; - - /* Crank up the mad lib */ - return (mad_rpc_open_port(dev_name, dev_port, mgmt_classes, 2)); -} - ibnd_fabric_t * -ibnd_discover_fabric(char *dev_name, int dev_port, int timeout_ms, +ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms, ib_portid_t *from, int hops) { struct ibnd_fabric *fabric = NULL; @@ -500,15 +491,27 @@ ibnd_discover_fabric(char *dev_name, int dev_port, int timeout_ms, ib_portid_t *path; int max_hops = MAXHOPS-1; /* default find everything */ + if (!ibmad_port) { + IBPANIC("ibmad_port must be specified to " + "ibnd_discover_fabric\n"); + return (NULL); + } + if (mad_rpc_class_agent(ibmad_port, IB_SMI_CLASS) == -1 + || + mad_rpc_class_agent(ibmad_port, IB_SMI_DIRECT_CLASS) == -1) { + IBPANIC("ibmad_port must be opened with " + "IB_SMI_CLASS && IB_SMI_DIRECT_CLASS\n"); + return (NULL); + } + /* if not everything how much? 
*/ if (hops >= 0) { max_hops = hops; } /* If not specified start from "my" port */ - if (!from) { + if (!from) from = &my_portid; - } fabric = malloc(sizeof(*fabric)); @@ -519,12 +522,7 @@ ibnd_discover_fabric(char *dev_name, int dev_port, int timeout_ms, memset(fabric, 0, sizeof(*fabric)); - fabric->ibmad_port = ibnd_init_port(dev_name, dev_port); - if (!fabric->ibmad_port) { - IBPANIC("OOM: failed to open \"%s\" port %d\n", - dev_name, dev_port); - goto error; - } + fabric->fabric.ibmad_port = ibmad_port; IBND_DEBUG("from %s\n", portid2str(from)); @@ -633,8 +631,6 @@ ibnd_destroy_fabric(ibnd_fabric_t *fabric) node = next; } } - if (f->ibmad_port) - mad_rpc_close_port(f->ibmad_port); free(f); } diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h index afed25e..4e6bb18 100644 --- a/infiniband-diags/libibnetdisc/src/internal.h +++ b/infiniband-diags/libibnetdisc/src/internal.h @@ -79,7 +79,6 @@ struct ibnd_fabric { ibnd_fabric_t fabric; /* internal use only */ - void *ibmad_port; struct ibnd_node *nodestbl[HTSZ]; struct ibnd_port *portstbl[HTSZ]; struct ibnd_node *nodesdist[MAXHOPS+1]; diff --git a/infiniband-diags/libibnetdisc/test/testleaks.c b/infiniband-diags/libibnetdisc/test/testleaks.c index 1fabaac..0d009c3 100644 --- a/infiniband-diags/libibnetdisc/test/testleaks.c +++ b/infiniband-diags/libibnetdisc/test/testleaks.c @@ -84,6 +84,7 @@ usage(void) int main(int argc, char **argv) { + int rc = 0; char *ca = 0; int ca_port = 0; ibnd_fabric_t *fabric = NULL; @@ -94,6 +95,9 @@ main(int argc, char **argv) ib_portid_t port_id; int iters = -1; + struct ibmad_port *ibmad_port; + int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; + static char const str_opts[] = "S:D:n:C:P:t:shuf:i:"; static const struct option long_opts[] = { { "S", 1, 0, 'S'}, @@ -155,25 +159,31 @@ main(int argc, char **argv) argc -= optind; argv += optind; + ibmad_port = mad_rpc_open_port(ca, ca_port, mgmt_classes, 2); + while (iters == 
-1 || iters-- > 0) { if (from) { /* only scan part of the fabric */ str2drpath(&(port_id.drpath), from, 0, 0); - if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, + if ((fabric = ibnd_discover_fabric(ibmad_port, timeout_ms, &port_id, hops)) == NULL) { fprintf(stderr, "discover failed\n"); - exit(1); + rc = 1; + goto close_port; } guid = 0; } else { - if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, NULL, -1)) == NULL) { + if ((fabric = ibnd_discover_fabric(ibmad_port, timeout_ms, NULL, -1)) == NULL) { fprintf(stderr, "discover failed\n"); - exit(1); + rc = 1; + goto close_port; } } ibnd_destroy_fabric(fabric); } - exit(0); +close_port: + mad_rpc_close_port(ibmad_port); + exit(rc); } -- 1.5.4.5 From sashak at voltaire.com Wed Apr 22 23:42:34 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 23 Apr 2009 09:42:34 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH 0/5] Follow on patch series to libibnetdisc including converting ibqueryerrors.pl In-Reply-To: <20090422185441.6f8601dc.weiny2@llnl.gov> References: <20090422185441.6f8601dc.weiny2@llnl.gov> Message-ID: <20090423064234.GB7267@sk> On 18:54 Wed 22 Apr , Ira Weiny wrote: > > When do you plan to merge pq/ibn3? I'm working on this now ( need to fix few small issues there). Sasha From sashak at voltaire.com Thu Apr 23 00:02:10 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 23 Apr 2009 10:02:10 +0300 Subject: ***SPAM*** Re: [ofa-general] [PATCH v3 1/3] Create a new library libibnetdisc In-Reply-To: References: <20090403154251.dec181f2.weiny2@llnl.gov> Message-ID: <20090423070210.GA8281@sk> On 15:41 Fri 17 Apr , Hefty, Sean wrote: > >+void > >+ibnd_debug(int i) > >+{ > >+ if (i) { > >+ ibdebug++; > >+ madrpc_show_errors(1); > >+ umad_debug(i); > >+ } else { > >+ ibdebug = 0; > >+ madrpc_show_errors(0); > >+ umad_debug(0); > >+ } > >+} > > Where does the definition for ibdebug come from? It is in ibdiag_common.c. Every infiniband-ibdiag tool is linked with it. 
And yes, using this in this library can be problematic since it introduces a "hidden" dependency. Sasha From sashak at voltaire.com Thu Apr 23 00:08:29 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 23 Apr 2009 10:08:29 +0300 Subject: [ofa-general] Re: [PATCH v3 1/3] Create a new library libibnetdisc In-Reply-To: <20090403154251.dec181f2.weiny2@llnl.gov> References: <20090403154251.dec181f2.weiny2@llnl.gov> Message-ID: <20090423070829.GB8281@sk> On 15:42 Fri 03 Apr , Ira Weiny wrote: > From e1c2c10678b0d1d90f7eb31eb1c1441b5ee43311 Mon Sep 17 00:00:00 2001 Please mask (or remove) the "From ..." line from the commit message - 'git rebase' uses this to split patches (this is the email message delimiter in mbox format). > From: Ira Weiny > Date: Fri, 3 Apr 2009 15:28:08 -0700 > Subject: [PATCH] Create a new library libibnetdisc > > This encompasses the functionality of ibnetdiscover in a C library. It returns > a single "ibnd_fabric_t" object which represents the data found during the > scan. The NodeInfo, PortInfo, and SwitchInfo are preserved from the queries > made on the fabric to be used by the calling function as they see fit. > > This greatly benefits some diags like iblinkinfo.pl. This diag in particular > was re-written using this library in C and has shown an 85% speed up on a ~1000 > node cluster. > > Previous iblinkinfo.pl > real 3m35.876s > user 0m13.210s > sys 1m1.046s > > New iblinkinfotest > real 0m32.869s > user 0m0.067s > sys 0m0.140s > > Signed-off-by: Ira Weiny Applied. Thanks. I think the issue of using 'ibdebug' should be addressed in a subsequent patch. Sasha From sashak at voltaire.com Thu Apr 23 00:15:42 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 23 Apr 2009 10:15:42 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH v3 2/3] Convert iblinkinfo.pl to C and use new ibnetdisc library.
In-Reply-To: <20090403154254.5ab60589.weiny2@llnl.gov> References: <20090403154254.5ab60589.weiny2@llnl.gov> Message-ID: <20090423071542.GC8281@sk> On 15:42 Fri 03 Apr , Ira Weiny wrote: > From a677ae35fe7a5966f05b5859df8f00e9b18df864 Mon Sep 17 00:00:00 2001 You know :) > From: Ira Weiny > Date: Fri, 3 Apr 2009 15:28:18 -0700 > Subject: [PATCH] Convert iblinkinfo.pl to C and use new ibnetdisc library. > > Signed-off-by: Ira Weiny Applied. Thanks. Couple of notes are below. > --- > infiniband-diags/Makefile.am | 7 +- > infiniband-diags/configure.in | 1 + > infiniband-diags/scripts/iblinkinfo.pl | 327 ------------------------ > infiniband-diags/scripts/iblinkinfo.pl.in | 40 +++ > infiniband-diags/src/iblinkinfo.c | 386 +++++++++++++++++++++++++++++ > 5 files changed, 432 insertions(+), 329 deletions(-) > delete mode 100755 infiniband-diags/scripts/iblinkinfo.pl > create mode 100755 infiniband-diags/scripts/iblinkinfo.pl.in > create mode 100644 infiniband-diags/src/iblinkinfo.c > > diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am > index 7b8523a..b480a4a 100644 > --- a/infiniband-diags/Makefile.am > +++ b/infiniband-diags/Makefile.am > @@ -1,6 +1,7 @@ > SUBDIRS = libibnetdisc > > -INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband > +INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband \ > + -I$(top_builddir)/libibnetdisc/include Here $(top_srcdir) should be used instead of $(top_builddir). 'top_builddir' points project build directory, not source directory (where not generated header files are located). And build like: mkdir /tmp/ib-diags && cd /tmp/ib-diags /path/to/management/infiniband-diags/configure && make will fail. I'm fixing this. 
> if DEBUG > DBGFLAGS = -ggdb -D_DEBUG_ > @@ -11,7 +12,7 @@ endif > sbin_PROGRAMS = src/ibaddr src/ibnetdiscover src/ibping src/ibportstate \ > src/ibroute src/ibstat src/ibsysstat src/ibtracert \ > src/perfquery src/sminfo src/smpdump src/smpquery \ > - src/saquery src/vendstat > + src/saquery src/vendstat src/iblinkinfo > > if ENABLE_TEST_UTILS > sbin_PROGRAMS += src/ibsendtrap src/mcm_rereg_test > @@ -55,6 +56,8 @@ src_saquery_SOURCES = src/saquery.c > src_ibsendtrap_SOURCES = src/ibsendtrap.c > src_vendstat_SOURCES = src/vendstat.c > src_mcm_rereg_test_SOURCES = src/mcm_rereg_test.c > +src_iblinkinfo_SOURCES = src/iblinkinfo.c > +src_iblinkinfo_LDADD = -libnetdisc I think here should be '-L$(top_builddir)/libibnetdisc -libnetdisc'. Otherwise we are assuming pre-installed library (we could run 'make install', but it fails due to 'make' error :)). Adding this too. > > man_MANS = man/ibaddr.8 man/ibcheckerrors.8 man/ibcheckerrs.8 \ > man/ibchecknet.8 man/ibchecknode.8 man/ibcheckport.8 \ [snip...] 
> +void > +usage(void) > +{ > + fprintf(stderr, > + "Usage: %s [-hclp -S -D -C -P ]\n" > + " Report link speed and connection for each port of each switch which is active\n" > + " -h This help message\n" > + " -S output only the node specified by guid\n" > + " -D print only node specified by \n" > + " -f specify node to start \"from\"\n" > + " -n Number of hops to include away from specified node\n" > + " -d print only down links\n" > + " -l (line mode) print all information for each link on each line\n" > + " -p print additional switch settings (PktLifeTime,HoqLife,VLStallCount)\n" > + > + > + " -t timeout for any single fabric query\n" > + " -s show progress during scan\n" > + " --node-name-map use specified node name map\n" > + > + " -C use selected Channel Adaptor name for queries\n" > + " -P use selected channel adaptor port for queries\n" > + " -g print port guids instead of node guids\n" > + " --debug print debug messages\n" > + " -R (this option is obsolete and does nothing)\n" > + , > + argv0); > + exit(-1); > +} > + > +int > +main(int argc, char **argv) > +{ > + char *ca = 0; > + int ca_port = 0; > + ibnd_fabric_t *fabric = NULL; > + uint64_t guid = 0; > + char *dr_path = NULL; > + char *from = NULL; > + int hops = 0; > + ib_portid_t port_id; > + > + static char const str_opts[] = "S:D:n:C:P:t:sldgphuf:R"; > + static const struct option long_opts[] = { > + { "S", 1, 0, 'S'}, > + { "D", 1, 0, 'D'}, > + { "num-hops", 1, 0, 'n'}, > + { "down-links-only", 0, 0, 'd'}, > + { "line-mode", 0, 0, 'l'}, > + { "ca-name", 1, 0, 'C'}, > + { "ca-port", 1, 0, 'P'}, > + { "timeout", 1, 0, 't'}, > + { "show", 0, 0, 's'}, > + { "print-port-guids", 0, 0, 'g'}, > + { "print-additional", 0, 0, 'p'}, > + { "help", 0, 0, 'h'}, > + { "usage", 0, 0, 'u'}, > + { "node-name-map", 1, 0, 1}, > + { "debug", 0, 0, 2}, > + { "compat", 0, 0, 3}, > + { "from", 1, 0, 'f'}, > + { "R", 0, 0, 'R'}, > + { } > + }; > + > + f = stdout; > + > + argv0 = argv[0]; > + > + while (1) { > + int ch = 
getopt_long(argc, argv, str_opts, long_opts, NULL); Any reason to not use new ibdiag_process_opts() here? It should simplify an options processing and unify the usage message. Of course this can be done as subsequent patch. Sasha From celine.bourde at ext.bull.net Thu Apr 23 01:10:40 2009 From: celine.bourde at ext.bull.net (Celine Bourde) Date: Thu, 23 Apr 2009 10:10:40 +0200 Subject: [Fwd: Re: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition]] Message-ID: <49F02280.7010005@ext.bull.net> Hi, I've updated nfs-utils package: [root at my_host ~]# mount.nfs -V mount.nfs (linux nfs-utils 1.1.6) >[root at my_host ~]# strace mount.nfs 192.168.0.215:/vol0 /mnt/ -o rdma,port=2050 >Does it work without rdma? The problem is exactly the same without rdma: [root at my_host ~]# strace mount.nfs 192.168.0.215:/vol0 /mnt/ -o rw,port=2050 [..] socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 3 bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 fcntl(3, F_GETFL) = 0x2 (flags O_RDWR) fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 connect(3, {sa_family=AF_INET, sin_port=htons(997), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 fcntl(3, F_SETFL, O_RDWR) = 0 connect(3, {sa_family=AF_UNSPEC, sa_data="\0o\177\0\0\1\0\0\0\0\0\0\0\0"}, 16) = 0 sendto(3, "\0308\310\272\0\0\0\0\0\0\0\2\0\1\206\270\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0"..., 40, 0, {sa_family=AF_INET, sin_port=htons(997), sin_addr=inet_addr("127.0.0.1")}, 16) = 40 poll([{fd=3, events=POLLIN}], 1, 3000) = 1 ([{fd=3, revents=POLLIN}]) recvfrom(3, "\0308\310\272\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 400, MSG_DONTWAIT, {sa_family=AF_INET, sin_port=htons(997), sin_addr=inet_addr("127.0.0.1")}, [16]) = 24 close(3) = 0 mount("192.168.0.215:/vol0", "/mnt", "nfs", 0, "port=2050,addr=192.168.0.215" .. and it blocks ! 
[root at my_host ~]# dmesg rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16 rpcrdma: connection to 192.168.0.215:2050 closed (-103) rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16 rpcrdma: connection to 192.168.0.215:2050 closed (-103) >Celine, can you establish regular rdma connections over your IB link? >Like rping? I can't establish an rdma connection; the following errors occur with rping: [root at my_host ~]# rping -c 192.168.0.214 cq completion failed status 5 wait for CONNECTED state 10 connect error -1 cma event RDMA_CM_EVENT_REJECTED, error 8 My kernel is a 2.6.27 kernel.org build on a Red Hat distribution: Red Hat Enterprise Linux Server release 5.3 Beta (Tikanga). Céline. From sashak at voltaire.com Thu Apr 23 01:25:35 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 23 Apr 2009 11:25:35 +0300 Subject: [ofa-general] ***SPAM*** Re: [PATCH v3 3/3] Convert ibnetdiscover to use new ibnetdisc library. In-Reply-To: <20090403154301.f656e7a4.weiny2@llnl.gov> References: <20090403154301.f656e7a4.weiny2@llnl.gov> Message-ID: <20090423082535.GD8281@sk> On 15:43 Fri 03 Apr , Ira Weiny wrote: > From e506ac4d6accefb49b89811cc9dd77775ad481f7 Mon Sep 17 00:00:00 2001 > From: Ira Weiny > Date: Fri, 3 Apr 2009 15:28:29 -0700 > Subject: [PATCH] Convert ibnetdiscover to use new ibnetdisc library. > > All other functionality is preserved And what about the '-v' and '-e' options? Why are they removed from the man page?
> Signed-off-by: Ira Weiny > --- > infiniband-diags/Makefile.am | 5 +- > infiniband-diags/include/grouping.h | 113 --- > infiniband-diags/libibnetdisc/src/chassis.c | 20 +- > infiniband-diags/libibnetdisc/src/ibnetdisc.c | 5 +- > infiniband-diags/man/ibnetdiscover.8 | 10 +- > infiniband-diags/src/grouping.c | 785 -------------------- > infiniband-diags/src/ibnetdiscover.c | 974 +++++++++---------------- > 7 files changed, 345 insertions(+), 1567 deletions(-) > delete mode 100644 infiniband-diags/include/grouping.h > delete mode 100644 infiniband-diags/src/grouping.c > > diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am > index b480a4a..19b992c 100644 > --- a/infiniband-diags/Makefile.am > +++ b/infiniband-diags/Makefile.am > @@ -41,7 +41,8 @@ LDADD = libcommon.a > > libcommon_a_SOURCES = src/ibdiag_common.c > src_ibaddr_SOURCES = src/ibaddr.c > -src_ibnetdiscover_SOURCES = src/ibnetdiscover.c src/grouping.c > +src_ibnetdiscover_SOURCES = src/ibnetdiscover.c > +src_ibnetdiscover_LDFLAGS = -L$(top_srcdir)/libibnetdisc -libnetdisc As with previous patch 'top_builddir' should be used here. > src_ibping_SOURCES = src/ibping.c > src_ibportstate_SOURCES = src/ibportstate.c > src_ibroute_SOURCES = src/ibroute.c > @@ -57,7 +58,7 @@ src_ibsendtrap_SOURCES = src/ibsendtrap.c > src_vendstat_SOURCES = src/vendstat.c > src_mcm_rereg_test_SOURCES = src/mcm_rereg_test.c > src_iblinkinfo_SOURCES = src/iblinkinfo.c > -src_iblinkinfo_LDADD = -libnetdisc > +src_iblinkinfo_LDFLAGS = -L$(top_srcdir)/libibnetdisc -libnetdisc BTW what is the reason change LDADD to LDFLAGS? 
> diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c > index bf7c2a7..479bae7 100644 > --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c > +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c > @@ -150,6 +150,9 @@ query_node(struct ibnd_fabric *fabric, struct ibnd_node *inode, > return -1; > decode_port_info(port); > > + port->base_lid = node->smalid; /* LID is still defined by port 0 */ > + port->lmc = node->smalmc; > + > if (!smp_query_via(node->switchinfo, portid, IB_ATTR_SWITCH_INFO, 0, timeout_ms, > fabric->ibmad_port)) > node->smaenhsp0 = 0; /* assume base SP0 */ > @@ -167,7 +170,7 @@ add_port_to_dpath(ib_dr_path_t *path, int nextport) > if (path->cnt+2 >= sizeof(path->p)) > return -1; > ++path->cnt; > - path->p[path->cnt] = nextport; > + path->p[path->cnt] = (uint8_t) nextport; > return path->cnt; > } > > diff --git a/infiniband-diags/man/ibnetdiscover.8 b/infiniband-diags/man/ibnetdiscover.8 > index 958efa9..768d392 100644 > --- a/infiniband-diags/man/ibnetdiscover.8 > +++ b/infiniband-diags/man/ibnetdiscover.8 > @@ -5,7 +5,7 @@ ibnetdiscover \- discover InfiniBand topology > > .SH SYNOPSIS > .B ibnetdiscover > -[\-d(ebug)] [\-e(rr_show)] [\-v(erbose)] [\-s(how)] [\-l(ist)] [\-g(rouping)] [\-H(ca_list)] [\-S(witch_list)] [\-R(outer_list)] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms] [\-V(ersion)] [\--node-name-map ] [\-p(orts)] [\-h(elp)] [] > +[\-d(ebug)] [\-s(how)] [\-l(ist)] [\-g(rouping)] [\-H(ca_list)] [\-S(witch_list)] [\-R(outer_list)] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms] [\-V(ersion)] [\--node-name-map ] [\-p(orts)] [\-h(elp)] [] > > .SH DESCRIPTION > .PP > @@ -37,7 +37,7 @@ List of connected switches > List of connected routers > .TP > \fB\-s\fR, \fB\-\-show\fR > -Show more information > +Show progress information during discovery. > .TP > \fB\-\-node\-name\-map\fR > Specify a node name map. 
The node name map file maps GUIDs to more user friendly > @@ -57,15 +57,9 @@ using the util_name -h syntax. > # Debugging flags > .PP > \-d raise the IB debugging level. > - May be used several times (-ddd or -d -d -d). > -.PP > -\-e show send and receive errors (timeouts and others) > .PP > \-h show the usage message > .PP > -\-v increase the application verbosity level. > - May be used several times (-vv or -v -v -v) > -.PP > \-V show the version info. Those options are used actually. Why should it be removed from man page? Just typo? Sasha From sashak at voltaire.com Thu Apr 23 01:32:34 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Thu, 23 Apr 2009 11:32:34 +0300 Subject: ***SPAM*** Not a spam [was: Re: [ofa-general] ***SPAM*** Re: [PATCH v3 2/3] Convert iblinkinfo.pl to C and use new ibnetdisc library.] Message-ID: <20090423083234.GE8281@sk> Ugh... I'm adding those lines in my .procmailrc: :0 f * ^Subject:.* \*\*\*SPAM\*\*\* | sed -e '/^Subject:/s/\*\*\*SPAM\*\*\* //' No more ***SPAM***s in my mail box. Sasha From tziporet at dev.mellanox.co.il Thu Apr 23 01:41:07 2009 From: tziporet at dev.mellanox.co.il (Tziporet Koren) Date: Thu, 23 Apr 2009 11:41:07 +0300 Subject: [ofa-general] Registering physical memory region In-Reply-To: <49EF0B09.2000804@tx.technion.ac.il> References: <49EF0B09.2000804@tx.technion.ac.il> Message-ID: <49F029A3.7070803@mellanox.co.il> Leonid Azriel wrote: > Hi, > > Is there a way to register physical memory with HCA from a user > application. Tried mmap to map it to the virtual memory, but > ibv_reg_mr fails with bad address. The memory region is physically > located in the IO (PCI) space. > Please advise. 
> Registration of physical memory is not enabled from user space, only from kernel Tziporet From vlad at lists.openfabrics.org Thu Apr 23 03:23:14 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 23 Apr 2009 03:23:14 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090423-0200 daily build status Message-ID: <20090423102314.5DF8FE6118D@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with 
linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From celine.bourde at ext.bull.net Thu Apr 23 03:48:56 2009 From: celine.bourde at ext.bull.net (Celine Bourde) Date: Thu, 23 Apr 2009 12:48:56 +0200 Subject: [Fwd: Re: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition]] In-Reply-To: <49F02280.7010005@ext.bull.net> References: <49F02280.7010005@ext.bull.net> Message-ID: <49F04798.1090006@ext.bull.net> There was a mistake in my last email. My last nfs mount test with no rdma was not correct, I've retried with [root at my_host ] #mount -o rw 192.168.0.215:/vol0 /mnt/ My nfs partition is mounted and everything is correct, which confirms that the problem comes from rdma connection manager. Céline. Celine Bourde wrote: > Hi, > > > I've updated nfs-utils package: > > [root at my_host ~]# mount.nfs -V > > mount.nfs (linux nfs-utils 1.1.6) > > >> [root at my_host ~]# strace mount.nfs >> 192.168.0.215:/vol0 /mnt/ -o rdma,port=2050 >> >> Does it work without rdma? >> > > The problem is exactly the same without rdma: > > > [root at my_host ~]# strace mount.nfs 192.168.0.215:/vol0 /mnt/ -o > rw,port=2050 > > [..] 
> > socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 3 > > bind(3, {sa_family=AF_INET, sin_port=htons(0), > sin_addr=inet_addr("0.0.0.0")}, 16) = 0 > > fcntl(3, F_GETFL) = 0x2 (flags O_RDWR) > > fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 > > connect(3, {sa_family=AF_INET, sin_port=htons(997), > sin_addr=inet_addr("127.0.0.1")}, 16) = 0 > > fcntl(3, F_SETFL, O_RDWR) = 0 > > connect(3, {sa_family=AF_UNSPEC, > sa_data="\0o\177\0\0\1\0\0\0\0\0\0\0\0"}, 16) = 0 > > sendto(3, > "\0308\310\272\0\0\0\0\0\0\0\2\0\1\206\270\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0"..., > 40, 0, {sa_family=AF_INET, sin_port=htons(997), > sin_addr=inet_addr("127.0.0.1")}, 16) = 40 > > poll([{fd=3, events=POLLIN}], 1, 3000) = 1 ([{fd=3, revents=POLLIN}]) > > recvfrom(3, "\0308\310\272\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", > 400, MSG_DONTWAIT, {sa_family=AF_INET, sin_port=htons(997), > sin_addr=inet_addr("127.0.0.1")}, [16]) = 24 > > close(3) = 0 > > mount("192.168.0.215:/vol0", "/mnt", "nfs", 0, > "port=2050,addr=192.168.0.215" > > > .. and it blocks ! > > > [root at my_host ~]# dmesg > > rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 > ird 16 > > rpcrdma: connection to 192.168.0.215:2050 closed (-103) > > rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 > ird 16 > > rpcrdma: connection to 192.168.0.215:2050 closed (-103) > > >> Celine, can you establish regular rdma >> connections over your IB link? Like rping? >> > > I can't etablish rdma connection, following errors occur with rping : > > > [root at my_host ~]# rping -c 192.168.0.214 > > cq completion failed status 5 > > wait for CONNECTED state 10 > > connect error -1 > > cma event RDMA_CM_EVENT_REJECTED, error 8 > > > My kernel is a 2.6.27 kernel.org build on a Red Hat Distribution : Red > Hat Enterprise Linux Server release 5.3 Beta (Tikanga). > > > Céline. 
> > > _______________________________________________ > > general mailing list > > general at lists.openfabrics.org > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > From monis at Voltaire.COM Thu Apr 23 05:06:32 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Thu, 23 Apr 2009 15:06:32 +0300 Subject: [ofa-general] [PATCH] rdma_cm: Add debugfs entries to monitor rdma_cm connections In-Reply-To: References: <49E71FE1.90102@Voltaire.COM> <49EB11E5.5000407@Voltaire.COM> Message-ID: <49F059C8.5070009@Voltaire.COM> > I was able to figure these out by looking at the code, but if I look at the > output of netstat, the headings and values are easy to interpret without needing > to refer to source code. > OK. I'll resend a version that is human friendly. thanks MoniS > - Sean From monis at Voltaire.COM Thu Apr 23 05:10:22 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Thu, 23 Apr 2009 15:10:22 +0300 Subject: [ofa-general] [PATCH v2] rdma_cm: Add debugfs entries to monitor rdma_cm connections Message-ID: <49F05AAE.4020606@Voltaire.COM> Create a virtual file under debugfs for each cma device and use it to print information about each rdma_id that is attached to this device. Here is an example of 'cat /sys/kernel/debug/rdma_cm/mthca0_rdma_id'. This example is for a host that runs an rping server (when a remote client is connected to it) and an rping client to a remote server.

TYPE DEVICE  PORT NET_DEV SRC_ADDR            DST_ADDR            SPACE STATE   QP_NUM
     mthca0  0            0.0.0.0:7174                            TCP   LISTEN  0
IB   mthca0  1    ib0     192.30.3.249:46079  192.30.3.248:7174   TCP   CONNECT 132102
IB   mthca0  1    ib0     192.30.3.249:7174   192.30.3.248:42561  TCP   CONNECT 132103

Signed-off-by: Moni Shoua -- drivers/infiniband/core/cma.c | 206 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 206 insertions(+) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 2a2e508..0288cad 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -51,6 +51,9 @@ #include #include +#include +#include + MODULE_AUTHOR("Sean Hefty"); MODULE_DESCRIPTION("Generic RDMA CM Agent"); MODULE_LICENSE("Dual BSD/GPL"); @@ -59,6 +62,10 @@ MODULE_LICENSE("Dual BSD/GPL"); #define CMA_MAX_CM_RETRIES 15 #define CMA_CM_MRA_SETTING (IB_CM_MRA_FLAG_DELAY | 24) +#define CASE_RET(val, ret) case val: return #ret; + +static struct dentry *cma_root_dentry; + static void cma_add_one(struct ib_device *device); static void cma_remove_one(struct ib_device *device); @@ -86,6 +93,7 @@ struct cma_device { struct completion comp; atomic_t refcount; struct list_head id_list; + struct dentry *rdma_id_dentry; }; enum cma_state { @@ -102,6 +110,49 @@ enum cma_state { CMA_DESTROYING }; +static const char *format_cma_state(enum cma_state s) +{ + switch (s) { + CASE_RET(CMA_IDLE, IDLE); + CASE_RET(CMA_ADDR_QUERY, ADDR_QUERY); + CASE_RET(CMA_ADDR_RESOLVED, ADDR_RESOLVED); + CASE_RET(CMA_ROUTE_QUERY, ROUTE_QUERY); + CASE_RET(CMA_ROUTE_RESOLVED, ROUTE_RESOLVED); + CASE_RET(CMA_CONNECT, CONNECT); + CASE_RET(CMA_DISCONNECT, DISCONNECT); + CASE_RET(CMA_ADDR_BOUND, ADDR_BOUND); + CASE_RET(CMA_LISTEN, LISTEN); + CASE_RET(CMA_DEVICE_REMOVAL, DEVICE_REMOVAL); + CASE_RET(CMA_DESTROYING, DESTROYING); + } + return ""; +} + +static const char *format_port_space(enum rdma_port_space ps) +{ + switch (ps) { + CASE_RET(RDMA_PS_SDP, SDP); + CASE_RET(RDMA_PS_IPOIB, IPOIB); + CASE_RET(RDMA_PS_TCP, TCP); +
CASE_RET(RDMA_PS_UDP, UDP); + CASE_RET(RDMA_PS_SCTP, SCTP); + } + return ""; +} + +static const char *format_node_type(enum rdma_node_type nt) +{ + enum rdma_transport_type tt; + if (nt) { + tt = rdma_node_get_transport(nt); + switch (tt) { + CASE_RET(RDMA_TRANSPORT_IB, IB); + CASE_RET(RDMA_TRANSPORT_IWARP, IW); + } + } + return ""; +} + struct rdma_bind_list { struct idr *ps; struct hlist_head owners; @@ -2850,6 +2901,150 @@ static struct notifier_block cma_nb = { .notifier_call = cma_netdev_callback }; +static void *cma_rdma_id_seq_start(struct seq_file *file, loff_t *pos) +{ + struct cma_device *cma_dev = file->private; + void *ret; + + mutex_lock(&lock); + if (*pos == 0) + return SEQ_START_TOKEN; + ret = seq_list_start_head(&cma_dev->id_list, *pos); + return ret; +} + +static void *cma_rdma_id_seq_next(struct seq_file *file, void *v, loff_t *pos) +{ + void *ret; + struct cma_device *cma_dev = file->private; + if (v == SEQ_START_TOKEN) { + ++*pos; + if (!list_empty(&cma_dev->id_list)) + ret = cma_dev->id_list.next; + else + ret = NULL; + } else { + ret = seq_list_next(v, &cma_dev->id_list, pos); + } + return ret; +} + +static void cma_rdma_id_seq_stop(struct seq_file *file, void *iter_ptr) +{ + mutex_unlock(&lock); +} + +static void format_addr(struct sockaddr *sa, char* buf) +{ + switch (sa->sa_family) { + case AF_INET: { + struct sockaddr_in *sin = (struct sockaddr_in *)sa; + sprintf(buf, "%pI4:%u", &sin->sin_addr.s_addr, + be16_to_cpu(cma_port(sa))); + break; + } + case AF_INET6: { + struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)sa; + sprintf(buf, "%pI6:%u", &sin6->sin6_addr, + be16_to_cpu(cma_port(sa))); + break; + } + default: + buf[0] = 0; + } +} + +static int cma_rdma_id_seq_show(struct seq_file *file, void *v) +{ + struct rdma_id_private *id_priv; + char local_addr[64], remote_addr[64]; + + if (!v) + return 0; + if (v == SEQ_START_TOKEN) { + seq_printf(file, + "%-5s" + "%-8s" + "%-5s" + "%-8s" + "%-52s" + "%-52s" + "%-6s" + "%-15s" + "%-8s" + "\n", 
+ "TYPE", "DEVICE", "PORT", "NET_DEV", "SRC_ADDR", "DST_ADDR", "SPACE", "STATE", "QP_NUM"); + } else { + id_priv = list_entry(v, struct rdma_id_private, list); + format_addr((struct sockaddr *)&id_priv->id.route.addr.src_addr, + local_addr); + format_addr((struct sockaddr *)&id_priv->id.route.addr.dst_addr, + remote_addr); + + seq_printf(file, + "%-5s" + "%-8s" + "%-5d" + "%-8s" + "%-52s" + "%-52s" + "%-6s" + "%-15s" + "%-8d" + "\n", + format_node_type(id_priv->id.route.addr.dev_addr.dev_type), + (id_priv->id.device) ? id_priv->id.device->name : "", + id_priv->id.port_num, + (id_priv->id.route.addr.dev_addr.src_dev) ? id_priv->id.route.addr.dev_addr.src_dev->name : "", + local_addr, + remote_addr, + format_port_space(id_priv->id.ps), + format_cma_state(id_priv->state), + id_priv->qp_num); + } + return 0; +} + +static const struct seq_operations cma_rdma_id_seq_ops = { + .start = cma_rdma_id_seq_start, + .next = cma_rdma_id_seq_next, + .stop = cma_rdma_id_seq_stop, + .show = cma_rdma_id_seq_show, +}; + +static int cma_rdma_id_open(struct inode *inode, struct file *file) +{ + struct seq_file *seq; + int ret; + + ret = seq_open(file, &cma_rdma_id_seq_ops); + if (ret) + return ret; + + seq = file->private_data; + seq->private = inode->i_private; + + return 0; +} + +static const struct file_operations cma_rdma_id_fops = { + .owner = THIS_MODULE, + .open = cma_rdma_id_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release +}; + +void cma_create_debug_files(struct cma_device *cma_dev) +{ + char name[IB_DEVICE_NAME_MAX + sizeof "_rdma_id"]; + snprintf(name, sizeof name, "%s_rdma_id", cma_dev->device->name); + cma_dev->rdma_id_dentry = debugfs_create_file(name, S_IFREG | S_IRUGO, + cma_root_dentry, cma_dev, &cma_rdma_id_fops); + if (!cma_dev->rdma_id_dentry) + printk(KERN_WARNING "RDMA CMA: failed to create debugfs file %s\n", name); +} + static void cma_add_one(struct ib_device *device) { struct cma_device *cma_dev; @@ -2871,6 +3066,7 @@ static void 
cma_add_one(struct ib_device *device) list_for_each_entry(id_priv, &listen_any_list, list) cma_listen_on_dev(id_priv, cma_dev); mutex_unlock(&lock); + cma_create_debug_files(cma_dev); } static int cma_remove_id_dev(struct rdma_id_private *id_priv) @@ -2905,6 +3101,8 @@ static void cma_process_remove(struct cma_device *cma_dev) int ret; mutex_lock(&lock); + if (cma_dev->rdma_id_dentry) + debugfs_remove(cma_dev->rdma_id_dentry); while (!list_empty(&cma_dev->id_list)) { id_priv = list_entry(cma_dev->id_list.next, struct rdma_id_private, list); @@ -2940,6 +3138,7 @@ static void cma_remove_one(struct ib_device *device) mutex_unlock(&lock); cma_process_remove(cma_dev); + kfree(cma_dev); } @@ -2947,6 +3146,12 @@ static int cma_init(void) { int ret, low, high, remaining; + cma_root_dentry = debugfs_create_dir("rdma_cm", NULL); + if (!cma_root_dentry) { + printk(KERN_ERR "RDMA CMA: failed to create debugfs dir\n"); + return -ENOMEM; + } + get_random_bytes(&next_port, sizeof next_port); inet_get_local_port_range(&low, &high); remaining = (high - low) + 1; @@ -2984,6 +3189,7 @@ static void cma_cleanup(void) idr_destroy(&tcp_ps); idr_destroy(&udp_ps); idr_destroy(&ipoib_ps); + debugfs_remove(cma_root_dentry); } module_init(cma_init); From swise at opengridcomputing.com Thu Apr 23 07:11:28 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 23 Apr 2009 09:11:28 -0500 Subject: [Fwd: Re: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition]] In-Reply-To: <49F02280.7010005@ext.bull.net> References: <49F02280.7010005@ext.bull.net> Message-ID: <49F07710.3070002@opengridcomputing.com> Celine Bourde wrote: > Hi, > > I've updated nfs-utils package: > [root at my_host ~]# mount.nfs -V > mount.nfs (linux nfs-utils 1.1.6) > >> [root at my_host ~]# strace mount.nfs 192.168.0.215:/vol0 /mnt/ -o >> rdma,port=2050 >> Does it work without rdma? 
> > The problem is exactly the same without rdma: > > [root at my_host ~]# strace mount.nfs 192.168.0.215:/vol0 /mnt/ -o > rw,port=2050 > [..] You cannot use port 2050 for tcp mounts. So remove the 'port=2050' and it will attempt a tcp mount to port 2049. Steve. From swise at opengridcomputing.com Thu Apr 23 07:14:31 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 23 Apr 2009 09:14:31 -0500 Subject: [Fwd: Re: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition]] In-Reply-To: <49F04798.1090006@ext.bull.net> References: <49F02280.7010005@ext.bull.net> <49F04798.1090006@ext.bull.net> Message-ID: <49F077C7.8030606@opengridcomputing.com> Celine Bourde wrote: > There was a mistake in my last email. > My last nfs mount test with no rdma was not correct, I've retried with > > [root at my_host ] #mount -o rw 192.168.0.215:/vol0 /mnt/ > > My nfs partition is mounted and everything is correct, which confirms > that the problem comes from rdma connection manager. > > Céline. > Can you run rping or one of the perf programs over the IB link? This will confirm that your IB setup works. Also, on the nfs server, what is the output of 'cat /proc/fs/nfsd/portlist'? > > Celine Bourde wrote: > >> Hi, >> >> I've updated nfs-utils package: >> [root at my_host ~]# mount.nfs -V >> mount.nfs (linux nfs-utils 1.1.6) >> >>> [root at my_host ~]# strace mount.nfs >>> 192.168.0.215:/vol0 /mnt/ -o rdma,port=2050 >>> Does it work without rdma? >>> >> >> The problem is exactly the same without rdma: >> >> [root at my_host ~]# strace mount.nfs 192.168.0.215:/vol0 /mnt/ -o >> rw,port=2050 >> [..] 
>> socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 3 >> bind(3, {sa_family=AF_INET, sin_port=htons(0), >> sin_addr=inet_addr("0.0.0.0")}, 16) = 0 >> fcntl(3, F_GETFL) = 0x2 (flags O_RDWR) >> fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 >> connect(3, {sa_family=AF_INET, sin_port=htons(997), >> sin_addr=inet_addr("127.0.0.1")}, 16) = 0 >> fcntl(3, F_SETFL, O_RDWR) = 0 >> connect(3, {sa_family=AF_UNSPEC, >> sa_data="\0o\177\0\0\1\0\0\0\0\0\0\0\0"}, 16) = 0 >> sendto(3, >> "\0308\310\272\0\0\0\0\0\0\0\2\0\1\206\270\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0"..., >> >> 40, 0, {sa_family=AF_INET, sin_port=htons(997), >> sin_addr=inet_addr("127.0.0.1")}, 16) = 40 >> poll([{fd=3, events=POLLIN}], 1, 3000) = 1 ([{fd=3, revents=POLLIN}]) >> recvfrom(3, "\0308\310\272\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", >> 400, MSG_DONTWAIT, {sa_family=AF_INET, sin_port=htons(997), >> sin_addr=inet_addr("127.0.0.1")}, [16]) = 24 >> close(3) = 0 >> mount("192.168.0.215:/vol0", "/mnt", "nfs", 0, >> "port=2050,addr=192.168.0.215" >> >> .. and it blocks ! >> >> [root at my_host ~]# dmesg >> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 >> ird 16 >> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 >> ird 16 >> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >> >>> Celine, can you establish regular rdma >>> connections over your IB link? Like rping? >>> >> >> I can't etablish rdma connection, following errors occur with rping : >> >> [root at my_host ~]# rping -c 192.168.0.214 >> cq completion failed status 5 >> wait for CONNECTED state 10 >> connect error -1 >> cma event RDMA_CM_EVENT_REJECTED, error 8 >> >> My kernel is a 2.6.27 kernel.org build on a Red Hat Distribution : Red >> Hat Enterprise Linux Server release 5.3 Beta (Tikanga). >> Céline. 
>>
>> _______________________________________________
>> general mailing list
>> general at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>
>> To unsubscribe, please visit
>> http://openib.org/mailman/listinfo/openib-general
>>

From Zhen.Liang at Sun.COM Thu Apr 23 08:13:51 2009
From: Zhen.Liang at Sun.COM (Liang Zhen)
Date: Thu, 23 Apr 2009 23:13:51 +0800
Subject: [ofa-general] OFED1.3.1: soft lockup in completion handler
In-Reply-To: <20090421152536.GA11622@mtls03>
References: <49ED38A9.7040208@sun.com> <20090421152536.GA11622@mtls03>
Message-ID: <49F085AF.1030205@sun.com>

Eli Cohen wrote:
> Where is the other place that you acquire conn->ibc_sched->ibs_lock?
> Is it in the per CPU thread? Maybe you should try to decrease the time
> when the lock is acquired at the thread. Can you send all references
> to the code acquiring the lock?
>

Eli,

Yes, it's a per-CPU lock (for the CPU-affinity thread), so the completion callback can only race with one thread at a time, and I'm very sure the thread only does very light operations while holding the lock.

I actually already know the reason: most of the time, interrupts prefer to land on the same CPU, so the per-CPU watchdog thread never gets a chance to run if there are a lot of interrupts on that CPU, e.g. 100K/sec. We do have irqbalance, but it only wakes up every 10 seconds (it's a pity that we can't change the interval; it's a hard-coded constant), which is long enough to trigger the soft lockup warning.

So I added a static counter to the completion handler, and when we find there have been too many interrupts for several seconds, we just call touch_softlockup_watchdog() to tell the watchdog that the CPU is OK and not soft locked up. Also, we reserve the first core on each CPU socket (no affinity thread bound to it), to make sure there are enough cores to handle interrupts.

I know it's ugly, but it works, and it's the only way I can find to resolve my problem if there are no multiple completion vectors.
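
The counter-based workaround described above could look roughly like the sketch below. It is a user-space model only, not the actual handler: the struct, the function name and both thresholds are assumptions made up for illustration, and in a kernel of that era the caller would invoke touch_softlockup_watchdog() whenever the function returns true.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tuning knobs, not taken from the original code. */
#define BUSY_EVENTS_PER_SEC 100000u	/* rate above which a second counts as busy */
#define BUSY_SECONDS_LIMIT  3u		/* consecutive busy seconds before touching */

struct irq_load_tracker {
	uint64_t window_start_sec;	/* second in which the current window began */
	uint32_t events_in_window;	/* completions seen in the current window */
	uint32_t busy_seconds;		/* consecutive windows above the threshold */
};

/*
 * Record one completion event seen at time now_sec (whole seconds).
 * Returns true when the caller should reassure the watchdog.
 */
static bool tracker_on_completion(struct irq_load_tracker *t, uint64_t now_sec)
{
	if (now_sec != t->window_start_sec) {
		/* The previous one-second window is over: classify it. */
		bool contiguous = (now_sec == t->window_start_sec + 1);
		bool busy = t->events_in_window >= BUSY_EVENTS_PER_SEC;

		t->busy_seconds = (busy && contiguous) ? t->busy_seconds + 1 : 0;
		t->window_start_sec = now_sec;
		t->events_in_window = 0;
	}
	t->events_in_window++;

	if (t->busy_seconds >= BUSY_SECONDS_LIMIT) {
		t->busy_seconds = 0;	/* start counting again after a touch */
		return true;
	}
	return false;
}
```

Counting per one-second window keeps the per-completion cost to a compare and an increment, which matters at interrupt rates around 100K/sec.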
Anyway, I really think multiple completion vectors will be an important feature in the near future, because our hardware keeps getting faster and machines have more and more CPU cores.

Thanks
Liang

>
>
>> I've tried to turn off irqbalancer and set /proc/irq/.../smp_affinity
>> for more cores, but changed nothing and still soft lockup.
>>
>> After I installed ofed1.4.1 and create CQ with
>> ib_create_cq(....comp_vector), the problem is gone and get really good
>> performance. The problem now is, seems ofed1.4.1:mlx4 is the only driver
>> can really support multiple completion vectors, but we can't expect all
>> customers to have the same environment...
>> Is there only other possible way to resolve this?
>>
>> Thanks
>> Liang
>>

From roel.kluin at gmail.com Thu Apr 23 08:50:50 2009
From: roel.kluin at gmail.com (Roel Kluin)
Date: Thu, 23 Apr 2009 17:50:50 +0200
Subject: [ofa-general] ***SPAM*** mlx4_ib_post_send(): incorrect test on wr->opcode?
Message-ID: <49F08E5A.7060808@gmail.com>

// vi drivers/infiniband/hw/mlx4/qp.c +1523
int mlx4_ib_post_send(..., struct ib_send_wr *wr, ...)
{
	...
	if (wr->opcode < 0 || wr->opcode >= ARRAY_SIZE(mlx4_ib_opcode)) {
		err = -EINVAL;
		goto out;
	}

	ctrl->owner_opcode = mlx4_ib_opcode[wr->opcode] |
		(ind & qp->sq.wqe_cnt ? cpu_to_be32(1 << 31) : 0);
	...
}

wr->opcode cannot be less than 0, can it? But note below that
IB_WR_RDMA_READ_WITH_INV is missing from mlx4_ib_opcode, so shouldn't this be:

	if (wr->opcode > IB_WR_FAST_REG_MR) {
		err = -EINVAL;
		goto out;
	}

// vi include/rdma/ib_verbs.h +714
struct ib_send_wr {
	...
	enum ib_wr_opcode opcode;
	...
} // vi include/rdma/ib_verbs.h +679 enum ib_wr_opcode { IB_WR_RDMA_WRITE, IB_WR_RDMA_WRITE_WITH_IMM, IB_WR_SEND, IB_WR_SEND_WITH_IMM, IB_WR_RDMA_READ, IB_WR_ATOMIC_CMP_AND_SWP, IB_WR_ATOMIC_FETCH_AND_ADD, IB_WR_LSO, IB_WR_SEND_WITH_INV, IB_WR_RDMA_READ_WITH_INV, IB_WR_LOCAL_INV, IB_WR_FAST_REG_MR, }; // vi drivers/infiniband/hw/mlx4/qp.c +72 static const __be32 mlx4_ib_opcode[] = { [IB_WR_SEND] = cpu_to_be32(MLX4_OPCODE_SEND), [IB_WR_LSO] = cpu_to_be32(MLX4_OPCODE_LSO), [IB_WR_SEND_WITH_IMM] = cpu_to_be32(MLX4_OPCODE_SEND_IMM), [IB_WR_RDMA_WRITE] = cpu_to_be32(MLX4_OPCODE_RDMA_WRITE), [IB_WR_RDMA_WRITE_WITH_IMM] = cpu_to_be32(MLX4_OPCODE_RDMA_WRITE_IMM), [IB_WR_RDMA_READ] = cpu_to_be32(MLX4_OPCODE_RDMA_READ), [IB_WR_ATOMIC_CMP_AND_SWP] = cpu_to_be32(MLX4_OPCODE_ATOMIC_CS), [IB_WR_ATOMIC_FETCH_AND_ADD] = cpu_to_be32(MLX4_OPCODE_ATOMIC_FA), [IB_WR_SEND_WITH_INV] = cpu_to_be32(MLX4_OPCODE_SEND_INVAL), [IB_WR_LOCAL_INV] = cpu_to_be32(MLX4_OPCODE_LOCAL_INVAL), [IB_WR_FAST_REG_MR] = cpu_to_be32(MLX4_OPCODE_FMR), }; From weiny2 at llnl.gov Thu Apr 23 09:08:19 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 23 Apr 2009 09:08:19 -0700 Subject: [ofa-general] Re: [PATCH v3 2/3] Convert iblinkinfo.pl to C and use new ibnetdisc library. In-Reply-To: <20090423071542.GC8281@sk> References: <20090403154254.5ab60589.weiny2@llnl.gov> <20090423071542.GC8281@sk> Message-ID: <20090423090819.2588d8e6.weiny2@llnl.gov> On Thu, 23 Apr 2009 10:15:42 +0300 Sasha Khapyorsky wrote: > On 15:42 Fri 03 Apr , Ira Weiny wrote: > > From a677ae35fe7a5966f05b5859df8f00e9b18df864 Mon Sep 17 00:00:00 2001 > > You know :) > > > From: Ira Weiny > > Date: Fri, 3 Apr 2009 15:28:18 -0700 > > Subject: [PATCH] Convert iblinkinfo.pl to C and use new ibnetdisc library. > > > > Signed-off-by: Ira Weiny > > Applied. Thanks. Couple of notes are below. 
> > > --- > > infiniband-diags/Makefile.am | 7 +- > > infiniband-diags/configure.in | 1 + > > infiniband-diags/scripts/iblinkinfo.pl | 327 ------------------------ > > infiniband-diags/scripts/iblinkinfo.pl.in | 40 +++ > > infiniband-diags/src/iblinkinfo.c | 386 +++++++++++++++++++++++++++++ > > 5 files changed, 432 insertions(+), 329 deletions(-) > > delete mode 100755 infiniband-diags/scripts/iblinkinfo.pl > > create mode 100755 infiniband-diags/scripts/iblinkinfo.pl.in > > create mode 100644 infiniband-diags/src/iblinkinfo.c > > > > diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am > > index 7b8523a..b480a4a 100644 > > --- a/infiniband-diags/Makefile.am > > +++ b/infiniband-diags/Makefile.am > > @@ -1,6 +1,7 @@ > > SUBDIRS = libibnetdisc > > > > -INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband > > +INCLUDES = -I$(top_builddir)/include/ -I$(srcdir)/include -I$(includedir) -I$(includedir)/infiniband \ > > + -I$(top_builddir)/libibnetdisc/include > > Here $(top_srcdir) should be used instead of $(top_builddir). > 'top_builddir' points project build directory, not source directory > (where not generated header files are located). And build like: > > mkdir /tmp/ib-diags && cd /tmp/ib-diags > /path/to/management/infiniband-diags/configure && make > > will fail. > > I'm fixing this. yep, thanks. 
> > > if DEBUG > > DBGFLAGS = -ggdb -D_DEBUG_ > > @@ -11,7 +12,7 @@ endif > > sbin_PROGRAMS = src/ibaddr src/ibnetdiscover src/ibping src/ibportstate \ > > src/ibroute src/ibstat src/ibsysstat src/ibtracert \ > > src/perfquery src/sminfo src/smpdump src/smpquery \ > > - src/saquery src/vendstat > > + src/saquery src/vendstat src/iblinkinfo > > > > if ENABLE_TEST_UTILS > > sbin_PROGRAMS += src/ibsendtrap src/mcm_rereg_test > > @@ -55,6 +56,8 @@ src_saquery_SOURCES = src/saquery.c > > src_ibsendtrap_SOURCES = src/ibsendtrap.c > > src_vendstat_SOURCES = src/vendstat.c > > src_mcm_rereg_test_SOURCES = src/mcm_rereg_test.c > > +src_iblinkinfo_SOURCES = src/iblinkinfo.c > > +src_iblinkinfo_LDADD = -libnetdisc > > I think here should be '-L$(top_builddir)/libibnetdisc -libnetdisc'. > Otherwise we are assuming pre-installed library (we could run > 'make install', but it fails due to 'make' error :)). > > Adding this too. Oops... Sorry, that got put into the "convert ibnetdiscover" patch. But I used $(top_srcdir) I can change it. 
See below: --- a/infiniband-diags/Makefile.am +++ b/infiniband-diags/Makefile.am @@ -41,7 +41,8 @@ LDADD = libcommon.a libcommon_a_SOURCES = src/ibdiag_common.c src_ibaddr_SOURCES = src/ibaddr.c -src_ibnetdiscover_SOURCES = src/ibnetdiscover.c src/grouping.c +src_ibnetdiscover_SOURCES = src/ibnetdiscover.c +src_ibnetdiscover_LDFLAGS = -L$(top_srcdir)/libibnetdisc -libnetdisc src_ibping_SOURCES = src/ibping.c src_ibportstate_SOURCES = src/ibportstate.c src_ibroute_SOURCES = src/ibroute.c @@ -57,7 +58,7 @@ src_ibsendtrap_SOURCES = src/ibsendtrap.c src_vendstat_SOURCES = src/vendstat.c src_mcm_rereg_test_SOURCES = src/mcm_rereg_test.c src_iblinkinfo_SOURCES = src/iblinkinfo.c -src_iblinkinfo_LDADD = -libnetdisc +src_iblinkinfo_LDFLAGS = -L$(top_srcdir)/libibnetdisc -libnetdisc man_MANS = man/ibaddr.8 man/ibcheckerrors.8 man/ibcheckerrs.8 \ man/ibchecknet.8 man/ibchecknode.8 man/ibcheckport.8 \ [snip] > > + > > + static char const str_opts[] = "S:D:n:C:P:t:sldgphuf:R"; > > + static const struct option long_opts[] = { > > + { "S", 1, 0, 'S'}, > > + { "D", 1, 0, 'D'}, > > + { "num-hops", 1, 0, 'n'}, > > + { "down-links-only", 0, 0, 'd'}, > > + { "line-mode", 0, 0, 'l'}, > > + { "ca-name", 1, 0, 'C'}, > > + { "ca-port", 1, 0, 'P'}, > > + { "timeout", 1, 0, 't'}, > > + { "show", 0, 0, 's'}, > > + { "print-port-guids", 0, 0, 'g'}, > > + { "print-additional", 0, 0, 'p'}, > > + { "help", 0, 0, 'h'}, > > + { "usage", 0, 0, 'u'}, > > + { "node-name-map", 1, 0, 1}, > > + { "debug", 0, 0, 2}, > > + { "compat", 0, 0, 3}, > > + { "from", 1, 0, 'f'}, > > + { "R", 0, 0, 'R'}, > > + { } > > + }; > > + > > + f = stdout; > > + > > + argv0 = argv[0]; > > + > > + while (1) { > > + int ch = getopt_long(argc, argv, str_opts, long_opts, NULL); > > Any reason to not use new ibdiag_process_opts() here? It should simplify > an options processing and unify the usage message. > > Of course this can be done as subsequent patch. 
>

The only reason was that I started this patch before ibdiag_process_opts was created! I did use ibdiag_process_opts for ibqueryerrors. And yes, I will be cleaning this up in subsequent patches.

Ira

From weiny2 at llnl.gov Thu Apr 23 10:02:06 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 23 Apr 2009 10:02:06 -0700
Subject: [ofa-general] Re: [PATCH v3 3/3] Convert ibnetdiscover to use new ibnetdisc library.
In-Reply-To: <20090423082535.GD8281@sk>
References: <20090403154301.f656e7a4.weiny2@llnl.gov> <20090423082535.GD8281@sk>
Message-ID: <20090423100206.c2621310.weiny2@llnl.gov>

On Thu, 23 Apr 2009 11:25:35 +0300 Sasha Khapyorsky wrote:

> On 15:43 Fri 03 Apr , Ira Weiny wrote:
> > From e506ac4d6accefb49b89811cc9dd77775ad481f7 Mon Sep 17 00:00:00 2001
> > From: Ira Weiny
> > Date: Fri, 3 Apr 2009 15:28:29 -0700
> > Subject: [PATCH] Convert ibnetdiscover to use new ibnetdisc library.
> >
> > All other functionality is preserved
>
> And what about '-v' and '-e' options? Why is it removed from man page?

New patch on its way.
> > > Signed-off-by: Ira Weiny > > --- > > infiniband-diags/Makefile.am | 5 +- > > infiniband-diags/include/grouping.h | 113 --- > > infiniband-diags/libibnetdisc/src/chassis.c | 20 +- > > infiniband-diags/libibnetdisc/src/ibnetdisc.c | 5 +- > > infiniband-diags/man/ibnetdiscover.8 | 10 +- > > infiniband-diags/src/grouping.c | 785 -------------------- > > infiniband-diags/src/ibnetdiscover.c | 974 +++++++++---------------- > > 7 files changed, 345 insertions(+), 1567 deletions(-) > > delete mode 100644 infiniband-diags/include/grouping.h > > delete mode 100644 infiniband-diags/src/grouping.c > > > > diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am > > index b480a4a..19b992c 100644 > > --- a/infiniband-diags/Makefile.am > > +++ b/infiniband-diags/Makefile.am > > @@ -41,7 +41,8 @@ LDADD = libcommon.a > > > > libcommon_a_SOURCES = src/ibdiag_common.c > > src_ibaddr_SOURCES = src/ibaddr.c > > -src_ibnetdiscover_SOURCES = src/ibnetdiscover.c src/grouping.c > > +src_ibnetdiscover_SOURCES = src/ibnetdiscover.c > > +src_ibnetdiscover_LDFLAGS = -L$(top_srcdir)/libibnetdisc -libnetdisc > > As with previous patch 'top_builddir' should be used here. > > > src_ibping_SOURCES = src/ibping.c > > src_ibportstate_SOURCES = src/ibportstate.c > > src_ibroute_SOURCES = src/ibroute.c > > @@ -57,7 +58,7 @@ src_ibsendtrap_SOURCES = src/ibsendtrap.c > > src_vendstat_SOURCES = src/vendstat.c > > src_mcm_rereg_test_SOURCES = src/mcm_rereg_test.c > > src_iblinkinfo_SOURCES = src/iblinkinfo.c > > -src_iblinkinfo_LDADD = -libnetdisc > > +src_iblinkinfo_LDFLAGS = -L$(top_srcdir)/libibnetdisc -libnetdisc > > BTW what is the reason change LDADD to LDFLAGS? Somewhere along the line I broke this and then this got put into the ibnetdiscover patch. This should not even have been here. Anyway, LDFLAGS is required for the -L I believe? BTW I am using topgit now to manage all this as we work through these. I am going to blame this on me not knowing how to use topgit... 
;-) > > > diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c > > index bf7c2a7..479bae7 100644 > > --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c > > +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c > > @@ -150,6 +150,9 @@ query_node(struct ibnd_fabric *fabric, struct ibnd_node *inode, > > return -1; > > decode_port_info(port); > > > > + port->base_lid = node->smalid; /* LID is still defined by port 0 */ > > + port->lmc = node->smalmc; > > + > > if (!smp_query_via(node->switchinfo, portid, IB_ATTR_SWITCH_INFO, 0, timeout_ms, > > fabric->ibmad_port)) > > node->smaenhsp0 = 0; /* assume base SP0 */ > > @@ -167,7 +170,7 @@ add_port_to_dpath(ib_dr_path_t *path, int nextport) > > if (path->cnt+2 >= sizeof(path->p)) > > return -1; > > ++path->cnt; > > - path->p[path->cnt] = nextport; > > + path->p[path->cnt] = (uint8_t) nextport; > > return path->cnt; > > } > > > > diff --git a/infiniband-diags/man/ibnetdiscover.8 b/infiniband-diags/man/ibnetdiscover.8 > > index 958efa9..768d392 100644 > > --- a/infiniband-diags/man/ibnetdiscover.8 > > +++ b/infiniband-diags/man/ibnetdiscover.8 > > @@ -5,7 +5,7 @@ ibnetdiscover \- discover InfiniBand topology > > > > .SH SYNOPSIS > > .B ibnetdiscover > > -[\-d(ebug)] [\-e(rr_show)] [\-v(erbose)] [\-s(how)] [\-l(ist)] [\-g(rouping)] [\-H(ca_list)] [\-S(witch_list)] [\-R(outer_list)] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms] [\-V(ersion)] [\--node-name-map ] [\-p(orts)] [\-h(elp)] [] > > +[\-d(ebug)] [\-s(how)] [\-l(ist)] [\-g(rouping)] [\-H(ca_list)] [\-S(witch_list)] [\-R(outer_list)] [\-C ca_name] [\-P ca_port] [\-t(imeout) timeout_ms] [\-V(ersion)] [\--node-name-map ] [\-p(orts)] [\-h(elp)] [] > > > > .SH DESCRIPTION > > .PP > > @@ -37,7 +37,7 @@ List of connected switches > > List of connected routers > > .TP > > \fB\-s\fR, \fB\-\-show\fR > > -Show more information > > +Show progress information during discovery. 
> > .TP > > \fB\-\-node\-name\-map\fR > > Specify a node name map. The node name map file maps GUIDs to more user friendly > > @@ -57,15 +57,9 @@ using the util_name -h syntax. > > # Debugging flags > > .PP > > \-d raise the IB debugging level. > > - May be used several times (-ddd or -d -d -d). > > -.PP > > -\-e show send and receive errors (timeouts and others) > > .PP > > \-h show the usage message > > .PP > > -\-v increase the application verbosity level. > > - May be used several times (-vv or -v -v -v) > > -.PP > > \-V show the version info. > > Those options are used actually. Why should it be removed from man page? > Just typo? Yes when I put them back in I forgot the man page. New patch to follow. Ira From sean.hefty at intel.com Thu Apr 23 11:49:36 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 23 Apr 2009 11:49:36 -0700 Subject: [ofa-general] [PATCH v3 1/3] Create a new library libibnetdisc In-Reply-To: <20090423070210.GA8281@sk> References: <20090403154251.dec181f2.weiny2@llnl.gov> <20090423070210.GA8281@sk> Message-ID: >> Where does the definition for ibdebug come from? > >It is in ibdiag_common.c. Every infiniband-ibdiag tool is linked with >it. And yes, using this in this library can be problematic since >introduces a "hidden" dependency. How does that work? The library doesn't link ibdiag_common.c, so I'm not sure what definition it picks up. Maybe it defaults to undefined, assumed int... To get things to build and run on Windows, I defined it as a static in the library. 
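
The "static in the library" approach mentioned above can be pictured with the small sketch below. It is an illustration under assumptions, not the libibnetdisc source: example_set_debug() and example_debug_msg() are made-up names, and only the idea of a file-local ibdebug comes from the thread.

```c
#include <stdio.h>

/*
 * File-local debug level. Because it is defined (and static) inside the
 * library, linking no longer depends on the application providing a
 * global named ibdebug.
 */
static int ibdebug;

/* Hypothetical setter so that callers can still control the level. */
void example_set_debug(int level)
{
	ibdebug = level;
}

/* Print msg to stderr only when debugging is enabled; returns 1 if printed. */
int example_debug_msg(const char *msg)
{
	if (!ibdebug)
		return 0;
	fprintf(stderr, "DEBUG: %s\n", msg);
	return 1;
}
```

With an explicit setter the dependency is visible in the library's interface instead of being resolved silently at link time.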
From weiny2 at llnl.gov Thu Apr 23 13:30:48 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 23 Apr 2009 13:30:48 -0700
Subject: [ofa-general] Re: [PATCH v4 0/8] Convert ibnetdiscover to use new ibnetdisc library and friends
In-Reply-To: <20090423100206.c2621310.weiny2@llnl.gov>
References: <20090403154301.f656e7a4.weiny2@llnl.gov> <20090423082535.GD8281@sk> <20090423100206.c2621310.weiny2@llnl.gov>
Message-ID: <20090423133048.75ec241d.weiny2@llnl.gov>

Ok, here is a new series. Starting after converting iblinkinfo. I also removed the SHA1 ID for each... ;-) Sorry about that in the series I sent yesterday. :-(

I am only sending 3-8 because I did not see the following 2 patches in master even though you said you applied them:

1/8 [PATCH] Create a new library libibnetdisc
2/8 [PATCH] Convert iblinkinfo.pl to C and use new ibnetdisc library.

3-8 should apply after #2 above. If you want 1 and 2 again I can resend them.

Make sense?
Ira

From weiny2 at llnl.gov Thu Apr 23 13:30:53 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Thu, 23 Apr 2009 13:30:53 -0700
Subject: [ofa-general] [PATCH 3/8] clean up iblinkinfo conversion
Message-ID: <20090423133053.99d2992b.weiny2@llnl.gov>

From: Ira Weiny
Date: Thu, 23 Apr 2009 10:57:10 -0700
Subject: [PATCH] clean up iblinkinfo conversion

Clean up a comment
Fix potential bug

Signed-off-by: Ira Weiny
---
 infiniband-diags/Makefile.am              |    2 +-
 infiniband-diags/scripts/iblinkinfo.pl.in |    2 +-
 infiniband-diags/src/iblinkinfo.c         |   10 ++++++----
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/infiniband-diags/scripts/iblinkinfo.pl.in b/infiniband-diags/scripts/iblinkinfo.pl.in
index c81570d..0ce33ab 100755
--- a/infiniband-diags/scripts/iblinkinfo.pl.in
+++ b/infiniband-diags/scripts/iblinkinfo.pl.in
@@ -35,6 +35,6 @@
 #
-# this is not just a wrapper for the C based utility
+# this is now just a wrapper for the C based utility
 $str = join " ", @ARGV;
 exec "@IBSCRIPTPATH@/iblinkinfo $str";
diff --git
a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c index 1e43788..39de7a2 100644 --- a/infiniband-diags/src/iblinkinfo.c +++ b/infiniband-diags/src/iblinkinfo.c @@ -121,15 +121,17 @@ print_port(ibnd_node_t *node, ibnd_port_t *port) char width_msg[256]; char speed_msg[256]; char ext_port_str[256]; - int iwidth = mad_get_field(port->info, 0, IB_PORT_LINK_WIDTH_ACTIVE_F); - int ispeed = mad_get_field(port->info, 0, IB_PORT_LINK_SPEED_ACTIVE_F); - int istate = mad_get_field(port->info, 0, IB_PORT_STATE_F); - int iphystate = mad_get_field(port->info, 0, IB_PORT_PHYS_STATE_F); + int iwidth, ispeed, istate, iphystate; int n = 0; if (!port) return; + iwidth = mad_get_field(port->info, 0, IB_PORT_LINK_WIDTH_ACTIVE_F); + ispeed = mad_get_field(port->info, 0, IB_PORT_LINK_SPEED_ACTIVE_F); + istate = mad_get_field(port->info, 0, IB_PORT_STATE_F); + iphystate = mad_get_field(port->info, 0, IB_PORT_PHYS_STATE_F); + remote_guid_str[0] = '\0'; remote_str[0] = '\0'; link_str[0] = '\0'; -- 1.5.4.5 From weiny2 at llnl.gov Thu Apr 23 13:30:57 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 23 Apr 2009 13:30:57 -0700 Subject: [ofa-general] [PATCH v4 4/8] Convert ibnetdiscover to use new ibnetdisc library Message-ID: <20090423133057.9f5d36f9.weiny2@llnl.gov> From: Ira Weiny Date: Thu, 23 Apr 2009 10:57:10 -0700 Subject: [PATCH] Convert ibnetdiscover to use new ibnetdisc library All other functionality is preserved Signed-off-by: Ira Weiny --- infiniband-diags/Makefile.am | 3 +- infiniband-diags/include/grouping.h | 113 --- infiniband-diags/libibnetdisc/src/chassis.c | 20 +- infiniband-diags/libibnetdisc/src/ibnetdisc.c | 5 +- infiniband-diags/man/ibnetdiscover.8 | 2 +- infiniband-diags/src/grouping.c | 785 -------------------- infiniband-diags/src/ibnetdiscover.c | 974 +++++++++---------------- 7 files changed, 343 insertions(+), 1559 deletions(-) delete mode 100644 infiniband-diags/include/grouping.h delete mode 100644 infiniband-diags/src/grouping.c diff 
--git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am index 78efe7f..bebb35e 100644 --- a/infiniband-diags/Makefile.am +++ b/infiniband-diags/Makefile.am @@ -41,7 +41,8 @@ LDADD = libcommon.a libcommon_a_SOURCES = src/ibdiag_common.c src_ibaddr_SOURCES = src/ibaddr.c -src_ibnetdiscover_SOURCES = src/ibnetdiscover.c src/grouping.c +src_ibnetdiscover_SOURCES = src/ibnetdiscover.c +src_ibnetdiscover_LDFLAGS = -L$(top_builddir)/libibnetdisc -libnetdisc src_ibping_SOURCES = src/ibping.c src_ibportstate_SOURCES = src/ibportstate.c src_ibroute_SOURCES = src/ibroute.c diff --git a/infiniband-diags/include/grouping.h b/infiniband-diags/include/grouping.h deleted file mode 100644 index 811e372..0000000 --- a/infiniband-diags/include/grouping.h +++ /dev/null @@ -1,113 +0,0 @@ -/* - * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. - * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. - * - */ - -#ifndef _GROUPING_H_ -#define _GROUPING_H_ - -/*========================================================*/ -/* FABRIC SCANNER SPECIFIC DATA */ -/*========================================================*/ - -#define SPINES_MAX_NUM 12 -#define LINES_MAX_NUM 36 - -typedef struct ChassisList ChassisList; -typedef struct AllChassisList AllChassisList; - -struct ChassisList { - ChassisList *next; - uint64_t chassisguid; - unsigned char chassisnum; - unsigned char chassistype; - unsigned int nodecount; /* used for grouping by SystemImageGUID */ - Node *spinenode[SPINES_MAX_NUM + 1]; - Node *linenode[LINES_MAX_NUM + 1]; -}; - -struct AllChassisList { - ChassisList *first; - ChassisList *current; - ChassisList *last; -}; - -/*========================================================*/ -/* CHASSIS RECOGNITION SPECIFIC DATA */ -/*========================================================*/ - -/* Device IDs */ -#define VTR_DEVID_IB_FC_ROUTER 0x5a00 -#define VTR_DEVID_IB_IP_ROUTER 0x5a01 -#define VTR_DEVID_ISR9600_SPINE 0x5a02 -#define VTR_DEVID_ISR9600_LEAF 0x5a03 -#define VTR_DEVID_HCA1 0x5a04 -#define VTR_DEVID_HCA2 0x5a44 -#define VTR_DEVID_HCA3 0x6278 -#define VTR_DEVID_SW_6IB4 0x5a05 -#define VTR_DEVID_ISR9024 0x5a06 -#define VTR_DEVID_ISR9288 0x5a07 -#define VTR_DEVID_SLB24 0x5a09 -#define VTR_DEVID_SFB12 0x5a08 -#define VTR_DEVID_SFB4 0x5a0b -#define VTR_DEVID_ISR9024_12 0x5a0c -#define VTR_DEVID_SLB8 0x5a0d -#define VTR_DEVID_RLX_SWITCH_BLADE 0x5a20 -#define VTR_DEVID_ISR9024_DDR 0x5a31 -#define VTR_DEVID_SFB12_DDR 0x5a32 -#define VTR_DEVID_SFB4_DDR 0x5a33 -#define VTR_DEVID_SLB24_DDR 0x5a34 -#define VTR_DEVID_SFB2012 0x5a37 -#define VTR_DEVID_SLB2024 0x5a38 -#define VTR_DEVID_ISR2012 
0x5a39 -#define VTR_DEVID_SFB2004 0x5a40 -#define VTR_DEVID_ISR2004 0x5a41 -#define VTR_DEVID_SRB2004 0x5a42 - -enum ChassisType { UNRESOLVED_CT, ISR9288_CT, ISR9096_CT, ISR2012_CT, ISR2004_CT }; -enum ChassisSlot { UNRESOLVED_CS, LINE_CS, SPINE_CS, SRBD_CS }; - -/*========================================================*/ -/* External interface */ -/*========================================================*/ - -ChassisList *group_nodes(); -char *portmapstring(Port *port); -char *get_chassis_type(unsigned char chassistype); -char *get_chassis_slot(unsigned char chassisslot); -uint64_t get_chassis_guid(unsigned char chassisnum); - -int is_xsigo_guid(uint64_t guid); -int is_xsigo_tca(uint64_t guid); -int is_xsigo_hca(uint64_t guid); - -#endif /* _GROUPING_H_ */ diff --git a/infiniband-diags/libibnetdisc/src/chassis.c b/infiniband-diags/libibnetdisc/src/chassis.c index a25d710..6b4930e 100644 --- a/infiniband-diags/libibnetdisc/src/chassis.c +++ b/infiniband-diags/libibnetdisc/src/chassis.c @@ -292,19 +292,19 @@ int is_chassis_switch(struct ibnd_node *n) } /* these structs help find Line (Anafa) slot number while using spine portnum */ -int line_slot_2_sfb4[25] = { 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 }; -int anafa_line_slot_2_sfb4[25] = { 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2 }; -int line_slot_2_sfb12[25] = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10, 10, 11, 11, 12, 12 }; -int anafa_line_slot_2_sfb12[25] = { 0, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 }; +char line_slot_2_sfb4[25] = { 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 }; +char anafa_line_slot_2_sfb4[25] = { 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2 }; +char line_slot_2_sfb12[25] = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10, 10, 11, 11, 12, 12 }; +char anafa_line_slot_2_sfb12[25] = { 0, 1, 2, 1, 2, 1, 2, 1, 2, 
1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 }; /* IPR FCR modules connectivity while using sFB4 port as reference */ -int ipr_slot_2_sfb4_port[25] = { 0, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1 }; +char ipr_slot_2_sfb4_port[25] = { 0, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1 }; /* these structs help find Spine (Anafa) slot number while using spine portnum */ -int spine12_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -int anafa_spine12_slot_2_slb[25]= { 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -int spine4_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -int anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; +char spine12_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; +char anafa_spine12_slot_2_slb[25]= { 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; +char spine4_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; +char anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; /* reference { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }; */ static void get_sfb_slot(struct ibnd_node *node, ibnd_port_t *lineport) @@ -337,7 +337,7 @@ static void get_sfb_slot(struct ibnd_node *node, ibnd_port_t *lineport) static void get_router_slot(struct ibnd_node *node, ibnd_port_t *spineport) { ibnd_node_t *n = (ibnd_node_t *)node; - int guessnum = 0; + uint64_t guessnum = 0; node->ch_found = 1; diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index bf7c2a7..479bae7 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ 
b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -150,6 +150,9 @@ query_node(struct ibnd_fabric *fabric, struct ibnd_node *inode, return -1; decode_port_info(port); + port->base_lid = node->smalid; /* LID is still defined by port 0 */ + port->lmc = node->smalmc; + if (!smp_query_via(node->switchinfo, portid, IB_ATTR_SWITCH_INFO, 0, timeout_ms, fabric->ibmad_port)) node->smaenhsp0 = 0; /* assume base SP0 */ @@ -167,7 +170,7 @@ add_port_to_dpath(ib_dr_path_t *path, int nextport) if (path->cnt+2 >= sizeof(path->p)) return -1; ++path->cnt; - path->p[path->cnt] = nextport; + path->p[path->cnt] = (uint8_t) nextport; return path->cnt; } diff --git a/infiniband-diags/man/ibnetdiscover.8 b/infiniband-diags/man/ibnetdiscover.8 index 958efa9..692994b 100644 --- a/infiniband-diags/man/ibnetdiscover.8 +++ b/infiniband-diags/man/ibnetdiscover.8 @@ -37,7 +37,7 @@ List of connected switches List of connected routers .TP \fB\-s\fR, \fB\-\-show\fR -Show more information +Show progress information during discovery. .TP \fB\-\-node\-name\-map\fR Specify a node name map. The node name map file maps GUIDs to more user friendly diff --git a/infiniband-diags/src/grouping.c b/infiniband-diags/src/grouping.c deleted file mode 100644 index 0c30726..0000000 --- a/infiniband-diags/src/grouping.c +++ /dev/null @@ -1,785 +0,0 @@ -/* - * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. - * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. - * - * This software is available to you under a choice of one of two - * licenses. 
You may choose to be licensed under the terms of the GNU - * General Public License (GPL) Version 2, available from the file - * COPYING in the main directory of this source tree, or the - * OpenIB.org BSD license below: - * - * Redistribution and use in source and binary forms, with or - * without modification, are permitted provided that the following - * conditions are met: - * - * - Redistributions of source code must retain the above - * copyright notice, this list of conditions and the following - * disclaimer. - * - * - Redistributions in binary form must reproduce the above - * copyright notice, this list of conditions and the following - * disclaimer in the documentation and/or other materials - * provided with the distribution. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, - * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF - * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND - * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS - * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN - * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN - * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE - * SOFTWARE. 
- * - */ - -/*========================================================*/ -/* FABRIC SCANNER SPECIFIC DATA */ -/*========================================================*/ - -#if HAVE_CONFIG_H -# include -#endif /* HAVE_CONFIG_H */ - -#include -#include - -#include - -#include "ibnetdiscover.h" -#include "grouping.h" - -#define OUT_BUFFER_SIZE 16 - - -extern Node *nodesdist[MAXHOPS+1]; /* last is CA list */ -extern Node *mynode; -extern Port *myport; -extern int maxhops_discovered; - -AllChassisList mylist; - -char *ChassisTypeStr[5] = { "", "ISR9288", "ISR9096", "ISR2012", "ISR2004" }; -char *ChassisSlotStr[4] = { "", "Line", "Spine", "SRBD" }; - - -char *get_chassis_type(unsigned char chassistype) -{ - if (chassistype == UNRESOLVED_CT || chassistype > ISR2004_CT) - return NULL; - return ChassisTypeStr[chassistype]; -} - -char *get_chassis_slot(unsigned char chassisslot) -{ - if (chassisslot == UNRESOLVED_CS || chassisslot > SRBD_CS) - return NULL; - return ChassisSlotStr[chassisslot]; -} - -static struct ChassisList *find_chassisnum(unsigned char chassisnum) -{ - ChassisList *current; - - for (current = mylist.first; current; current = current->next) { - if (current->chassisnum == chassisnum) - return current; - } - - return NULL; -} - -static uint64_t topspin_chassisguid(uint64_t guid) -{ - /* Byte 3 in system image GUID is chassis type, and */ - /* Byte 4 is location ID (slot) so just mask off byte 4 */ - return guid & 0xffffffff00ffffffULL; -} - -int is_xsigo_guid(uint64_t guid) -{ - if ((guid & 0xffffff0000000000ULL) == 0x0013970000000000ULL) - return 1; - else - return 0; -} - -static int is_xsigo_leafone(uint64_t guid) -{ - if ((guid & 0xffffffffff000000ULL) == 0x0013970102000000ULL) - return 1; - else - return 0; -} - -int is_xsigo_hca(uint64_t guid) -{ - /* NodeType 2 is HCA */ - if ((guid & 0xffffffff00000000ULL) == 0x0013970200000000ULL) - return 1; - else - return 0; -} - -int is_xsigo_tca(uint64_t guid) -{ - /* NodeType 3 is TCA */ - if ((guid & 
0xffffffff00000000ULL) == 0x0013970300000000ULL) - return 1; - else - return 0; -} - -static int is_xsigo_ca(uint64_t guid) -{ - if (is_xsigo_hca(guid) || is_xsigo_tca(guid)) - return 1; - else - return 0; -} - -static int is_xsigo_switch(uint64_t guid) -{ - if ((guid & 0xffffffff00000000ULL) == 0x0013970100000000ULL) - return 1; - else - return 0; -} - -static uint64_t xsigo_chassisguid(Node *node) -{ - if (!is_xsigo_ca(node->sysimgguid)) { - /* Byte 3 is NodeType and byte 4 is PortType */ - /* If NodeType is 1 (switch), PortType is masked */ - if (is_xsigo_switch(node->sysimgguid)) - return node->sysimgguid & 0xffffffff00ffffffULL; - else - return node->sysimgguid; - } else { - /* Is there a peer port ? */ - if (!node->ports->remoteport) - return node->sysimgguid; - - /* If peer port is Leaf 1, use its chassis GUID */ - if (is_xsigo_leafone(node->ports->remoteport->node->sysimgguid)) - return node->ports->remoteport->node->sysimgguid & - 0xffffffff00ffffffULL; - else - return node->sysimgguid; - } -} - -static uint64_t get_chassisguid(Node *node) -{ - if (node->vendid == TS_VENDOR_ID || node->vendid == SS_VENDOR_ID) - return topspin_chassisguid(node->sysimgguid); - else if (node->vendid == XS_VENDOR_ID || is_xsigo_guid(node->sysimgguid)) - return xsigo_chassisguid(node); - else - return node->sysimgguid; -} - -static struct ChassisList *find_chassisguid(Node *node) -{ - ChassisList *current; - uint64_t chguid; - - chguid = get_chassisguid(node); - for (current = mylist.first; current; current = current->next) { - if (current->chassisguid == chguid) - return current; - } - - return NULL; -} - -uint64_t get_chassis_guid(unsigned char chassisnum) -{ - ChassisList *chassis; - - chassis = find_chassisnum(chassisnum); - if (chassis) - return chassis->chassisguid; - else - return 0; -} - -static int is_router(Node *node) -{ - return (node->devid == VTR_DEVID_IB_FC_ROUTER || - node->devid == VTR_DEVID_IB_IP_ROUTER); -} - -static int is_spine_9096(Node *node) -{ - return 
(node->devid == VTR_DEVID_SFB4 || - node->devid == VTR_DEVID_SFB4_DDR); -} - -static int is_spine_9288(Node *node) -{ - return (node->devid == VTR_DEVID_SFB12 || - node->devid == VTR_DEVID_SFB12_DDR); -} - -static int is_spine_2004(Node *node) -{ - return (node->devid == VTR_DEVID_SFB2004); -} - -static int is_spine_2012(Node *node) -{ - return (node->devid == VTR_DEVID_SFB2012); -} - -static int is_spine(Node *node) -{ - return (is_spine_9096(node) || is_spine_9288(node) || - is_spine_2004(node) || is_spine_2012(node)); -} - -static int is_line_24(Node *node) -{ - return (node->devid == VTR_DEVID_SLB24 || - node->devid == VTR_DEVID_SLB24_DDR || - node->devid == VTR_DEVID_SRB2004); -} - -static int is_line_8(Node *node) -{ - return (node->devid == VTR_DEVID_SLB8); -} - -static int is_line_2024(Node *node) -{ - return (node->devid == VTR_DEVID_SLB2024); -} - -static int is_line(Node *node) -{ - return (is_line_24(node) || is_line_8(node) || is_line_2024(node)); -} - -int is_chassis_switch(Node *node) -{ - return (is_spine(node) || is_line(node)); -} - -/* these structs help find Line (Anafa) slot number while using spine portnum */ -char line_slot_2_sfb4[25] = { 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 }; -char anafa_line_slot_2_sfb4[25] = { 0, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2 }; -char line_slot_2_sfb12[25] = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10, 10, 11, 11, 12, 12 }; -char anafa_line_slot_2_sfb12[25] = { 0, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 }; - -/* IPR FCR modules connectivity while using sFB4 port as reference */ -char ipr_slot_2_sfb4_port[25] = { 0, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1, 3, 2, 1 }; - -/* these structs help find Spine (Anafa) slot number while using spine portnum */ -char spine12_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -char 
anafa_spine12_slot_2_slb[25]= { 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -char spine4_slot_2_slb[25] = { 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -char anafa_spine4_slot_2_slb[25] = { 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; -/* reference { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24 }; */ - -static void get_sfb_slot(Node *node, Port *lineport) -{ - ChassisRecord *ch = node->chrecord; - - ch->chassisslot = SPINE_CS; - if (is_spine_9096(node)) { - ch->chassistype = ISR9096_CT; - ch->slotnum = spine4_slot_2_slb[lineport->portnum]; - ch->anafanum = anafa_spine4_slot_2_slb[lineport->portnum]; - } else if (is_spine_9288(node)) { - ch->chassistype = ISR9288_CT; - ch->slotnum = spine12_slot_2_slb[lineport->portnum]; - ch->anafanum = anafa_spine12_slot_2_slb[lineport->portnum]; - } else if (is_spine_2012(node)) { - ch->chassistype = ISR2012_CT; - ch->slotnum = spine12_slot_2_slb[lineport->portnum]; - ch->anafanum = anafa_spine12_slot_2_slb[lineport->portnum]; - } else if (is_spine_2004(node)) { - ch->chassistype = ISR2004_CT; - ch->slotnum = spine4_slot_2_slb[lineport->portnum]; - ch->anafanum = anafa_spine4_slot_2_slb[lineport->portnum]; - } else { - IBPANIC("Unexpected node found: guid 0x%016" PRIx64, node->nodeguid); - } -} - -static void get_router_slot(Node *node, Port *spineport) -{ - ChassisRecord *ch = node->chrecord; - uint64_t guessnum = 0; - - if (!ch) { - if (!(node->chrecord = calloc(1, sizeof(ChassisRecord)))) - IBPANIC("out of mem"); - ch = node->chrecord; - } - - ch->chassisslot = SRBD_CS; - if (is_spine_9096(spineport->node)) { - ch->chassistype = ISR9096_CT; - ch->slotnum = line_slot_2_sfb4[spineport->portnum]; - ch->anafanum = ipr_slot_2_sfb4_port[spineport->portnum]; - } else if (is_spine_9288(spineport->node)) { - ch->chassistype = ISR9288_CT; - ch->slotnum = line_slot_2_sfb12[spineport->portnum]; - /* this is a 
smart guess based on nodeguids order on sFB-12 module */ - guessnum = spineport->node->nodeguid % 4; - /* module 1 <--> remote anafa 3 */ - /* module 2 <--> remote anafa 2 */ - /* module 3 <--> remote anafa 1 */ - ch->anafanum = (guessnum == 3 ? 1 : (guessnum == 1 ? 3 : 2)); - } else if (is_spine_2012(spineport->node)) { - ch->chassistype = ISR2012_CT; - ch->slotnum = line_slot_2_sfb12[spineport->portnum]; - /* this is a smart guess based on nodeguids order on sFB-12 module */ - guessnum = spineport->node->nodeguid % 4; - /* module 1 <--> remote anafa 3 */ - /* module 2 <--> remote anafa 2 */ - /* module 3 <--> remote anafa 1 */ - ch->anafanum = (guessnum == 3? 1 : (guessnum == 1 ? 3 : 2)); - } else if (is_spine_2004(spineport->node)) { - ch->chassistype = ISR2004_CT; - ch->slotnum = line_slot_2_sfb4[spineport->portnum]; - ch->anafanum = ipr_slot_2_sfb4_port[spineport->portnum]; - } else { - IBPANIC("Unexpected node found: guid 0x%016" PRIx64, spineport->node->nodeguid); - } -} - -static void get_slb_slot(ChassisRecord *ch, Port *spineport) -{ - ch->chassisslot = LINE_CS; - if (is_spine_9096(spineport->node)) { - ch->chassistype = ISR9096_CT; - ch->slotnum = line_slot_2_sfb4[spineport->portnum]; - ch->anafanum = anafa_line_slot_2_sfb4[spineport->portnum]; - } else if (is_spine_9288(spineport->node)) { - ch->chassistype = ISR9288_CT; - ch->slotnum = line_slot_2_sfb12[spineport->portnum]; - ch->anafanum = anafa_line_slot_2_sfb12[spineport->portnum]; - } else if (is_spine_2012(spineport->node)) { - ch->chassistype = ISR2012_CT; - ch->slotnum = line_slot_2_sfb12[spineport->portnum]; - ch->anafanum = anafa_line_slot_2_sfb12[spineport->portnum]; - } else if (is_spine_2004(spineport->node)) { - ch->chassistype = ISR2004_CT; - ch->slotnum = line_slot_2_sfb4[spineport->portnum]; - ch->anafanum = anafa_line_slot_2_sfb4[spineport->portnum]; - } else { - IBPANIC("Unexpected node found: guid 0x%016" PRIx64, spineport->node->nodeguid); - } -} - -/* - This function called for 
every Voltaire node in fabric - It could be optimized so, but time overhead is very small - and its only diag.util -*/ -static void fill_chassis_record(Node *node) -{ - Port *port; - Node *remnode = 0; - ChassisRecord *ch = 0; - - if (node->chrecord) /* somehow this node has already been passed */ - return; - - if (!(node->chrecord = calloc(1, sizeof(ChassisRecord)))) - IBPANIC("out of mem"); - - ch = node->chrecord; - - /* node is router only in case of using unique lid */ - /* (which is lid of chassis router port) */ - /* in such case node->ports is actually a requested port... */ - if (is_router(node) && is_spine(node->ports->remoteport->node)) - get_router_slot(node, node->ports->remoteport); - else if (is_spine(node)) { - for (port = node->ports; port; port = port->next) { - if (!port->remoteport) - continue; - remnode = port->remoteport->node; - if (remnode->type != SWITCH_NODE) { - if (!remnode->chrecord) - get_router_slot(remnode, port); - continue; - } - if (!ch->chassistype) - /* we assume here that remoteport belongs to line */ - get_sfb_slot(node, port->remoteport); - - /* we could break here, but need to find if more routers connected */ - } - - } else if (is_line(node)) { - for (port = node->ports; port; port = port->next) { - if (port->portnum > 12) - continue; - if (!port->remoteport) - continue; - /* we assume here that remoteport belongs to spine */ - get_slb_slot(ch, port->remoteport); - break; - } - } - - return; -} - -static int get_line_index(Node *node) -{ - int retval = 3 * (node->chrecord->slotnum - 1) + node->chrecord->anafanum; - - if (retval > LINES_MAX_NUM || retval < 1) - IBPANIC("Internal error"); - return retval; -} - -static int get_spine_index(Node *node) -{ - int retval; - - if (is_spine_9288(node) || is_spine_2012(node)) - retval = 3 * (node->chrecord->slotnum - 1) + node->chrecord->anafanum; - else - retval = node->chrecord->slotnum; - - if (retval > SPINES_MAX_NUM || retval < 1) - IBPANIC("Internal error"); - return retval; -} 
- -static void insert_line_router(Node *node, ChassisList *chassislist) -{ - int i = get_line_index(node); - - if (chassislist->linenode[i]) - return; /* already filled slot */ - - chassislist->linenode[i] = node; - node->chrecord->chassisnum = chassislist->chassisnum; -} - -static void insert_spine(Node *node, ChassisList *chassislist) -{ - int i = get_spine_index(node); - - if (chassislist->spinenode[i]) - return; /* already filled slot */ - - chassislist->spinenode[i] = node; - node->chrecord->chassisnum = chassislist->chassisnum; -} - -static void pass_on_lines_catch_spines(ChassisList *chassislist) -{ - Node *node, *remnode; - Port *port; - int i; - - for (i = 1; i <= LINES_MAX_NUM; i++) { - node = chassislist->linenode[i]; - - if (!(node && is_line(node))) - continue; /* empty slot or router */ - - for (port = node->ports; port; port = port->next) { - if (port->portnum > 12) - continue; - - if (!port->remoteport) - continue; - remnode = port->remoteport->node; - - if (!remnode->chrecord) - continue; /* some error - spine not initialized ? FIXME */ - insert_spine(remnode, chassislist); - } - } -} - -static void pass_on_spines_catch_lines(ChassisList *chassislist) -{ - Node *node, *remnode; - Port *port; - int i; - - for (i = 1; i <= SPINES_MAX_NUM; i++) { - node = chassislist->spinenode[i]; - if (!node) - continue; /* empty slot */ - for (port = node->ports; port; port = port->next) { - if (!port->remoteport) - continue; - remnode = port->remoteport->node; - - if (!remnode->chrecord) - continue; /* some error - line/router not initialized ? FIXME */ - insert_line_router(remnode, chassislist); - } - } -} - -/* - Stupid interpolation algorithm... 
- But nothing to do - have to be compliant with VoltaireSM/NMS -*/ -static void pass_on_spines_interpolate_chguid(ChassisList *chassislist) -{ - Node *node; - int i; - - for (i = 1; i <= SPINES_MAX_NUM; i++) { - node = chassislist->spinenode[i]; - if (!node) - continue; /* skip the empty slots */ - - /* take first guid minus one to be consistent with SM */ - chassislist->chassisguid = node->nodeguid - 1; - break; - } -} - -/* - This function fills chassislist structure with all nodes - in that chassis - chassislist structure = structure of one standalone chassis -*/ -static void build_chassis(Node *node, ChassisList *chassislist) -{ - Node *remnode = 0; - Port *port = 0; - - /* we get here with node = chassis_spine */ - chassislist->chassistype = node->chrecord->chassistype; - insert_spine(node, chassislist); - - /* loop: pass on all ports of node */ - for (port = node->ports; port; port = port->next) { - if (!port->remoteport) - continue; - remnode = port->remoteport->node; - - if (!remnode->chrecord) - continue; /* some error - line or router not initialized ? FIXME */ - - insert_line_router(remnode, chassislist); - } - - pass_on_lines_catch_spines(chassislist); - /* this pass needed for to catch routers, since routers connected only */ - /* to spines in slot 1 or 4 and we could miss them first time */ - pass_on_spines_catch_lines(chassislist); - - /* additional 2 passes needed for to overcome a problem of pure "in-chassis" */ - /* connectivity - extra pass to ensure that all related chips/modules */ - /* inserted into the chassislist */ - pass_on_lines_catch_spines(chassislist); - pass_on_spines_catch_lines(chassislist); - pass_on_spines_interpolate_chguid(chassislist); -} - -/*========================================================*/ -/* INTERNAL TO EXTERNAL PORT MAPPING */ -/*========================================================*/ - -/* -Description : On ISR9288/9096 external ports indexing - is not matching the internal ( anafa ) port - indexes. 
Use this MAP to translate the data you get from - the OpenIB diagnostics (smpquery, ibroute, ibtracert, etc.) - - -Module : sLB-24 - anafa 1 anafa 2 -ext port | 13 14 15 16 17 18 | 19 20 21 22 23 24 -int port | 22 23 24 18 17 16 | 22 23 24 18 17 16 -ext port | 1 2 3 4 5 6 | 7 8 9 10 11 12 -int port | 19 20 21 15 14 13 | 19 20 21 15 14 13 ------------------------------------------------- - -Module : sLB-8 - anafa 1 anafa 2 -ext port | 13 14 15 16 17 18 | 19 20 21 22 23 24 -int port | 24 23 22 18 17 16 | 24 23 22 18 17 16 -ext port | 1 2 3 4 5 6 | 7 8 9 10 11 12 -int port | 21 20 19 15 14 13 | 21 20 19 15 14 13 - ------------> - anafa 1 anafa 2 -ext port | - - 5 - - 6 | - - 7 - - 8 -int port | 24 23 22 18 17 16 | 24 23 22 18 17 16 -ext port | - - 1 - - 2 | - - 3 - - 4 -int port | 21 20 19 15 14 13 | 21 20 19 15 14 13 ------------------------------------------------- - -Module : sLB-2024 - -ext port | 13 14 15 16 17 18 19 20 21 22 23 24 -A1 int port| 13 14 15 16 17 18 19 20 21 22 23 24 -ext port | 1 2 3 4 5 6 7 8 9 10 11 12 -A2 int port| 13 14 15 16 17 18 19 20 21 22 23 24 ---------------------------------------------------- - -*/ - -int int2ext_map_slb24[2][25] = { - { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 5, 4, 18, 17, 16, 1, 2, 3, 13, 14, 15 }, - { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 11, 10, 24, 23, 22, 7, 8, 9, 19, 20, 21 } - }; -int int2ext_map_slb8[2][25] = { - { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 6, 6, 6, 1, 1, 1, 5, 5, 5 }, - { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 8, 8, 8, 3, 3, 3, 7, 7, 7 } - }; -int int2ext_map_slb2024[2][25] = { - { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }, - { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 } - }; -/* reference { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 }; */ - -/* - This function relevant only for line modules/chips - Returns string with external port index -*/ 
-char *portmapstring(Port *port) -{ - static char mapping[OUT_BUFFER_SIZE]; - ChassisRecord *ch = port->node->chrecord; - int portnum = port->portnum; - int chipnum = 0; - int pindex = 0; - Node *node = port->node; - - if (!ch || !is_line(node) || (portnum < 13 || portnum > 24)) - return NULL; - - if (ch->anafanum < 1 || ch->anafanum > 2) - return NULL; - - memset(mapping, 0, sizeof(mapping)); - - chipnum = ch->anafanum - 1; - - if (is_line_24(node)) - pindex = int2ext_map_slb24[chipnum][portnum]; - else if (is_line_2024(node)) - pindex = int2ext_map_slb2024[chipnum][portnum]; - else - pindex = int2ext_map_slb8[chipnum][portnum]; - - sprintf(mapping, "[ext %d]", pindex); - - return mapping; -} - -static void add_chassislist() -{ - if (!(mylist.current = calloc(1, sizeof(ChassisList)))) - IBPANIC("out of mem"); - - if (mylist.first == NULL) { - mylist.first = mylist.current; - mylist.last = mylist.current; - } else { - mylist.last->next = mylist.current; - mylist.current->next = NULL; - mylist.last = mylist.current; - } -} - -/* - Main grouping function - Algorithm: - 1. pass on every Voltaire node - 2. catch spine chip for every Voltaire node - 2.1 build/interpolate chassis around this chip - 2.2 go to 1. - 3. pass on non Voltaire nodes (SystemImageGUID based grouping) - 4. now group non Voltaire nodes by SystemImageGUID -*/ -ChassisList *group_nodes() -{ - Node *node; - int dist; - int chassisnum = 0; - struct ChassisList *chassis; - - mylist.first = NULL; - mylist.current = NULL; - mylist.last = NULL; - - /* first pass on switches and build for every Voltaire node */ - /* an appropriate chassis record (slotnum and position) */ - /* according to internal connectivity */ - /* not very efficient but clear code so... 
*/ - for (dist = 0; dist <= maxhops_discovered; dist++) { - for (node = nodesdist[dist]; node; node = node->dnext) { - if (node->vendid == VTR_VENDOR_ID) - fill_chassis_record(node); - } - } - - /* separate every Voltaire chassis from each other and build linked list of them */ - /* algorithm: catch spine and find all surrounding nodes */ - for (dist = 0; dist <= maxhops_discovered; dist++) { - for (node = nodesdist[dist]; node; node = node->dnext) { - if (node->vendid != VTR_VENDOR_ID) - continue; - if (!node->chrecord || node->chrecord->chassisnum || !is_spine(node)) - continue; - add_chassislist(); - mylist.current->chassisnum = ++chassisnum; - build_chassis(node, mylist.current); - } - } - - /* now make pass on nodes for chassis which are not Voltaire */ - /* grouped by common SystemImageGUID */ - for (dist = 0; dist <= maxhops_discovered; dist++) { - for (node = nodesdist[dist]; node; node = node->dnext) { - if (node->vendid == VTR_VENDOR_ID) - continue; - if (node->sysimgguid) { - chassis = find_chassisguid(node); - if (chassis) - chassis->nodecount++; - else { - /* Possible new chassis */ - add_chassislist(); - mylist.current->chassisguid = get_chassisguid(node); - mylist.current->nodecount = 1; - } - } - } - } - - /* now, make another pass to see which nodes are part of chassis */ - /* (defined as chassis->nodecount > 1) */ - for (dist = 0; dist <= MAXHOPS; ) { - for (node = nodesdist[dist]; node; node = node->dnext) { - if (node->vendid == VTR_VENDOR_ID) - continue; - if (node->sysimgguid) { - chassis = find_chassisguid(node); - if (chassis && chassis->nodecount > 1) { - if (!chassis->chassisnum) - chassis->chassisnum = ++chassisnum; - if (!node->chrecord) { - if (!(node->chrecord = calloc(1, sizeof(ChassisRecord)))) - IBPANIC("out of mem"); - node->chrecord->chassisnum = chassis->chassisnum; - } - } - } - } - if (dist == maxhops_discovered) - dist = MAXHOPS; /* skip to CAs */ - else - dist++; - } - - return (mylist.first); -} diff --git 
a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index 25c1f7f..99750f0 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -1,6 +1,7 @@ /* * Copyright (c) 2004-2008 Voltaire Inc. All rights reserved. * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. + * Copyright (c) 2008 Lawrence Livermore National Lab. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -48,445 +49,108 @@ #include #include #include +#include #include "ibnetdiscover.h" -#include "grouping.h" #include "ibdiag_common.h" struct ibmad_port *srcport; -static char *node_type_str[] = { - "???", - "ca", - "switch", - "router", - "iwarp rnic" -}; - -static char *linkwidth_str[] = { - "??", - "1x", - "4x", - "??", - "8x", - "??", - "??", - "??", - "12x" -}; - -static char *linkspeed_str[] = { - "???", - "SDR", - "DDR", - "???", - "QDR" -}; - static int timeout = 2000; /* ms */ -static int dumplevel = 0; static FILE *f; static char *node_name_map_file = NULL; static nn_map_t *node_name_map = NULL; -Node *nodesdist[MAXHOPS+1]; /* last is Ca list */ -Node *mynode; -int maxhops_discovered = 0; - -struct ChassisList *chassis = NULL; - -static char * -get_linkwidth_str(int linkwidth) +/** + * Define our own conversion functions to maintain compatibility with the old + * ibnetdiscover which did not use the ibmad conversion functions. 
+ */ +char *dump_linkspeed_compat(uint32_t speed) { - if (linkwidth > 8) - return linkwidth_str[0]; - else - return linkwidth_str[linkwidth]; + switch (speed) { + case 1: + return ("SDR"); + break; + case 2: + return ("DDR"); + break; + case 4: + return ("QDR"); + break; + } + return ("???"); } -static char * -get_linkspeed_str(int linkspeed) +char *dump_linkwidth_compat(uint32_t width) { - if (linkspeed > 4) - return linkspeed_str[0]; - else - return linkspeed_str[linkspeed]; + switch (width) { + case 1: + return ("1x"); + break; + case 2: + return ("4x"); + break; + case 4: + return ("8x"); + break; + case 8: + return ("12x"); + break; + } + return ("??"); } static inline const char* -node_type_str2(Node *node) +ports_nt_str_compat(ibnd_node_t *node) { switch(node->type) { - case SWITCH_NODE: return "SW"; - case CA_NODE: return "CA"; - case ROUTER_NODE: return "RT"; + case IB_NODE_SWITCH: return "SW"; + case IB_NODE_CA: return "CA"; + case IB_NODE_ROUTER: return "RT"; } return "??"; } -void -decode_port_info(void *pi, Port *port) -{ - mad_decode_field(pi, IB_PORT_LID_F, &port->lid); - mad_decode_field(pi, IB_PORT_LMC_F, &port->lmc); - mad_decode_field(pi, IB_PORT_STATE_F, &port->state); - mad_decode_field(pi, IB_PORT_PHYS_STATE_F, &port->physstate); - mad_decode_field(pi, IB_PORT_LINK_WIDTH_ACTIVE_F, &port->linkwidth); - mad_decode_field(pi, IB_PORT_LINK_SPEED_ACTIVE_F, &port->linkspeed); -} - - -int -get_port(Port *port, int portnum, ib_portid_t *portid) -{ - char portinfo[64]; - void *pi = portinfo; - - port->portnum = portnum; - - if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, portnum, timeout, - srcport)) - return -1; - decode_port_info(pi, port); - - DEBUG("portid %s portnum %d: lid %d state %d physstate %d %s %s", - portid2str(portid), portnum, port->lid, port->state, port->physstate, get_linkwidth_str(port->linkwidth), get_linkspeed_str(port->linkspeed)); - return 1; -} -/* - * Returns 0 if non switch node is found, 1 if switch is found, -1 if error. 
- */ -int -get_node(Node *node, Port *port, ib_portid_t *portid) -{ - char portinfo[64]; - char switchinfo[64]; - void *pi = portinfo, *ni = node->nodeinfo, *nd = node->nodedesc; - void *si = switchinfo; - - if (!smp_query_via(ni, portid, IB_ATTR_NODE_INFO, 0, timeout, srcport)) - return -1; - - mad_decode_field(ni, IB_NODE_GUID_F, &node->nodeguid); - mad_decode_field(ni, IB_NODE_TYPE_F, &node->type); - mad_decode_field(ni, IB_NODE_NPORTS_F, &node->numports); - mad_decode_field(ni, IB_NODE_DEVID_F, &node->devid); - mad_decode_field(ni, IB_NODE_VENDORID_F, &node->vendid); - mad_decode_field(ni, IB_NODE_SYSTEM_GUID_F, &node->sysimgguid); - mad_decode_field(ni, IB_NODE_PORT_GUID_F, &node->portguid); - mad_decode_field(ni, IB_NODE_LOCAL_PORT_F, &node->localport); - port->portnum = node->localport; - port->portguid = node->portguid; - - if (!smp_query_via(nd, portid, IB_ATTR_NODE_DESC, 0, timeout, srcport)) - return -1; - - if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, 0, timeout, srcport)) - return -1; - decode_port_info(pi, port); - - if (node->type != SWITCH_NODE) - return 0; - - node->smalid = port->lid; - node->smalmc = port->lmc; - - /* after we have the sma information find out the real PortInfo for this port */ - if (!smp_query_via(pi, portid, IB_ATTR_PORT_INFO, node->localport, - timeout, srcport)) - return -1; - decode_port_info(pi, port); - - port->lid = node->smalid; /* LID is still defined by port 0 */ - port->lmc = node->smalmc; - - if (!smp_query_via(si, portid, IB_ATTR_SWITCH_INFO, 0, timeout, srcport)) - node->smaenhsp0 = 0; /* assume base SP0 */ - else - mad_decode_field(si, IB_SW_ENHANCED_PORT0_F, &node->smaenhsp0); - - DEBUG("portid %s: got switch node %" PRIx64 " '%s'", - portid2str(portid), node->nodeguid, node->nodedesc); - return 1; -} - -static int -extend_dpath(ib_dr_path_t *path, int nextport) -{ - if (path->cnt+2 >= sizeof(path->p)) - return -1; - ++path->cnt; - if (path->cnt > maxhops_discovered) - maxhops_discovered = path->cnt; - 
path->p[path->cnt] = (uint8_t) nextport; - return path->cnt; -} - -static void -dump_endnode(ib_portid_t *path, char *prompt, Node *node, Port *port) -{ - if (!dumplevel) - return; - - fprintf(f, "%s -> %s %s {%016" PRIx64 "} portnum %d lid %d-%d\"%s\"\n", - portid2str(path), prompt, - (node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"), - node->nodeguid, node->type == SWITCH_NODE ? 0 : port->portnum, - port->lid, port->lid + (1 << port->lmc) - 1, - clean_nodedesc(node->nodedesc)); -} - -#define HASHGUID(guid) ((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103))) -#define HTSZ 137 - -static Node *nodestbl[HTSZ]; - -static Node * -find_node(Node *new) -{ - int hash = HASHGUID(new->nodeguid) % HTSZ; - Node *node; - - for (node = nodestbl[hash]; node; node = node->htnext) - if (node->nodeguid == new->nodeguid) - return node; - - return NULL; -} - -static Node * -create_node(Node *temp, ib_portid_t *path, int dist) -{ - Node *node; - int hash = HASHGUID(temp->nodeguid) % HTSZ; - - node = malloc(sizeof(*node)); - if (!node) - return NULL; - - memcpy(node, temp, sizeof(*node)); - node->dist = dist; - node->path = *path; - - node->htnext = nodestbl[hash]; - nodestbl[hash] = node; - - if (node->type != SWITCH_NODE) - dist = MAXHOPS; /* special Ca list */ - - node->dnext = nodesdist[dist]; - nodesdist[dist] = node; - - return node; -} - -static Port * -find_port(Node *node, Port *port) -{ - Port *old; - - for (old = node->ports; old; old = old->next) - if (old->portnum == port->portnum) - return old; - - return NULL; -} - -static Port * -create_port(Node *node, Port *temp) -{ - Port *port; - - port = malloc(sizeof(*port)); - if (!port) - return NULL; - - memcpy(port, temp, sizeof(*port)); - port->node = node; - port->next = node->ports; - node->ports = port; - - return port; -} - -static void -link_ports(Node *node, Port *port, Node *remotenode, Port *remoteport) -{ - DEBUG("linking: 0x%" PRIx64 " %p->%p:%u and 0x%" PRIx64 " %p->%p:%u", - 
node->nodeguid, node, port, port->portnum, - remotenode->nodeguid, remotenode, remoteport, remoteport->portnum); - if (port->remoteport) - port->remoteport->remoteport = NULL; - if (remoteport->remoteport) - remoteport->remoteport->remoteport = NULL; - port->remoteport = remoteport; - remoteport->remoteport = port; -} - -static int -handle_port(Node *node, Port *port, ib_portid_t *path, int portnum, int dist) -{ - Node node_buf; - Port port_buf; - Node *remotenode, *oldnode; - Port *remoteport, *oldport; - - memset(&node_buf, 0, sizeof(node_buf)); - memset(&port_buf, 0, sizeof(port_buf)); - - DEBUG("handle node %p port %p:%d dist %d", node, port, portnum, dist); - if (port->physstate != 5) /* LinkUp */ - return -1; - - if (extend_dpath(&path->drpath, portnum) < 0) - return -1; - - if (get_node(&node_buf, &port_buf, path) < 0) { - IBWARN("NodeInfo on %s failed, skipping port", - portid2str(path)); - path->drpath.cnt--; /* restore path */ - return -1; - } - - oldnode = find_node(&node_buf); - if (oldnode) - remotenode = oldnode; - else if (!(remotenode = create_node(&node_buf, path, dist + 1))) - IBERROR("no memory"); - - oldport = find_port(remotenode, &port_buf); - if (oldport) { - remoteport = oldport; - if (node != remotenode || port != remoteport) - IBWARN("port moving..."); - } else if (!(remoteport = create_port(remotenode, &port_buf))) - IBERROR("no memory"); - - dump_endnode(path, oldnode ? "known remote" : "new remote", - remotenode, remoteport); - - link_ports(node, port, remotenode, remoteport); - - path->drpath.cnt--; /* restore path */ - return 0; -} - -/* - * Return 1 if found, 0 if not, -1 on errors. 
- */ -static int -discover(ib_portid_t *from) -{ - Node node_buf; - Port port_buf; - Node *node; - Port *port; - int i; - int dist = 0; - ib_portid_t *path; - - DEBUG("from %s", portid2str(from)); - - memset(&node_buf, 0, sizeof(node_buf)); - memset(&port_buf, 0, sizeof(port_buf)); - - if (get_node(&node_buf, &port_buf, from) < 0) { - IBWARN("can't reach node %s", portid2str(from)); - return -1; - } - - node = create_node(&node_buf, from, 0); - if (!node) - IBERROR("out of memory"); - - mynode = node; - - port = create_port(node, &port_buf); - if (!port) - IBERROR("out of memory"); - - if (node->type != SWITCH_NODE && - handle_port(node, port, from, node->localport, 0) < 0) - return 0; - - for (dist = 0; dist < MAXHOPS; dist++) { - - for (node = nodesdist[dist]; node; node = node->dnext) { - - path = &node->path; - - DEBUG("dist %d node %p", dist, node); - dump_endnode(path, "processing", node, port); - - for (i = 1; i <= node->numports; i++) { - if (i == node->localport) - continue; - - if (get_port(&port_buf, i, path) < 0) { - IBWARN("can't reach node %s port %d", portid2str(path), i); - continue; - } - - port = find_port(node, &port_buf); - if (port) - continue; - - port = create_port(node, &port_buf); - if (!port) - IBERROR("out of memory"); - - /* If switch, set port GUID to node GUID */ - if (node->type == SWITCH_NODE) - port->portguid = node->portguid; - - handle_port(node, port, path, i, dist); - } - } - } - - return 0; -} - char * -node_name(Node *node) +node_name(ibnd_node_t *node) { static char buf[256]; switch(node->type) { - case SWITCH_NODE: + case IB_NODE_SWITCH: sprintf(buf, "\"%s", "S"); break; - case CA_NODE: + case IB_NODE_CA: sprintf(buf, "\"%s", "H"); break; - case ROUTER_NODE: + case IB_NODE_ROUTER: sprintf(buf, "\"%s", "R"); break; default: sprintf(buf, "\"%s", "?"); break; } - sprintf(buf+2, "-%016" PRIx64 "\"", node->nodeguid); + sprintf(buf+2, "-%016" PRIx64 "\"", node->guid); return buf; } void -list_node(Node *node) 
+list_node(ibnd_node_t *node, void *user_data) { char *node_type; - char *nodename = remap_node_name(node_name_map, node->nodeguid, + char *nodename = remap_node_name(node_name_map, node->guid, node->nodedesc); switch(node->type) { - case SWITCH_NODE: + case IB_NODE_SWITCH: node_type = "Switch"; break; - case CA_NODE: + case IB_NODE_CA: node_type = "Ca"; break; - case ROUTER_NODE: + case IB_NODE_ROUTER: node_type = "Router"; break; default: @@ -495,36 +159,58 @@ list_node(Node *node) } fprintf(f, "%s\t : 0x%016" PRIx64 " ports %d devid 0x%x vendid 0x%x \"%s\"\n", node_type, - node->nodeguid, node->numports, node->devid, node->vendid, + node->guid, node->numports, + mad_get_field(node->info, 0, IB_NODE_DEVID_F), + mad_get_field(node->info, 0, IB_NODE_VENDORID_F), nodename); free(nodename); } void -out_ids(Node *node, int group, char *chname) +list_nodes(ibnd_fabric_t *fabric, int list) { - fprintf(f, "\nvendid=0x%x\ndevid=0x%x\n", node->vendid, node->devid); - if (node->sysimgguid) - fprintf(f, "sysimgguid=0x%" PRIx64, node->sysimgguid); + if (list & LIST_CA_NODE) { + ibnd_iter_nodes_type(fabric, list_node, IB_NODE_CA, NULL); + } + if (list & LIST_SWITCH_NODE) { + ibnd_iter_nodes_type(fabric, list_node, IB_NODE_SWITCH, NULL); + } + if (list & LIST_ROUTER_NODE) { + ibnd_iter_nodes_type(fabric, list_node, IB_NODE_ROUTER, NULL); + } +} + +void +out_ids(ibnd_node_t *node, int group, char *chname) +{ + uint64_t sysimgguid = mad_get_field64(node->info, 0, IB_NODE_SYSTEM_GUID_F); + + fprintf(f, "\nvendid=0x%x\ndevid=0x%x\n", + mad_get_field(node->info, 0, IB_NODE_VENDORID_F), + mad_get_field(node->info, 0, IB_NODE_DEVID_F)); + if (sysimgguid) + fprintf(f, "sysimgguid=0x%" PRIx64, sysimgguid); if (group - && node->chrecord && node->chrecord->chassisnum) { - fprintf(f, "\t\t# Chassis %d", node->chrecord->chassisnum); + && node->chassis && node->chassis->chassisnum) { + fprintf(f, "\t\t# Chassis %d", node->chassis->chassisnum); if (chname) - fprintf(f, " (%s)", chname); - if 
(is_xsigo_tca(node->nodeguid) && node->ports->remoteport) - fprintf(f, " slot %d", node->ports->remoteport->portnum); + fprintf(f, " (%s)", clean_nodedesc(chname)); + if (ibnd_is_xsigo_tca(node->guid) + && node->ports[1] + && node->ports[1]->remoteport) + fprintf(f, " slot %d", node->ports[1]->remoteport->portnum); } fprintf(f, "\n"); } uint64_t -out_chassis(unsigned char chassisnum) +out_chassis(ibnd_fabric_t *fabric, int chassisnum) { uint64_t guid; fprintf(f, "\nChassis %d", chassisnum); - guid = get_chassis_guid(chassisnum); + guid = ibnd_get_chassis_guid(fabric, chassisnum); if (guid) fprintf(f, " (guid 0x%" PRIx64 ")", guid); fprintf(f, "\n"); @@ -532,29 +218,25 @@ out_chassis(unsigned char chassisnum) } void -out_switch(Node *node, int group, char *chname) +out_switch(ibnd_node_t *node, int group, char *chname) { char *str; + char str2[256]; char *nodename = NULL; out_ids(node, group, chname); - fprintf(f, "switchguid=0x%" PRIx64, node->nodeguid); - fprintf(f, "(%" PRIx64 ")", node->portguid); - /* Currently, only if Voltaire chassis */ - if (group - && node->chrecord && node->chrecord->chassisnum - && node->vendid == VTR_VENDOR_ID) { - str = get_chassis_type(node->chrecord->chassistype); + fprintf(f, "switchguid=0x%" PRIx64, node->guid); + fprintf(f, "(%" PRIx64 ")", mad_get_field64(node->info, 0, IB_NODE_PORT_GUID_F)); + if (group) { + str = ibnd_get_chassis_type(node); if (str) fprintf(f, "%s ", str); - str = get_chassis_slot(node->chrecord->chassisslot); + str = ibnd_get_chassis_slot_str(node, str2, 256); if (str) - fprintf(f, "%s ", str); - fprintf(f, "%d Chip %d", node->chrecord->slotnum, node->chrecord->anafanum); + fprintf(f, "%s", str); } - nodename = remap_node_name(node_name_map, node->nodeguid, - node->nodedesc); + nodename = remap_node_name(node_name_map, node->guid, node->nodedesc); fprintf(f, "\nSwitch\t%d %s\t\t# \"%s\" %s port 0 lid %d lmc %d\n", node->numports, node_name(node), @@ -566,20 +248,18 @@ out_switch(Node *node, int group, char 
*chname) } void -out_ca(Node *node, int group, char *chname) +out_ca(ibnd_node_t *node, int group, char *chname) { char *node_type; char *node_type2; - char *nodename = remap_node_name(node_name_map, node->nodeguid, - node->nodedesc); out_ids(node, group, chname); switch(node->type) { - case CA_NODE: + case IB_NODE_CA: node_type = "ca"; node_type2 = "Ca"; break; - case ROUTER_NODE: + case IB_NODE_ROUTER: node_type = "rt"; node_type2 = "Rt"; break; @@ -589,37 +269,41 @@ out_ca(Node *node, int group, char *chname) break; } - fprintf(f, "%sguid=0x%" PRIx64 "\n", node_type, node->nodeguid); + fprintf(f, "%sguid=0x%" PRIx64 "\n", node_type, node->guid); fprintf(f, "%s\t%d %s\t\t# \"%s\"", node_type2, node->numports, node_name(node), - nodename); - if (group && is_xsigo_hca(node->nodeguid)) + clean_nodedesc(node->nodedesc)); + if (group && ibnd_is_xsigo_hca(node->guid)) fprintf(f, " (scp)"); fprintf(f, "\n"); - - free(nodename); } +#define OUT_BUFFER_SIZE 16 static char * -out_ext_port(Port *port, int group) +out_ext_port(ibnd_port_t *port, int group) { - char *str = NULL; + static char mapping[OUT_BUFFER_SIZE]; - /* Currently, only if Voltaire chassis */ - if (group - && port->node->chrecord && port->node->vendid == VTR_VENDOR_ID) - str = portmapstring(port); + if (group && port->ext_portnum != 0) { + snprintf(mapping, OUT_BUFFER_SIZE, + "[ext %d]", port->ext_portnum); + return (mapping); + } - return (str); + return (NULL); } void -out_switch_port(Port *port, int group) +out_switch_port(ibnd_port_t *port, int group) { char *ext_port_str = NULL; char *rem_nodename = NULL; + uint32_t iwidth = mad_get_field(port->info, 0, + IB_PORT_LINK_WIDTH_ACTIVE_F); + uint32_t ispeed = mad_get_field(port->info, 0, + IB_PORT_LINK_SPEED_ACTIVE_F); - DEBUG("port %p:%d remoteport %p", port, port->portnum, port->remoteport); + DEBUG("port %p:%d remoteport %p\n", port, port->portnum, port->remoteport); fprintf(f, "[%d]", port->portnum); ext_port_str = out_ext_port(port, group); @@ -627,7 
+311,7 @@ out_switch_port(Port *port, int group) fprintf(f, "%s", ext_port_str); rem_nodename = remap_node_name(node_name_map, - port->remoteport->node->nodeguid, + port->remoteport->node->guid, port->remoteport->node->nodedesc); ext_port_str = out_ext_port(port->remoteport, group); @@ -635,17 +319,19 @@ out_switch_port(Port *port, int group) node_name(port->remoteport->node), port->remoteport->portnum, ext_port_str ? ext_port_str : ""); - if (port->remoteport->node->type != SWITCH_NODE) - fprintf(f, "(%" PRIx64 ") ", port->remoteport->portguid); + if (port->remoteport->node->type != IB_NODE_SWITCH) + fprintf(f, "(%" PRIx64 ") ", port->remoteport->guid); fprintf(f, "\t\t# \"%s\" lid %d %s%s", rem_nodename, - port->remoteport->node->type == SWITCH_NODE ? port->remoteport->node->smalid : port->remoteport->lid, - get_linkwidth_str(port->linkwidth), - get_linkspeed_str(port->linkspeed)); + port->remoteport->node->type == IB_NODE_SWITCH ? + port->remoteport->node->smalid : + port->remoteport->base_lid, + dump_linkwidth_compat(iwidth), + dump_linkspeed_compat(ispeed)); - if (is_xsigo_tca(port->remoteport->portguid)) + if (ibnd_is_xsigo_tca(port->remoteport->guid)) fprintf(f, " slot %d", port->portnum); - else if (is_xsigo_hca(port->remoteport->portguid)) + else if (ibnd_is_xsigo_hca(port->remoteport->guid)) fprintf(f, " (scp)"); fprintf(f, "\n"); @@ -653,281 +339,275 @@ out_switch_port(Port *port, int group) } void -out_ca_port(Port *port, int group) +out_ca_port(ibnd_port_t *port, int group) { char *str = NULL; char *rem_nodename = NULL; + uint32_t iwidth = mad_get_field(port->info, 0, + IB_PORT_LINK_WIDTH_ACTIVE_F); + uint32_t ispeed = mad_get_field(port->info, 0, + IB_PORT_LINK_SPEED_ACTIVE_F); fprintf(f, "[%d]", port->portnum); - if (port->node->type != SWITCH_NODE) - fprintf(f, "(%" PRIx64 ") ", port->portguid); + if (port->node->type != IB_NODE_SWITCH) + fprintf(f, "(%" PRIx64 ") ", port->guid); fprintf(f, "\t%s[%d]", node_name(port->remoteport->node), 
port->remoteport->portnum); str = out_ext_port(port->remoteport, group); if (str) fprintf(f, "%s", str); - if (port->remoteport->node->type != SWITCH_NODE) - fprintf(f, " (%" PRIx64 ") ", port->remoteport->portguid); + if (port->remoteport->node->type != IB_NODE_SWITCH) + fprintf(f, " (%" PRIx64 ") ", port->remoteport->guid); rem_nodename = remap_node_name(node_name_map, - port->remoteport->node->nodeguid, + port->remoteport->node->guid, port->remoteport->node->nodedesc); fprintf(f, "\t\t# lid %d lmc %d \"%s\" lid %d %s%s\n", - port->lid, port->lmc, rem_nodename, - port->remoteport->node->type == SWITCH_NODE ? port->remoteport->node->smalid : port->remoteport->lid, - get_linkwidth_str(port->linkwidth), - get_linkspeed_str(port->linkspeed)); + port->base_lid, port->lmc, rem_nodename, + port->remoteport->node->type == IB_NODE_SWITCH ? + port->remoteport->node->smalid : + port->remoteport->base_lid, + dump_linkwidth_compat(iwidth), + dump_linkspeed_compat(ispeed)); free(rem_nodename); } + +struct iter_user_data { + int group; + int skip_chassis_nodes; +}; + +static void +switch_iter_func(ibnd_node_t *node, void *iter_user_data) +{ + ibnd_port_t *port; + int p = 0; + struct iter_user_data *data = (struct iter_user_data *)iter_user_data; + + DEBUG("SWITCH: node %p\n", node); + + /* skip chassis based switches if flagged */ + if (data->skip_chassis_nodes && node->chassis && node->chassis->chassisnum) + return; + + out_switch(node, data->group, NULL); + for (p = 1; p <= node->numports; p++) { + port = node->ports[p]; + if (port && port->remoteport) + out_switch_port(port, data->group); + } +} + +static void +ca_iter_func(ibnd_node_t *node, void *iter_user_data) +{ + ibnd_port_t *port; + int p = 0; + struct iter_user_data *data = (struct iter_user_data *)iter_user_data; + + DEBUG("CA: node %p\n", node); + /* Now, skip chassis based CAs */ + if (data->group && node->chassis && node->chassis->chassisnum) + return; + out_ca(node, data->group, NULL); + + for (p = 1; p <= 
node->numports; p++) { + port = node->ports[p]; + if (port && port->remoteport) + out_ca_port(port, data->group); + } +} + +static void +router_iter_func(ibnd_node_t *node, void *iter_user_data) +{ + ibnd_port_t *port; + int p = 0; + struct iter_user_data *data = (struct iter_user_data *)iter_user_data; + + DEBUG("RT: node %p\n", node); + /* Now, skip chassis based RTs */ + if (data->group && node->chassis && + node->chassis->chassisnum) + return; + out_ca(node, data->group, NULL); + for (p = 1; p <= node->numports; p++) { + port = node->ports[p]; + if (port && port->remoteport) + out_ca_port(port, data->group); + } +} + int -dump_topology(int listtype, int group) +dump_topology(int group, ibnd_fabric_t *fabric) { - Node *node; - Port *port; - int i = 0, dist = 0; + ibnd_node_t *node; + ibnd_port_t *port; + int i = 0, p = 0; time_t t = time(0); uint64_t chguid; char *chname = NULL; + struct iter_user_data iter_user_data; - if (!listtype) { - fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t)); - fprintf(f, "# Max of %d hops discovered\n", maxhops_discovered); - fprintf(f, "# Initiated from node %016" PRIx64 " port %016" PRIx64 "\n", mynode->nodeguid, mynode->portguid); - } + fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t)); + fprintf(f, "# Max of %d hops discovered\n", fabric->maxhops_discovered); + fprintf(f, "# Initiated from node %016" PRIx64 " port %016" PRIx64 "\n", + fabric->from_node->guid, + mad_get_field64(fabric->from_node->info, 0, IB_NODE_PORT_GUID_F)); /* Make pass on switches */ - if (group && !listtype) { - ChassisList *ch = NULL; + if (group) { + ibnd_chassis_t *ch = NULL; /* Chassis based switches first */ - for (ch = chassis; ch; ch = ch->next) { + for (ch = fabric->chassis; ch; ch = ch->next) { int n = 0; if (!ch->chassisnum) continue; - chguid = out_chassis(ch->chassisnum); - if (chname) - free(chname); + chguid = out_chassis(fabric, ch->chassisnum); + chname = NULL; - if (is_xsigo_guid(chguid)) { - for (node = 
nodesdist[MAXHOPS]; node; node = node->dnext) { - if (!node->chrecord || - !node->chrecord->chassisnum) - continue; - - if (node->chrecord->chassisnum != ch->chassisnum) - continue; - - if (is_xsigo_hca(node->nodeguid)) { - chname = remap_node_name(node_name_map, - node->nodeguid, - node->nodedesc); - fprintf(f, "Hostname: %s\n", chname); +/** + * Will this work for Xsigo? + */ + if (ibnd_is_xsigo_guid(chguid)) { + for (node = ch->nodes; node; + node = node->next_chassis_node) { + if (ibnd_is_xsigo_hca(node->guid)) { + chname = node->nodedesc; + fprintf(f, "Hostname: %s\n", clean_nodedesc(node->nodedesc)); } } } fprintf(f, "\n# Spine Nodes"); - for (n = 1; n <= (SPINES_MAX_NUM+1); n++) { + for (n = 1; n <= SPINES_MAX_NUM; n++) { if (ch->spinenode[n]) { out_switch(ch->spinenode[n], group, chname); - for (port = ch->spinenode[n]->ports; port; port = port->next, i++) - if (port->remoteport) + for (p = 1; p <= ch->spinenode[n]->numports; p++) { + port = ch->spinenode[n]->ports[p]; + if (port && port->remoteport) out_switch_port(port, group); + } } } fprintf(f, "\n# Line Nodes"); - for (n = 1; n <= (LINES_MAX_NUM+1); n++) { + for (n = 1; n <= LINES_MAX_NUM; n++) { if (ch->linenode[n]) { out_switch(ch->linenode[n], group, chname); - for (port = ch->linenode[n]->ports; port; port = port->next, i++) - if (port->remoteport) + for (p = 1; p <= ch->linenode[n]->numports; p++) { + port = ch->linenode[n]->ports[p]; + if (port && port->remoteport) out_switch_port(port, group); + } } } fprintf(f, "\n# Chassis Switches"); - for (dist = 0; dist <= maxhops_discovered; dist++) { - - for (node = nodesdist[dist]; node; node = node->dnext) { - - /* Non Voltaire chassis */ - if (node->vendid == VTR_VENDOR_ID) - continue; - if (!node->chrecord || - !node->chrecord->chassisnum) - continue; - - if (node->chrecord->chassisnum != ch->chassisnum) - continue; - + for (node = ch->nodes; node; + node = node->next_chassis_node) { + if (node->type == IB_NODE_SWITCH) { out_switch(node, group, 
chname); - for (port = node->ports; port; port = port->next, i++) - if (port->remoteport) + for (p = 1; p <= node->numports; p++) { + port = node->ports[p]; + if (port && port->remoteport) out_switch_port(port, group); - + } } } fprintf(f, "\n# Chassis CAs"); - for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { - if (!node->chrecord || - !node->chrecord->chassisnum) - continue; - - if (node->chrecord->chassisnum != ch->chassisnum) - continue; - - out_ca(node, group, chname); - for (port = node->ports; port; port = port->next, i++) - if (port->remoteport) - out_ca_port(port, group); - + for (node = ch->nodes; node; + node = node->next_chassis_node) { + if (node->type == IB_NODE_CA) { + out_ca(node, group, chname); + for (p = 1; p <= node->numports; p++) { + port = node->ports[p]; + if (port && port->remoteport) + out_ca_port(port, group); + } + } } } - } else { - for (dist = 0; dist <= maxhops_discovered; dist++) { - - for (node = nodesdist[dist]; node; node = node->dnext) { - DEBUG("SWITCH: dist %d node %p", dist, node); - if (!listtype) - out_switch(node, group, chname); - else { - if (listtype & LIST_SWITCH_NODE) - list_node(node); - continue; - } - - for (port = node->ports; port; port = port->next, i++) - if (port->remoteport) - out_switch_port(port, group); - } - } + } else { /* !group */ + iter_user_data.group = group; + iter_user_data.skip_chassis_nodes = 0; + ibnd_iter_nodes_type(fabric, switch_iter_func, + IB_NODE_SWITCH, &iter_user_data); } - if (chname) - free(chname); chname = NULL; - if (group && !listtype) { + if (group) { + iter_user_data.group = group; + iter_user_data.skip_chassis_nodes = 1; fprintf(f, "\nNon-Chassis Nodes\n"); - for (dist = 0; dist <= maxhops_discovered; dist++) { - - for (node = nodesdist[dist]; node; node = node->dnext) { - - DEBUG("SWITCH: dist %d node %p", dist, node); - /* Now, skip chassis based switches */ - if (node->chrecord && - node->chrecord->chassisnum) - continue; - out_switch(node, group, chname); - - for 
(port = node->ports; port; port = port->next, i++) - if (port->remoteport) - out_switch_port(port, group); - } - - } - + ibnd_iter_nodes_type(fabric, switch_iter_func, + IB_NODE_SWITCH, &iter_user_data); } + iter_user_data.group = group; + iter_user_data.skip_chassis_nodes = 0; /* Make pass on CAs */ - for (node = nodesdist[MAXHOPS]; node; node = node->dnext) { + ibnd_iter_nodes_type(fabric, ca_iter_func, IB_NODE_CA, + &iter_user_data); - DEBUG("CA: dist %d node %p", dist, node); - if (!listtype) { - /* Now, skip chassis based CAs */ - if (group && node->chrecord && - node->chrecord->chassisnum) - continue; - out_ca(node, group, chname); - } else { - if (((listtype & LIST_CA_NODE) && (node->type == CA_NODE)) || - ((listtype & LIST_ROUTER_NODE) && (node->type == ROUTER_NODE))) - list_node(node); - continue; - } - - for (port = node->ports; port; port = port->next, i++) - if (port->remoteport) - out_ca_port(port, group); - } - - if (chname) - free(chname); + /* make pass on routers */ + ibnd_iter_nodes_type(fabric, router_iter_func, IB_NODE_ROUTER, + &iter_user_data); return i; } -void dump_ports_report () +void dump_ports_report (ibnd_node_t *node, void *user_data) { - int b, n = 0, p; - Node *node; - Port *port; - - /* - * If switch and LID == 0, search of other switch ports with - * valid LID and assign it to all ports of that switch - */ - for (b = 0; b <= MAXHOPS; b++) - for (node = nodesdist[b]; node; node = node->dnext) - if (node->type == SWITCH_NODE) { - int swlid = 0; - for (p = 0, port = node->ports; - p < node->numports && port && !swlid; - port = port->next) - if (port->lid != 0) - swlid = port->lid; - for (p = 0, port = node->ports; - p < node->numports && port; - port = port->next) - port->lid = swlid; - } - - for (b = 0; b <= MAXHOPS; b++) - for (node = nodesdist[b]; node; node = node->dnext) { - for (p = 0, port = node->ports; - p < node->numports && port; - p++, port = port->next) { - fprintf(stdout, - "%2s %5d %2d 0x%016" PRIx64 " %s %s", - 
node_type_str2(port->node), port->lid, - port->portnum, - port->portguid, - get_linkwidth_str(port->linkwidth), - get_linkspeed_str(port->linkspeed)); - if (port->remoteport) - fprintf(stdout, - " - %2s %5d %2d 0x%016" PRIx64 - " ( '%s' - '%s' )\n", - node_type_str2(port->remoteport->node), - port->remoteport->lid, - port->remoteport->portnum, - port->remoteport->portguid, - remap_node_name(node_name_map, - port->node->nodeguid, - port->node->nodedesc), - remap_node_name(node_name_map, - port->remoteport->node->nodeguid, - port->remoteport->node->nodedesc)); - else - fprintf(stdout, "%36s'%s'\n", "", - remap_node_name(node_name_map, - port->node->nodeguid, - port->node->nodedesc)); - - } - n++; - } + int p = 0; + ibnd_port_t *port = NULL; + + /* for each port */ + for (p = node->numports, port = node->ports[p]; + p > 0; + port = node->ports[--p]) { + uint32_t iwidth, ispeed; + if (port == NULL) + continue; + iwidth = mad_get_field(port->info, 0, IB_PORT_LINK_WIDTH_ACTIVE_F); + ispeed = mad_get_field(port->info, 0, IB_PORT_LINK_SPEED_ACTIVE_F); + fprintf(stdout, + "%2s %5d %2d 0x%016" PRIx64 " %s %s", + ports_nt_str_compat(node), + node->type == IB_NODE_SWITCH ? + node->smalid : port->base_lid, + port->portnum, + port->guid, + dump_linkwidth_compat(iwidth), + dump_linkspeed_compat(ispeed)); + if (port->remoteport) + fprintf(stdout, + " - %2s %5d %2d 0x%016" PRIx64 + " ( '%s' - '%s' )\n", + ports_nt_str_compat(port->remoteport->node), + port->remoteport->node->type == IB_NODE_SWITCH ? 
+ port->remoteport->node->smalid : + port->remoteport->base_lid, + port->remoteport->portnum, + port->remoteport->guid, + port->node->nodedesc, + port->remoteport->node->nodedesc); + else + fprintf(stdout, "%36s'%s'\n", "", + port->node->nodedesc); + } } static int list, group, ports_report; @@ -939,7 +619,7 @@ static int process_opt(void *context, int ch, char *optarg) node_name_map_file = strdup(optarg); break; case 's': - dumplevel = 1; + ibnd_show_progress(1); break; case 'l': list = LIST_CA_NODE | LIST_SWITCH_NODE | LIST_ROUTER_NODE; @@ -968,8 +648,7 @@ static int process_opt(void *context, int ch, char *optarg) int main(int argc, char **argv) { - int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; - ib_portid_t my_portid = {0}; + ibnd_fabric_t *fabric = NULL; const struct ibdiag_opt opts[] = { { "show", 's', 0, NULL, "show more information" }, @@ -996,29 +675,28 @@ int main(int argc, char **argv) timeout = ibd_timeout; if (ibverbose) - dumplevel = 1; + ibnd_debug(1); if (argc && !(f = fopen(argv[0], "w"))) IBERROR("can't open file %s for writing", argv[0]); - srcport = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2); - if (!srcport) - IBERROR("Failed to open '%s' port '%d'", ibd_ca, ibd_ca_port); - node_name_map = open_node_name_map(node_name_map_file); - if (discover(&my_portid) < 0) - IBERROR("discover"); - - if (group) - chassis = group_nodes(); + if ((fabric = ibnd_discover_fabric(ibd_ca, ibd_ca_port, ibd_timeout, NULL, -1)) == NULL) { + fprintf(stderr, "discover failed\n"); + exit(1); + } if (ports_report) - dump_ports_report(); + ibnd_iter_nodes(fabric, + dump_ports_report, + NULL); + else if (list) + list_nodes(fabric, list); else - dump_topology(list, group); + dump_topology(group, fabric); + ibnd_destroy_fabric(fabric); close_node_name_map(node_name_map); - mad_rpc_close_port(srcport); exit(0); } -- 1.5.4.5 From weiny2 at llnl.gov Thu Apr 23 13:31:05 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 23 Apr 2009 13:31:05 -0700 
Subject: [ofa-general] [PATCH 5/8] change ibnd_discover_fabric to receive ibmad_port Message-ID: <20090423133105.f6a1215c.weiny2@llnl.gov> From: Ira Weiny Date: Thu, 23 Apr 2009 10:57:10 -0700 Subject: [PATCH] change ibnd_discover_fabric to receive ibmad_port In order to allow ibmad_port to be opened with additional classes, libibnetdisc should accept an ibmad_port as a parameter. The library will error out if the classes it needs are not opened. Signed-off-by: Ira Weiny --- .../libibnetdisc/include/infiniband/ibnetdisc.h | 6 +- .../libibnetdisc/man/ibnd_discover_fabric.3 | 21 ++++++-- infiniband-diags/libibnetdisc/src/ibnetdisc.c | 54 +++++++++----------- infiniband-diags/libibnetdisc/src/internal.h | 1 - infiniband-diags/libibnetdisc/test/testleaks.c | 20 ++++++-- 5 files changed, 60 insertions(+), 42 deletions(-) diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h index a882994..7eaca24 100644 --- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h @@ -124,6 +124,7 @@ typedef struct chassis { * Main fabric object which is returned and represents the data discovered */ typedef struct ib_fabric { + struct ibmad_port *ibmad_port; /* the node the discover was initiated from * "from" parameter in ibnd_discover_fabric * or by default the node you ar running on @@ -143,11 +144,10 @@ typedef struct ib_fabric { void ibnd_debug(int i); void ibnd_show_progress(int i); -ibnd_fabric_t *ibnd_discover_fabric(char *dev_name, int dev_port, +ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms, ib_portid_t *from, int hops); /** - * dev_name: (required) local device name to use to access the fabric - * dev_port: (required) local device port to use to access the fabric + * ibmad_port: (required) ibmad_port object from libibmad * timeout_ms: (required) gives the timeout for a _SINGLE_ query on * the 
fabric. So if there are multiple nodes not * responding this may result in a lengthy delay. diff --git a/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 b/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 index 44d8c65..c832c11 100644 --- a/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 +++ b/infiniband-diags/libibnetdisc/man/ibnd_discover_fabric.3 @@ -5,7 +5,7 @@ ibnd_discover_fabric, ibnd_destroy_fabric, ibnd_debug ibnd_show_progress \- init .nf .B #include .sp -.BI "ibnd_fabric_t *ibnd_discover_fabric(char *dev_name, int dev_port, int timeout_ms, ib_portid_t *from, int hops)" +.BI "ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms, ib_portid_t *from, int hops)" .BI "void ibnd_destroy_fabric(ibnd_fabric_t *fabric)" .BI "void ibnd_debug(int i)" .BI "void ibnd_show_progress(int i)" @@ -13,7 +13,10 @@ ibnd_discover_fabric, ibnd_destroy_fabric, ibnd_debug ibnd_show_progress \- init .SH "DESCRIPTION" .B ibnd_discover_fabric() -Discover the fabric connected to the port specified by dev_name and dev_port, using a timeout specified. The "from" and "hops" parameters are optional and allow one to scan part of a fabric by specifying a node "from" and a number of hops away from that node to scan, "hops". This gives the user a "sub-fabric" which is "centered" anywhere they chose. +Discover the fabric connected to the port specified by ibmad_port, using the specified timeout. The "from" and "hops" parameters are optional and allow one to scan part of a fabric by specifying a node "from" and a number of hops away from that node to scan, "hops". This gives the user a "sub-fabric" which is "centered" anywhere they choose. + +ibmad_port must be opened with at least IB_SMI_CLASS and IB_SMI_DIRECT_CLASS +classes for ibnd_discover_fabric to work. .B ibnd_destroy_fabric() free all memory and resources associated with the fabric. @@ -36,13 +39,23 @@ NONE .B Discover the entire fabric connected to device "mthca0", port 1. 
- ibnd_discover_fabric("mthca0", 1, 100, NULL, 0); + int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; + struct ibmad_port *ibmad_port = mad_rpc_open_port(ca, ca_port, mgmt_classes, 2); + ibnd_fabric_t *fabric = ibnd_discover_fabric(ibmad_port, 100, NULL, 0); + ... + ibnd_destroy_fabric(fabric); + mad_rpc_close_port(ibmad_port); .B Discover only a single node and those nodes connected to it. + ... str2drpath(&(port_id.drpath), from, 0, 0); + ... + ibnd_discover_fabric(ibmad_port, 100, &port_id, 1); + ... - ibnd_discover_fabric("mthca0", 1, 100, &port_id, 1); +.SH "SEE ALSO" + libibmad, mad_rpc_open_port .SH "AUTHORS" .TP diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index 479bae7..410e2dd 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -79,7 +79,7 @@ get_port_info(struct ibnd_fabric *fabric, struct ibnd_port *port, IB_PORT_LINK_SPEED_ACTIVE_F); if (!smp_query_via(port->port.info, portid, IB_ATTR_PORT_INFO, portnum, timeout_ms, - fabric->ibmad_port)) + fabric->fabric.ibmad_port)) return -1; decode_port_info(&(port->port)); @@ -100,7 +100,7 @@ static int query_node_info(struct ibnd_fabric *fabric, struct ibnd_node *node, ib_portid_t *portid) { if (!smp_query_via(&(node->node.info), portid, IB_ATTR_NODE_INFO, 0, timeout_ms, - fabric->ibmad_port)) + fabric->fabric.ibmad_port)) return -1; /* decode just a couple of fields for quicker reference. 
*/ @@ -130,11 +130,11 @@ query_node(struct ibnd_fabric *fabric, struct ibnd_node *inode, port->guid = mad_get_field64(node->info, 0, IB_NODE_PORT_GUID_F); if (!smp_query_via(nd, portid, IB_ATTR_NODE_DESC, 0, timeout_ms, - fabric->ibmad_port)) + fabric->fabric.ibmad_port)) return -1; if (!smp_query_via(port->info, portid, IB_ATTR_PORT_INFO, 0, timeout_ms, - fabric->ibmad_port)) + fabric->fabric.ibmad_port)) return -1; decode_port_info(port); @@ -146,7 +146,7 @@ query_node(struct ibnd_fabric *fabric, struct ibnd_node *inode, /* after we have the sma information find out the real PortInfo for this port */ if (!smp_query_via(port->info, portid, IB_ATTR_PORT_INFO, port->portnum, timeout_ms, - fabric->ibmad_port)) + fabric->fabric.ibmad_port)) return -1; decode_port_info(port); @@ -154,7 +154,7 @@ query_node(struct ibnd_fabric *fabric, struct ibnd_node *inode, port->lmc = node->smalmc; if (!smp_query_via(node->switchinfo, portid, IB_ATTR_SWITCH_INFO, 0, timeout_ms, - fabric->ibmad_port)) + fabric->fabric.ibmad_port)) node->smaenhsp0 = 0; /* assume base SP0 */ else mad_decode_field(node->switchinfo, IB_SW_ENHANCED_PORT0_F, &node->smaenhsp0); @@ -241,7 +241,7 @@ ibnd_update_node(ibnd_node_t *node) return (NULL); if (!smp_query_via(nd, &(n->node.path_portid), IB_ATTR_NODE_DESC, 0, timeout_ms, - f->ibmad_port)) + f->fabric.ibmad_port)) return (NULL); /* update all the port info's */ @@ -253,14 +253,14 @@ ibnd_update_node(ibnd_node_t *node) goto done; if (!smp_query_via(portinfo_port0, &(n->node.path_portid), IB_ATTR_PORT_INFO, 0, timeout_ms, - f->ibmad_port)) + f->fabric.ibmad_port)) return (NULL); n->node.smalid = mad_get_field(portinfo_port0, 0, IB_PORT_LID_F); n->node.smalmc = mad_get_field(portinfo_port0, 0, IB_PORT_LMC_F); if (!smp_query_via(node->switchinfo, &(n->node.path_portid), IB_ATTR_SWITCH_INFO, 0, timeout_ms, - f->ibmad_port)) + f->fabric.ibmad_port)) node->smaenhsp0 = 0; /* assume base SP0 */ else mad_decode_field(node->switchinfo, IB_SW_ENHANCED_PORT0_F, 
&n->node.smaenhsp0); @@ -476,17 +476,8 @@ get_remote_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_ return 0; } -static void * -ibnd_init_port(char *dev_name, int dev_port) -{ - int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; - - /* Crank up the mad lib */ - return (mad_rpc_open_port(dev_name, dev_port, mgmt_classes, 2)); -} - ibnd_fabric_t * -ibnd_discover_fabric(char *dev_name, int dev_port, int timeout_ms, +ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms, ib_portid_t *from, int hops) { struct ibnd_fabric *fabric = NULL; @@ -500,15 +491,27 @@ ibnd_discover_fabric(char *dev_name, int dev_port, int timeout_ms, ib_portid_t *path; int max_hops = MAXHOPS-1; /* default find everything */ + if (!ibmad_port) { + IBPANIC("ibmad_port must be specified to " + "ibnd_discover_fabric\n"); + return (NULL); + } + if (mad_rpc_class_agent(ibmad_port, IB_SMI_CLASS) == -1 + || + mad_rpc_class_agent(ibmad_port, IB_SMI_DIRECT_CLASS) == -1) { + IBPANIC("ibmad_port must be opened with " + "IB_SMI_CLASS && IB_SMI_DIRECT_CLASS\n"); + return (NULL); + } + /* if not everything how much? 
*/ if (hops >= 0) { max_hops = hops; } /* If not specified start from "my" port */ - if (!from) { + if (!from) from = &my_portid; - } fabric = malloc(sizeof(*fabric)); @@ -519,12 +522,7 @@ ibnd_discover_fabric(char *dev_name, int dev_port, int timeout_ms, memset(fabric, 0, sizeof(*fabric)); - fabric->ibmad_port = ibnd_init_port(dev_name, dev_port); - if (!fabric->ibmad_port) { - IBPANIC("OOM: failed to open \"%s\" port %d\n", - dev_name, dev_port); - goto error; - } + fabric->fabric.ibmad_port = ibmad_port; IBND_DEBUG("from %s\n", portid2str(from)); @@ -633,8 +631,6 @@ ibnd_destroy_fabric(ibnd_fabric_t *fabric) node = next; } } - if (f->ibmad_port) - mad_rpc_close_port(f->ibmad_port); free(f); } diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h index afed25e..4e6bb18 100644 --- a/infiniband-diags/libibnetdisc/src/internal.h +++ b/infiniband-diags/libibnetdisc/src/internal.h @@ -79,7 +79,6 @@ struct ibnd_fabric { ibnd_fabric_t fabric; /* internal use only */ - void *ibmad_port; struct ibnd_node *nodestbl[HTSZ]; struct ibnd_port *portstbl[HTSZ]; struct ibnd_node *nodesdist[MAXHOPS+1]; diff --git a/infiniband-diags/libibnetdisc/test/testleaks.c b/infiniband-diags/libibnetdisc/test/testleaks.c index 1fabaac..0d009c3 100644 --- a/infiniband-diags/libibnetdisc/test/testleaks.c +++ b/infiniband-diags/libibnetdisc/test/testleaks.c @@ -84,6 +84,7 @@ usage(void) int main(int argc, char **argv) { + int rc = 0; char *ca = 0; int ca_port = 0; ibnd_fabric_t *fabric = NULL; @@ -94,6 +95,9 @@ main(int argc, char **argv) ib_portid_t port_id; int iters = -1; + struct ibmad_port *ibmad_port; + int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; + static char const str_opts[] = "S:D:n:C:P:t:shuf:i:"; static const struct option long_opts[] = { { "S", 1, 0, 'S'}, @@ -155,25 +159,31 @@ main(int argc, char **argv) argc -= optind; argv += optind; + ibmad_port = mad_rpc_open_port(ca, ca_port, mgmt_classes, 2); + while (iters == 
-1 || iters-- > 0) { if (from) { /* only scan part of the fabric */ str2drpath(&(port_id.drpath), from, 0, 0); - if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, + if ((fabric = ibnd_discover_fabric(ibmad_port, timeout_ms, &port_id, hops)) == NULL) { fprintf(stderr, "discover failed\n"); - exit(1); + rc = 1; + goto close_port; } guid = 0; } else { - if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, NULL, -1)) == NULL) { + if ((fabric = ibnd_discover_fabric(ibmad_port, timeout_ms, NULL, -1)) == NULL) { fprintf(stderr, "discover failed\n"); - exit(1); + rc = 1; + goto close_port; } } ibnd_destroy_fabric(fabric); } - exit(0); +close_port: + mad_rpc_close_port(ibmad_port); + exit(rc); } -- 1.5.4.5 From weiny2 at llnl.gov Thu Apr 23 13:31:09 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 23 Apr 2009 13:31:09 -0700 Subject: [ofa-general] [PATCH 6/8] Convert ibnetdiscover and iblinkinfo to use the new interface to libibnetdisc Message-ID: <20090423133109.e227975d.weiny2@llnl.gov> From: Ira Weiny Date: Thu, 23 Apr 2009 10:57:10 -0700 Subject: [PATCH] Convert ibnetdiscover and iblinkinfo to use the new interface to libibnetdisc Signed-off-by: Ira Weiny --- infiniband-diags/src/iblinkinfo.c | 24 +++++++++++++++++++----- infiniband-diags/src/ibnetdiscover.c | 14 ++++++++++---- 2 files changed, 29 insertions(+), 9 deletions(-) diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c index 39de7a2..82c2ce8 100644 --- a/infiniband-diags/src/iblinkinfo.c +++ b/infiniband-diags/src/iblinkinfo.c @@ -257,6 +257,7 @@ usage(void) int main(int argc, char **argv) { + int rc = 0; char *ca = 0; int ca_port = 0; ibnd_fabric_t *fabric = NULL; @@ -266,6 +267,9 @@ main(int argc, char **argv) int hops = 0; ib_portid_t port_id; + struct ibmad_port *ibmad_port; + int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; + static char const str_opts[] = "S:D:n:C:P:t:sldgphuf:R"; static const struct option long_opts[] = { { "S", 1, 0, 'S'}, @@ 
-354,20 +358,28 @@ main(int argc, char **argv) if (argc && !(f = fopen(argv[0], "w"))) fprintf(stderr, "can't open file %s for writing", argv[0]); + ibmad_port = mad_rpc_open_port(ca, ca_port, mgmt_classes, 2); + if (!ibmad_port) { + fprintf(stderr, "Failed to open %s port %d", ca, ca_port); + exit(1); + } + node_name_map = open_node_name_map(node_name_map_file); if (from) { /* only scan part of the fabric */ str2drpath(&(port_id.drpath), from, 0, 0); - if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, &port_id, hops)) == NULL) { + if ((fabric = ibnd_discover_fabric(ibmad_port, timeout_ms, &port_id, hops)) == NULL) { fprintf(stderr, "discover failed\n"); - exit(1); + rc = 1; + goto close_port; } guid = 0; } else { - if ((fabric = ibnd_discover_fabric(ca, ca_port, timeout_ms, NULL, -1)) == NULL) { + if ((fabric = ibnd_discover_fabric(ibmad_port, timeout_ms, NULL, -1)) == NULL) { fprintf(stderr, "discover failed\n"); - exit(1); + rc = 1; + goto close_port; } } @@ -383,6 +395,8 @@ main(int argc, char **argv) ibnd_destroy_fabric(fabric); +close_port: close_node_name_map(node_name_map); - exit(0); + mad_rpc_close_port(ibmad_port); + exit(rc); } diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index 99750f0..69fc5fb 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -650,6 +650,9 @@ int main(int argc, char **argv) { ibnd_fabric_t *fabric = NULL; + struct ibmad_port *ibmad_port; + int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; + const struct ibdiag_opt opts[] = { { "show", 's', 0, NULL, "show more information" }, { "list", 'l', 0, NULL, "list of connected nodes" }, @@ -677,15 +680,17 @@ int main(int argc, char **argv) if (ibverbose) ibnd_debug(1); + ibmad_port = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 2); + if (!ibmad_port) + IBERROR("Failed to open %s port %d", ibd_ca, ibd_ca_port); + if (argc && !(f = fopen(argv[0], "w"))) IBERROR("can't open file %s 
for writing", argv[0]); node_name_map = open_node_name_map(node_name_map_file); - if ((fabric = ibnd_discover_fabric(ibd_ca, ibd_ca_port, ibd_timeout, NULL, -1)) == NULL) { - fprintf(stderr, "discover failed\n"); - exit(1); - } + if ((fabric = ibnd_discover_fabric(ibmad_port, ibd_timeout, NULL, -1)) == NULL) + IBERROR("discover failed\n"); if (ports_report) ibnd_iter_nodes(fabric, @@ -698,5 +703,6 @@ int main(int argc, char **argv) ibnd_destroy_fabric(fabric); close_node_name_map(node_name_map); + mad_rpc_close_port(ibmad_port); exit(0); } -- 1.5.4.5 From weiny2 at llnl.gov Thu Apr 23 13:31:15 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 23 Apr 2009 13:31:15 -0700 Subject: [ofa-general] [PATCH 7/8] Add mad_field_name function Message-ID: <20090423133115.385e4e1b.weiny2@llnl.gov> From: Ira Weiny Date: Thu, 23 Apr 2009 10:57:10 -0700 Subject: [PATCH] Add mad_field_name function returns the "name" of the field specified Signed-off-by: Ira Weiny --- libibmad/include/infiniband/mad.h | 1 + libibmad/src/fields.c | 5 +++++ libibmad/src/libibmad.map | 1 + 3 files changed, 7 insertions(+), 0 deletions(-) diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index b8290a7..b6f4b60 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -722,6 +722,7 @@ MAD_EXPORT void mad_encode_field(uint8_t * buf, enum MAD_FIELDS field, void *val MAD_EXPORT int mad_print_field(enum MAD_FIELDS field, const char *name, void *val); MAD_EXPORT char *mad_dump_field(enum MAD_FIELDS field, char *buf, int bufsz, void *val); MAD_EXPORT char *mad_dump_val(enum MAD_FIELDS field, char *buf, int bufsz, void *val); +MAD_EXPORT const char *mad_field_name(enum MAD_FIELDS field); /* mad.c */ MAD_EXPORT void *mad_encode(void *buf, ib_rpc_t * rpc, ib_dr_path_t * drpath, diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c index 60faf73..e6cd1a1 100644 --- a/libibmad/src/fields.c +++ b/libibmad/src/fields.c @@ -686,3 +686,8 @@ char 
*mad_dump_val(enum MAD_FIELDS field, char *buf, int bufsz, void *val) return 0; return _mad_dump_val(ib_mad_f + field, buf, bufsz, val); } + +const char *mad_field_name(enum MAD_FIELDS field) +{ + return (ib_mad_f[field].name); +} diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map index 4306dbc..6b77784 100644 --- a/libibmad/src/libibmad.map +++ b/libibmad/src/libibmad.map @@ -102,5 +102,6 @@ IBMAD_1.3 { ib_resolve_guid_via; ib_resolve_portid_str_via; ib_resolve_self_via; + mad_field_name; local: *; }; -- 1.5.4.5 From weiny2 at llnl.gov Thu Apr 23 13:31:20 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 23 Apr 2009 13:31:20 -0700 Subject: [ofa-general] [PATCH 8/8] Convert ibqueryerrors.pl to C and use new ibnetdisc library. Message-ID: <20090423133120.acf0af63.weiny2@llnl.gov> From: Ira Weiny Date: Thu, 23 Apr 2009 10:57:10 -0700 Subject: [PATCH] Convert ibqueryerrors.pl to C and use new ibnetdisc library. Signed-off-by: Ira Weiny --- infiniband-diags/Makefile.am | 5 +- infiniband-diags/configure.in | 1 + infiniband-diags/scripts/ibqueryerrors.pl | 230 ------------- infiniband-diags/scripts/ibqueryerrors.pl.in | 40 +++ infiniband-diags/src/ibqueryerrors.c | 469 ++++++++++++++++++++++++++ 5 files changed, 514 insertions(+), 231 deletions(-) delete mode 100755 infiniband-diags/scripts/ibqueryerrors.pl create mode 100755 infiniband-diags/scripts/ibqueryerrors.pl.in create mode 100644 infiniband-diags/src/ibqueryerrors.c diff --git a/infiniband-diags/Makefile.am b/infiniband-diags/Makefile.am index bebb35e..a2eabd7 100644 --- a/infiniband-diags/Makefile.am +++ b/infiniband-diags/Makefile.am @@ -12,7 +12,8 @@ endif sbin_PROGRAMS = src/ibaddr src/ibnetdiscover src/ibping src/ibportstate \ src/ibroute src/ibstat src/ibsysstat src/ibtracert \ src/perfquery src/sminfo src/smpdump src/smpquery \ - src/saquery src/vendstat src/iblinkinfo + src/saquery src/vendstat src/iblinkinfo \ + src/ibqueryerrors if ENABLE_TEST_UTILS sbin_PROGRAMS += src/ibsendtrap 
src/mcm_rereg_test @@ -59,6 +60,8 @@ src_vendstat_SOURCES = src/vendstat.c src_mcm_rereg_test_SOURCES = src/mcm_rereg_test.c src_iblinkinfo_SOURCES = src/iblinkinfo.c src_iblinkinfo_LDFLAGS = -L$(top_builddir)/libibnetdisc -libnetdisc +src_ibqueryerrors_SOURCES = src/ibqueryerrors.c +src_ibqueryerrors_LDFLAGS = -L$(top_builddir)/libibnetdisc -libnetdisc man_MANS = man/ibaddr.8 man/ibcheckerrors.8 man/ibcheckerrs.8 \ man/ibchecknet.8 man/ibchecknode.8 man/ibcheckport.8 \ diff --git a/infiniband-diags/configure.in b/infiniband-diags/configure.in index 4516dfa..ae492b8 100644 --- a/infiniband-diags/configure.in +++ b/infiniband-diags/configure.in @@ -167,6 +167,7 @@ AC_CONFIG_FILES([\ scripts/ibswitches \ scripts/ibrouters \ scripts/iblinkinfo.pl \ + scripts/ibqueryerrors.pl \ libibnetdisc/Makefile ]) AC_OUTPUT diff --git a/infiniband-diags/scripts/ibqueryerrors.pl b/infiniband-diags/scripts/ibqueryerrors.pl deleted file mode 100755 index 99adac7..0000000 --- a/infiniband-diags/scripts/ibqueryerrors.pl +++ /dev/null @@ -1,230 +0,0 @@ -#!/usr/bin/perl -# -# Copyright (c) 2008 Voltaire, Inc. All rights reserved. -# Copyright (c) 2006 The Regents of the University of California. -# -# Produced at Lawrence Livermore National Laboratory. -# Written by Ira Weiny . -# -# This software is available to you under a choice of one of two -# licenses. You may choose to be licensed under the terms of the GNU -# General Public License (GPL) Version 2, available from the file -# COPYING in the main directory of this source tree, or the -# OpenIB.org BSD license below: -# -# Redistribution and use in source and binary forms, with or -# without modification, are permitted provided that the following -# conditions are met: -# -# - Redistributions of source code must retain the above -# copyright notice, this list of conditions and the following -# disclaimer. 
-# -# - Redistributions in binary form must reproduce the above -# copyright notice, this list of conditions and the following -# disclaimer in the documentation and/or other materials -# provided with the distribution. -# -# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, -# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF -# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND -# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS -# BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN -# ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN -# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -# SOFTWARE. -# - -use strict; - -use Getopt::Std; -use IBswcountlimits; - -my $print_action = "no"; -my $report_port_info = undef; -my $single_switch = undef; -my $include_data_counters = undef; -my $cache_file = ""; -my $switch_found = "no"; - -# ========================================================================= -# -sub report_counts -{ - my $addr = $_[0]; - my $port = $_[1]; - my $ca_name = $_[2]; - my $ca_port = $_[3]; - my $extra_params = get_ca_name_port_param_string($ca_name, $ca_port); - - if (any_counts()) { - print(" GUID $addr port $port:"); - check_counters($print_action); - if ($include_data_counters) { - check_data_counters($print_action); - } - print("\n"); - - if ($report_port_info) { - my $lid = ""; - my $speed = ""; - my $width = ""; - my $data = `smpquery $extra_params -G portinfo $addr $port`; - my @lines = split("\n", $data); - foreach my $line (@lines) { - if ($line =~ /^# Port info: Lid (\w+) port.*/) { $lid = $1; } - if ($line =~ /^LinkSpeedActive:\.+(.*)/) { $speed = $1; } - if ($line =~ /^LinkWidthActive:\.+(.*)/) { $width = $1; } - } - my $hr = $IBswcountlimits::link_ends{"$addr"}{$port}; - if ($hr) { - printf( -" Link info: %6s %4s[%2s] ==(%3s %s)==> %18s %4s[%2s] \"%s\"\n", - $lid, $port, - $hr->{loc_ext_port}, $width, - $speed, $hr->{rem_guid}, - 
$hr->{rem_port}, $hr->{rem_ext_port}, - $hr->{rem_desc} - ); - } else { - printf( -" Link info: %6s %4s[ ] ==(%3s %s)==> (Disconnected)\n", - $lid, $port, $width, $speed); - } - } - } -} - -# ========================================================================= -# use perfquery to get the counters. -sub get_counts -{ - my $addr = $_[0]; - my $port = $_[1]; - my $ca_name = $_[2]; - my $ca_port = $_[3]; - my $extra_params = get_ca_name_port_param_string($ca_name, $ca_port); - - my $data = `perfquery $extra_params -G $addr $port` || - die "'perfquery $extra_params -G $addr $port' FAILED.\n"; - my @lines = split("\n", $data); - foreach my $line (@lines) { - foreach my $count (@IBswcountlimits::counters) { - if ($line =~ /^$count:\.+(\d+)/) { - $IBswcountlimits::cur_counts{$count} = $1; - } - } - } -} - -# ========================================================================= -# -my %switches = (); - -sub get_switches -{ - my $data = `ibswitches $cache_file` || - die "'ibswitches $cache_file' failed.\n"; - my @lines = split("\n", $data); - foreach my $line (@lines) { - if ($line =~ /^Switch\s+:\s+(\w+)\s+ports\s+(\d+)\s+.*/) { - $switches{$1} = $2; - } - } -} - -# ========================================================================= -# -sub usage_and_exit -{ - my $prog = $_[0]; - print -"Usage: $prog [-a -c -r -R -s -S -D -d -C -P ]\n"; - print " Report counters on all switches in subnet\n"; - print " -a Report an action to take\n"; - print " -c suppress some of the common counters\n"; - print " -r report port configuration information\n"; - print " -R Recalculate ibnetdiscover information\n"; - print " -s suppress errors listed\n"; - print -" -D output only the switch specified by direct route path\n"; - print " -S query only (hex format)\n"; - print " -d include the data counters in the output\n"; - print " -C use selected Channel Adaptor name for queries\n"; - print " -P use selected channel adaptor port for queries\n"; - exit 2; -} - -my $argv0 = 
`basename $0`; -my $regenerate_map = undef; -my $single_switch = undef; -my $direct_route = undef; -my $ca_name = ""; -my $ca_port = ""; - -chomp $argv0; -if (!getopts("has:crRS:D:dC:P:")) { usage_and_exit $argv0; } -if (defined $Getopt::Std::opt_h) { usage_and_exit $argv0; } -if (defined $Getopt::Std::opt_a) { $print_action = "yes"; } -if (defined $Getopt::Std::opt_s) { - @IBswcountlimits::suppress_errors = split(",", $Getopt::Std::opt_s); -} -if (defined $Getopt::Std::opt_c) { - @IBswcountlimits::suppress_errors = split(",", "RcvSwRelayErrors"); -} -if (defined $Getopt::Std::opt_r) { $report_port_info = $Getopt::Std::opt_r; } -if (defined $Getopt::Std::opt_R) { $regenerate_map = $Getopt::Std::opt_R; } -if (defined $Getopt::Std::opt_D) { $direct_route = $Getopt::Std::opt_D; } -if (defined $Getopt::Std::opt_S) { - $single_switch = format_guid($Getopt::Std::opt_S); -} -if (defined $Getopt::Std::opt_d) { - $include_data_counters = $Getopt::Std::opt_d; -} -if (defined $Getopt::Std::opt_C) { $ca_name = $Getopt::Std::opt_C; } -if (defined $Getopt::Std::opt_P) { $ca_port = $Getopt::Std::opt_P; } - -$cache_file = get_cache_file($ca_name, $ca_port); - -sub main -{ - if (@IBswcountlimits::suppress_errors) { - my $msg = join(",", @IBswcountlimits::suppress_errors); - print "Suppressing: $msg\n"; - } - get_link_ends($regenerate_map, $ca_name, $ca_port); - get_switches; - if (defined($direct_route)) { - # convert DR to guid, then use original single_switch option - $single_switch = convert_dr_to_guid($direct_route); - if (!defined($single_switch) || !is_switch($single_switch)) { - printf("The direct route (%s) does not map to a switch.\n", - $direct_route); - return; - } - } - foreach my $sw_addr (keys %switches) { - if ($single_switch && $sw_addr ne "$single_switch") { - next; - } else { - $switch_found = "yes"; - } - - my $switch_prompt = "no"; - foreach my $sw_port (1 .. 
$switches{$sw_addr}) { - clear_counters; - get_counts($sw_addr, $sw_port, $ca_name, $ca_port); - if (any_counts() && $switch_prompt eq "no") { - my $hr = $IBswcountlimits::link_ends{"$sw_addr"}{$sw_port}; - printf("Errors for %18s \"%s\"\n", $sw_addr, $hr->{loc_desc}); - $switch_prompt = "yes"; - } - report_counts($sw_addr, $sw_port); - } - } - if ($single_switch && $switch_found ne "yes") { - printf("Switch \"%s\" not found.\n", $single_switch); - } -} -main; - diff --git a/infiniband-diags/scripts/ibqueryerrors.pl.in b/infiniband-diags/scripts/ibqueryerrors.pl.in new file mode 100755 index 0000000..30e610c --- /dev/null +++ b/infiniband-diags/scripts/ibqueryerrors.pl.in @@ -0,0 +1,40 @@ +#!/usr/bin/perl +# +# Copyright (c) 2009 Lawrence Livermore National Security +# +# Produced at Lawrence Livermore National Laboratory. +# Written by Ira Weiny . +# +# This software is available to you under a choice of one of two +# licenses. You may choose to be licensed under the terms of the GNU +# General Public License (GPL) Version 2, available from the file +# COPYING in the main directory of this source tree, or the +# OpenIB.org BSD license below: +# +# Redistribution and use in source and binary forms, with or +# without modification, are permitted provided that the following +# conditions are met: +# +# - Redistributions of source code must retain the above +# copyright notice, this list of conditions and the following +# disclaimer. +# +# - Redistributions in binary form must reproduce the above +# copyright notice, this list of conditions and the following +# disclaimer in the documentation and/or other materials +# provided with the distribution. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +# NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS +# BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN +# ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN +# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. +# + + +# this is now just a wrapper for the C based utility +$str = join " ", @ARGV; +exec "@IBSCRIPTPATH@/ibqueryerrors $str"; diff --git a/infiniband-diags/src/ibqueryerrors.c b/infiniband-diags/src/ibqueryerrors.c new file mode 100644 index 0000000..9d96190 --- /dev/null +++ b/infiniband-diags/src/ibqueryerrors.c @@ -0,0 +1,469 @@ +/* + * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. + * Copyright (c) 2008 Lawrence Livermore National Lab. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#if HAVE_CONFIG_H +# include +#endif /* HAVE_CONFIG_H */ + +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include "ibdiag_common.h" + +char *argv0 = "ibqueryerrors"; +static FILE *f; + +struct ibmad_port *ibmad_port; +static char *node_name_map_file = NULL; +static nn_map_t *node_name_map = NULL; +int data_counters = 0; +int port_config = 0; +uint64_t switch_guid = 0; +char *switch_guid_str = NULL; +int sup_total = 0; +enum MAD_FIELDS *suppressed_fields = NULL; +char *dr_path = NULL; +int all_nodes = 0; + +static unsigned int +get_max(unsigned int num) +{ + unsigned int v = num; // 32-bit word to find the log base 2 of + unsigned r = 0; // r will be lg(v) + + while (v >>= 1) // unroll for more speed... + { + r++; + } + + return (1 << r); +} + +static void +get_msg(char *width_msg, char *speed_msg, int msg_size, ibnd_port_t *port) +{ + char buf[64]; + uint32_t max_speed = 0; + + uint32_t max_width = get_max(mad_get_field(port->info, 0, + IB_PORT_LINK_WIDTH_SUPPORTED_F) + & mad_get_field(port->remoteport->info, 0, + IB_PORT_LINK_WIDTH_SUPPORTED_F)); + if ((max_width & mad_get_field(port->info, 0, + IB_PORT_LINK_WIDTH_ACTIVE_F)) == 0) { + // we are not at the max supported width + // print what we could be at. 
+ snprintf(width_msg, msg_size, "Could be %s", + mad_dump_val(IB_PORT_LINK_WIDTH_ACTIVE_F, + buf, 64, &max_width)); + } + + max_speed = get_max(mad_get_field(port->info, 0, + IB_PORT_LINK_SPEED_SUPPORTED_F) + & mad_get_field(port->remoteport->info, 0, + IB_PORT_LINK_SPEED_SUPPORTED_F)); + if ((max_speed & mad_get_field(port->info, 0, + IB_PORT_LINK_SPEED_ACTIVE_F)) == 0) { + // we are not at the max supported speed + // print what we could be at. + snprintf(speed_msg, msg_size, "Could be %s", + mad_dump_val(IB_PORT_LINK_SPEED_ACTIVE_F, + buf, 64, &max_speed)); + } +} + +static void +print_port_config(ibnd_node_t *node, int portnum) +{ + char width[64], speed[64], state[64], physstate[64]; + char remote_str[256]; + char link_str[256]; + char width_msg[256]; + char speed_msg[256]; + char ext_port_str[256]; + int iwidth, ispeed, istate, iphystate; + int n = 0; + + ibnd_port_t *port = node->ports[portnum]; + + if (!port) + return; + + iwidth = mad_get_field(port->info, 0, IB_PORT_LINK_WIDTH_ACTIVE_F); + ispeed = mad_get_field(port->info, 0, IB_PORT_LINK_SPEED_ACTIVE_F); + istate = mad_get_field(port->info, 0, IB_PORT_STATE_F); + iphystate = mad_get_field(port->info, 0, IB_PORT_PHYS_STATE_F); + + remote_str[0] = '\0'; + link_str[0] = '\0'; + width_msg[0] = '\0'; + speed_msg[0] = '\0'; + + n = snprintf(link_str, 256, "(%3s %s %6s/%8s)", + mad_dump_val(IB_PORT_LINK_WIDTH_ACTIVE_F, width, 64, &iwidth), + mad_dump_val(IB_PORT_LINK_SPEED_ACTIVE_F, speed, 64, &ispeed), + mad_dump_val(IB_PORT_STATE_F, state, 64, &istate), + mad_dump_val(IB_PORT_PHYS_STATE_F, physstate, 64, &iphystate)); + + if (port->remoteport) { + char *remap = remap_node_name(node_name_map, port->remoteport->node->guid, + port->remoteport->node->nodedesc); + + if (port->remoteport->ext_portnum) + snprintf(ext_port_str, 256, "%d", port->remoteport->ext_portnum); + else + ext_port_str[0] = '\0'; + + get_msg(width_msg, speed_msg, 256, port); + + snprintf(remote_str, 256, + "0x%016"PRIx64" %6d %4d[%2s] \"%s\" 
(%s %s)\n", + port->remoteport->node->guid, + port->remoteport->base_lid ? port->remoteport->base_lid : + port->remoteport->node->smalid, + port->remoteport->portnum, + ext_port_str, + remap, + width_msg, + speed_msg); + free(remap); + } else + snprintf(remote_str, 256, " [ ] \"\" ( )\n"); + + if (port->ext_portnum) + snprintf(ext_port_str, 256, "%d", port->ext_portnum); + else + ext_port_str[0] = '\0'; + + if (node->type == IB_NODE_SWITCH) + printf(" %6d", node->smalid); + else + printf(" %6d", port->base_lid); + + printf("%4d[%2s] ==%s==> %s", + port->portnum, ext_port_str, link_str, remote_str); +} + +static int +suppress(enum MAD_FIELDS field) +{ + int i = 0; + if (suppressed_fields) + for (i = 0; i < sup_total; i++) { + if (field == suppressed_fields[i]) + return (1); + } + return (0); +} + +static void +report_suppressed(void) +{ + int i = 0; + if (suppressed_fields) { + printf("Suppressing:"); + for (i = 0; i < sup_total; i++) { + printf(" %s", mad_field_name(suppressed_fields[i])); + } + printf("\n"); + } +} + +static void +print_results(ibnd_node_t *node, uint8_t *pc, int portnum) +{ + char buf[1024]; + char *str = buf; + uint32_t val = 0; + int n = 0; + int i = 0; + + for (n = 0, i = IB_PC_ERR_SYM_F; i <= IB_PC_VL15_DROPPED_F; i++) { + if (suppress(i)) + continue; + + mad_decode_field(pc, i, (void *)&val); + if (val) + n += snprintf(str+n, 1024-n, " [%s == %d]", + mad_field_name(i), val); + } + + if (!suppress(IB_PC_XMT_WAIT_F)) { + mad_decode_field(pc, IB_PC_XMT_WAIT_F, (void *)&val); + if (val) + n += snprintf(str+n, 1024-n, " [%s == %d]", mad_field_name(IB_PC_XMT_WAIT_F), val); + } + + /* if we found errors.
*/ + if (n != 0) { + char *nodename = remap_node_name(node_name_map, node->guid, node->nodedesc); + if (data_counters) + for (i = IB_PC_XMT_BYTES_F; i <= IB_PC_RCV_PKTS_F; i++) { + uint64_t val64 = 0; + mad_decode_field(pc, i, (void *)&val64); + if (val64) + n += snprintf(str+n, 1024-n, " [%s == %"PRId64"]", + mad_field_name(i), val64); + } + + printf("Errors for 0x%" PRIx64 " \"%s\"\n", node->guid, nodename); + printf(" GUID 0x%" PRIx64 " port %d:%s\n", + node->guid, portnum, str); + if (port_config) + print_port_config(node, portnum); + free(nodename); + } +} + +static void +print_port(ibnd_node_t *node, int portnum) +{ + uint8_t pc[1024]; + uint16_t cap_mask; + ib_portid_t portid = {0}; + char *nodename = remap_node_name(node_name_map, node->guid, node->nodedesc); + + if (node->type == IB_NODE_SWITCH) + ib_portid_set(&portid, node->smalid, 0, 0); + else + ib_portid_set(&portid, node->ports[portnum]->base_lid, 0, 0); + + /* PerfMgt ClassPortInfo is a required attribute */ + if (!pma_query_via(pc, &portid, portnum, ibd_timeout, CLASS_PORT_INFO, + ibmad_port)) { + IBWARN("classportinfo query failed on %s, %s port %d", + nodename, portid2str(&portid), portnum); + goto cleanup; + } + /* ClassPortInfo should be supported as part of libibmad */ + memcpy(&cap_mask, pc + 2, sizeof(cap_mask)); /* CapabilityMask */ + + if (!pma_query_via(pc, &portid, portnum, ibd_timeout, + IB_GSI_PORT_COUNTERS, + ibmad_port)) { + IBWARN("IB_GSI_PORT_COUNTERS query failed on %s, %s port %d\n", + nodename, portid2str(&portid), portnum); + goto cleanup; + } + if (!(cap_mask & 0x1000)) { + /* if PortCounters:PortXmitWait not suppported clear this counter */ + uint32_t foo = 0; + mad_encode_field(pc, IB_PC_XMT_WAIT_F, &foo); + } + print_results(node, pc, portnum); + +cleanup: + free(nodename); +} + +void +print_node(ibnd_node_t *node, void *user_data) +{ + int p = 0; + int startport = 1; + + if (!all_nodes && node->type != IB_NODE_SWITCH) + return; + + if (node->type == IB_NODE_SWITCH && 
node->smaenhsp0) + startport = 0; + + for (p = startport; p <= node->numports; p++) { + if (node->ports[p]) { + print_port(node, p); + } + } +} + +static void +add_suppressed(enum MAD_FIELDS field) +{ + suppressed_fields = realloc(suppressed_fields, (sup_total + 1) * sizeof(enum MAD_FIELDS)); + suppressed_fields[sup_total] = field; + sup_total++; +} + +static void +calculate_suppressed_fields(char *str) +{ + enum MAD_FIELDS f = 0; + char *tmp = strdup(str); + char *lasts, *val; + + val = strtok_r(tmp, ",", &lasts); + while (val) { + for (f = IB_PC_FIRST_F; f <= IB_PC_LAST_F; f++) { + if (strcmp(val, mad_field_name(f)) == 0) { + add_suppressed(f); + } + } + val = strtok_r(NULL, ",", &lasts); + } + + free(tmp); +} + +static int process_opt(void *context, int ch, char *optarg) +{ + switch (ch) { + case 's': + calculate_suppressed_fields(optarg); + break; + case 'c': + /* Right now this is the only "common" error */ + add_suppressed(IB_PC_ERR_SWITCH_REL_F); + break; + case 1: + node_name_map_file = strdup(optarg); + break; + case 2: + data_counters++; + break; + case 3: + all_nodes++; + break; + case 'S': + switch_guid_str = strdup(optarg); + switch_guid = (uint64_t)strtoull(switch_guid_str, 0, 0); + break; + case 'D': + dr_path = strdup(optarg); + break; + case 'r': + port_config++; + break; + case 'R': /* nop */ + break; + default: + return -1; + } + + return 0; +} + +int +main(int argc, char **argv) +{ + int rc = 0; + ibnd_fabric_t *fabric = NULL; + + int mgmt_classes[4] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS, IB_PERFORMANCE_CLASS}; + + const struct ibdiag_opt opts[] = { + { "suppress", 's', 1, "", "suppress errors listed" }, + { "suppress-common", 'c', 0, NULL, "suppress some of the common counters" }, + { "node-name-map", 1, 1, "", "node name map file" }, + { "switch", 'S', 1, "", "query only (hex format)"}, + { "Direct", 'D', 1, "", "query only switch specified by "}, + { "report-port", 'r', 0, NULL, "report port configuration information"}, + { "GNDN", 'R', 0, NULL,
"(This option is obsolete and does nothing)"}, + { "data", 2, 0, NULL, "include the data counters in the output"}, + { "all", 3, 0, NULL, "output all nodes (not just switches)"}, + { 0 } + }; + char usage_args[] = ""; + + ibdiag_process_opts(argc, argv, "sDLG", "snSrR", opts, process_opt, + usage_args, NULL); + + f = stdout; + + argc -= optind; + argv += optind; + + if (ibverbose) + ibnd_debug(1); + + ibmad_port = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 4); + if (!ibmad_port) + IBERROR("Failed to open port; %s:%d\n", ibd_ca, ibd_ca_port); + + node_name_map = open_node_name_map(node_name_map_file); + + if (switch_guid) { + /* limit the scan the fabric around the target */ + ib_portid_t portid = {0}; + + if (ib_resolve_portid_str_via(&portid, switch_guid_str, IB_DEST_GUID, + ibd_sm_id, ibmad_port) < 0) { + fprintf(stderr, "can't resolve destination port %s %p\n", + switch_guid_str, ibd_sm_id); + rc = 1; + goto close_port; + } + + if ((fabric = ibnd_discover_fabric(ibmad_port, ibd_timeout, &portid, 1)) == NULL) { + fprintf(stderr, "discover failed\n"); + rc = 1; + goto close_port; + } + } else { + if ((fabric = ibnd_discover_fabric(ibmad_port, ibd_timeout, NULL, -1)) == NULL) { + fprintf(stderr, "discover failed\n"); + rc = 1; + goto close_port; + } + } + + report_suppressed(); + + if (switch_guid) { + ibnd_node_t *node = ibnd_find_node_guid(fabric, switch_guid); + print_node(node, NULL); + } else if (dr_path) { + ibnd_node_t *node = ibnd_find_node_dr(fabric, dr_path); + print_node(node, NULL); + } else + ibnd_iter_nodes(fabric, print_node, NULL); + + ibnd_destroy_fabric(fabric); + +close_port: + mad_rpc_close_port(ibmad_port); + close_node_name_map(node_name_map); + exit(rc); +} -- 1.5.4.5 From weiny2 at llnl.gov Thu Apr 23 13:40:55 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 23 Apr 2009 13:40:55 -0700 Subject: [ofa-general] [PATCH v3 1/3] Create a new library libibnetdisc In-Reply-To: References: <20090403154251.dec181f2.weiny2@llnl.gov> 
<20090423070210.GA8281@sk> Message-ID: <20090423134055.164ab69f.weiny2@llnl.gov> Sorry I missed this thread. There is also an ibdebug defined in libibmad. extern int ibdebug; This is the one it is using... :-/ I think there should be a wrapper function. Perhaps madrpc_show_errors? Ira On Thu, 23 Apr 2009 11:49:36 -0700 "Sean Hefty" wrote: > >> Where does the definition for ibdebug come from? > > > >It is in ibdiag_common.c. Every infiniband-ibdiag tool is linked with > >it. And yes, using this in this library can be problematic since > >introduces a "hidden" dependency. > > How does that work? The library doesn't link ibdiag_common.c, so I'm not sure > what definition it picks up. Maybe it defaults to undefined, assumed int... > > To get things to build and run on Windows, I defined it as a static in the > library. > -- Ira Weiny Math Programmer/Computer Scientist Lawrence Livermore National Lab weiny2 at llnl.gov From worleys at gmail.com Thu Apr 23 14:25:50 2009 From: worleys at gmail.com (Chris Worley) Date: Thu, 23 Apr 2009 15:25:50 -0600 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules? In-Reply-To: References: Message-ID: Given the deafening silent "NO", might I suggest that OFED build procedures include in a generated RPM: 1) The OFED release/version number, and 2) The ofed.conf used in the install.pl procedure, much like Linux's /proc/config.gz (but it need not be a /proc file). Thanks, Chris On Wed, Apr 22, 2009 at 11:36 AM, Chris Worley wrote: > Using an Ubuntu 8.10 distro w/ a 2.6.27-11 kernel, I'm wondering: from > what OFED version were their built-in IB modules derived? > > The only version message I see is from the MLX4 driver: > > mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April 4, 2008) > > Would that imply 1.3? 
> > Thanks, > > Chris > From jsquyres at cisco.com Thu Apr 23 14:28:33 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 23 Apr 2009 17:28:33 -0400 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IBmodules? In-Reply-To: References: Message-ID: On Apr 23, 2009, at 5:25 PM, Chris Worley wrote: > Given the deafening silent "NO", might I suggest that OFED build > procedures include in a generated RPM: > > 1) The OFED release/version number, and > 2) The ofed.conf used in the install.pl procedure, much like Linux's > /proc/config.gz (but it need not be a /proc file). > There is an "ofed_info" command that returns the version. The output is not too well-formed, but it should have the info. > > Thanks, > > Chris > On Wed, Apr 22, 2009 at 11:36 AM, Chris Worley > wrote: > > Using an Ubuntu 8.10 distro w/ a 2.6.27-11 kernel, I'm wondering: > from > > what OFED version were their built-in IB modules derived? > > > > The only version message I see is from the MLX4 driver: > > > > mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April 4, 2008) > > > > Would that imply 1.3? > > > > Thanks, > > > > Chris > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Jeff Squyres Cisco Systems From worleys at gmail.com Thu Apr 23 14:29:29 2009 From: worleys at gmail.com (Chris Worley) Date: Thu, 23 Apr 2009 15:29:29 -0600 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IBmodules? 
In-Reply-To: References: Message-ID: On Thu, Apr 23, 2009 at 3:28 PM, Jeff Squyres wrote: > On Apr 23, 2009, at 5:25 PM, Chris Worley wrote: > >> Given the deafening silent "NO", might I suggest that OFED build >> procedures include in a generated RPM: >> >> 1) The OFED release/version number, and >> 2) The ofed.conf used in the install.pl procedure, much like Linux's >> /proc/config.gz (but it need not be a /proc file). >> > > There is an "ofed_info" command that returns the version.  The output is not > too well-formed, but it should have the info. ... but the distro makers don't include it :( Chris > >> >> Thanks, >> >> Chris >> On Wed, Apr 22, 2009 at 11:36 AM, Chris Worley wrote: >> > Using an Ubuntu 8.10 distro w/ a 2.6.27-11 kernel, I'm wondering: from >> > what OFED version were their built-in IB modules derived? >> > >> > The only version message I see is from the MLX4 driver: >> > >> > mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April 4, 2008) >> > >> > Would that imply 1.3? >> > >> > Thanks, >> > >> > Chris >> > >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general > > > -- > Jeff Squyres > Cisco Systems > > From sean.hefty at intel.com Thu Apr 23 14:29:54 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 23 Apr 2009 14:29:54 -0700 Subject: [ofa-general] [PATCH v3 1/3] Create a new library libibnetdisc In-Reply-To: <20090423134055.164ab69f.weiny2@llnl.gov> References: <20090403154251.dec181f2.weiny2@llnl.gov> <20090423070210.GA8281@sk> <20090423134055.164ab69f.weiny2@llnl.gov> Message-ID: <28769EB1C4FB4999975354DD93FC2107@amr.corp.intel.com> >There is also an ibdebug defined in libibmad. > >extern int ibdebug; > >This is the one it is using... :-/ I think there should be a wrapper >function. Perhaps madrpc_show_errors? 
Yes - that's the one it picks up. Adding a wrapper makes sense to me. (I don't think that declaring a variable as extern is sufficient to share it across library boundaries in windows.) From jsquyres at cisco.com Thu Apr 23 14:52:23 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 23 Apr 2009 17:52:23 -0400 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IBmodules? In-Reply-To: References: Message-ID: <699B1CE5-9FFC-4499-A798-CF54090666CA@cisco.com> On Apr 23, 2009, at 5:29 PM, Chris Worley wrote: > > There is an "ofed_info" command that returns the version. The > output is not > > too well-formed, but it should have the info. > > ... but the distro makers don't include it :( > Doh! Yes, that's a big bummer. :-( -- Jeff Squyres Cisco Systems From weiny2 at llnl.gov Thu Apr 23 15:09:43 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 23 Apr 2009 15:09:43 -0700 Subject: [ofa-general] [PATCH v3 1/3] Create a new library libibnetdisc In-Reply-To: <28769EB1C4FB4999975354DD93FC2107@amr.corp.intel.com> References: <20090403154251.dec181f2.weiny2@llnl.gov> <20090423070210.GA8281@sk> <20090423134055.164ab69f.weiny2@llnl.gov> <28769EB1C4FB4999975354DD93FC2107@amr.corp.intel.com> Message-ID: <20090423150943.c512fecb.weiny2@llnl.gov> On Thu, 23 Apr 2009 14:29:54 -0700 "Sean Hefty" wrote: > >There is also an ibdebug defined in libibmad. > > > >extern int ibdebug; > > > >This is the one it is using... :-/ I think there should be a wrapper > >function. Perhaps madrpc_show_errors? > > Yes - that's the one it picks up. Adding a wrapper makes sense to me. (I don't > think that declaring a variable as extern is sufficient to share it across > library boundaries in windows.) > Patch below. 
From: Ira Weiny Date: Thu, 23 Apr 2009 15:08:28 -0700 Subject: [PATCH] libibmad: create a wrapper for ibdebug and make libibnetdisc use it Signed-off-by: Ira Weiny --- infiniband-diags/libibnetdisc/src/ibnetdisc.c | 7 +++++-- infiniband-diags/libibnetdisc/src/internal.h | 2 +- libibmad/include/infiniband/mad.h | 3 +-- libibmad/src/gs.c | 1 + libibmad/src/libibmad.map | 1 + libibmad/src/mad_internal.h | 2 ++ libibmad/src/portid.c | 1 + libibmad/src/rpc.c | 5 +++++ 8 files changed, 17 insertions(+), 5 deletions(-) diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index 410e2dd..cee4c95 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -59,6 +59,7 @@ static int timeout_ms = 2000; static int show_progress = 0; +static int ibnd_debug_flg = 0; void decode_port_info(ibnd_port_t *port) @@ -638,11 +639,13 @@ void ibnd_debug(int i) { if (i) { - ibdebug++; + ibnd_debug_flg = 1; + madrpc_show_debug(1); madrpc_show_errors(1); umad_debug(i); } else { - ibdebug = 0; + ibnd_debug_flg = 0; + madrpc_show_debug(0); madrpc_show_errors(0); umad_debug(0); } diff --git a/infiniband-diags/libibnetdisc/src/internal.h b/infiniband-diags/libibnetdisc/src/internal.h index 4e6bb18..58ba2a8 100644 --- a/infiniband-diags/libibnetdisc/src/internal.h +++ b/infiniband-diags/libibnetdisc/src/internal.h @@ -43,7 +43,7 @@ #define MAXHOPS 63 #define IBND_DEBUG(fmt, ...) \ - if (ibdebug) { \ + if (ibnd_debug_flg) { \ printf("%s:%u; " fmt, __FILE__, __LINE__, ## __VA_ARGS__); \ } #define IBND_ERROR(fmt, ...) 
\ diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index b6f4b60..1fbbf1c 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -733,6 +733,7 @@ MAD_EXPORT int mad_build_pkt(void *umad, ib_rpc_t * rpc, ib_portid_t * dport, /* New interface */ MAD_EXPORT void madrpc_show_errors(int set); +MAD_EXPORT void madrpc_show_debug(int set); MAD_EXPORT int madrpc_set_retries(int retries); MAD_EXPORT int madrpc_set_timeout(int timeout); MAD_EXPORT struct ibmad_port *mad_rpc_open_port(char *dev_name, int dev_port, @@ -892,8 +893,6 @@ MAD_EXPORT ib_mad_dump_fn mad_dump_switchinfo, mad_dump_perfcounters, mad_dump_perfcounters_ext, mad_dump_perfcounters_xmt_sl, mad_dump_perfcounters_rcv_sl; -extern int ibdebug; - #if __BYTE_ORDER == __LITTLE_ENDIAN #ifndef ntohll static inline uint64_t ntohll(uint64_t x) diff --git a/libibmad/src/gs.c b/libibmad/src/gs.c index dbca9e9..eea4a29 100644 --- a/libibmad/src/gs.c +++ b/libibmad/src/gs.c @@ -41,6 +41,7 @@ #include #include +#include "mad_internal.h" #undef DEBUG #define DEBUG if (ibdebug) IBWARN diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map index 6b77784..67c32bb 100644 --- a/libibmad/src/libibmad.map +++ b/libibmad/src/libibmad.map @@ -80,6 +80,7 @@ IBMAD_1.3 { madrpc_set_retries; madrpc_set_timeout; madrpc_show_errors; + madrpc_show_debug; ib_path_query; sa_call; sa_rpc_call; diff --git a/libibmad/src/mad_internal.h b/libibmad/src/mad_internal.h index 24418cc..0038197 100644 --- a/libibmad/src/mad_internal.h +++ b/libibmad/src/mad_internal.h @@ -44,4 +44,6 @@ struct ibmad_port { extern struct ibmad_port *ibmp; +extern int ibdebug; + #endif /* _MAD_INTERNAL_H_ */ diff --git a/libibmad/src/portid.c b/libibmad/src/portid.c index de9e2d3..773fa5b 100644 --- a/libibmad/src/portid.c +++ b/libibmad/src/portid.c @@ -40,6 +40,7 @@ #include #include +#include "mad_internal.h" #undef DEBUG #define DEBUG if (ibdebug) IBWARN diff --git a/libibmad/src/rpc.c 
b/libibmad/src/rpc.c index ebeb835..78e68a8 100644 --- a/libibmad/src/rpc.c +++ b/libibmad/src/rpc.c @@ -72,6 +72,11 @@ void madrpc_show_errors(int set) iberrs = set; } +void madrpc_show_debug(int set) +{ + ibdebug = set; +} + void madrpc_save_mad(void *madbuf, int len) { save_mad = madbuf; -- 1.5.4.5 From sean.hefty at intel.com Thu Apr 23 15:47:08 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Thu, 23 Apr 2009 15:47:08 -0700 Subject: [ofa-general] RE: [PATCH v2] rdma_cm: Add debugfs entries to monitor rdma_cm connections In-Reply-To: <49F05AAE.4020606@Voltaire.COM> References: <49F05AAE.4020606@Voltaire.COM> Message-ID: <90A07B5E8FAD49C187EAE583A976A3CD@amr.corp.intel.com> The output is much easier to read. :) >@@ -59,6 +62,10 @@ MODULE_LICENSE("Dual BSD/GPL"); > #define CMA_MAX_CM_RETRIES 15 > #define CMA_CM_MRA_SETTING (IB_CM_MRA_FLAG_DELAY | 24) > >+#define CASE_RET(val, ret) case val: return #ret; I would just drop this abstraction. >+static const char *format_node_type(enum rdma_node_type nt) >+{ >+ enum rdma_transport_type tt; >+ if (nt) { >+ tt = rdma_node_get_transport(nt); >+ switch (tt) { We don't really need the local variable tt. 
>+static int cma_rdma_id_seq_show(struct seq_file *file, void *v) >+{ >+ struct rdma_id_private *id_priv; >+ char local_addr[64], remote_addr[64]; >+ >+ if (!v) >+ return 0; >+ if (v == SEQ_START_TOKEN) { >+ seq_printf(file, >+ "%-5s" >+ "%-8s" >+ "%-5s" >+ "%-8s" >+ "%-52s" >+ "%-52s" >+ "%-6s" >+ "%-15s" >+ "%-8s" >+ "\n", >+ "TYPE", "DEVICE", "PORT", "NET_DEV", "SRC_ADDR", "DST_ADDR", >"SPACE", "STATE", "QP_NUM"); >+ } else { >+ id_priv = list_entry(v, struct rdma_id_private, list); >+ format_addr((struct sockaddr *)&id_priv->id.route.addr.src_addr, >+ local_addr); >+ format_addr((struct sockaddr *)&id_priv->id.route.addr.dst_addr, >+ remote_addr); >+ >+ seq_printf(file, >+ "%-5s" >+ "%-8s" >+ "%-5d" >+ "%-8s" >+ "%-52s" >+ "%-52s" >+ "%-6s" >+ "%-15s" >+ "%-8d" >+ "\n", >+ format_node_type(id_priv- >>id.route.addr.dev_addr.dev_type), >+ (id_priv->id.device) ? id_priv->id.device->name : "", >+ id_priv->id.port_num, >+ (id_priv->id.route.addr.dev_addr.src_dev) ? id_priv- >>id.route.addr.dev_addr.src_dev->name : "", >+ local_addr, >+ remote_addr, >+ format_port_space(id_priv->id.ps), >+ format_cma_state(id_priv->state), >+ id_priv->qp_num); >+ } I still think this requires a lot of scrolling to get past a couple of print statements. Can we at least collapse the "%-5s" ... "\n" stuff down to a single line? From rdreier at cisco.com Thu Apr 23 16:02:23 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 23 Apr 2009 16:02:23 -0700 Subject: [ofa-general] Re: mlx4_ib_post_send(): incorrect test on wr->opcode? In-Reply-To: <49F08E5A.7060808@gmail.com> (Roel Kluin's message of "Thu, 23 Apr 2009 17:50:50 +0200") References: <49F08E5A.7060808@gmail.com> Message-ID: > wr->opcode cannot be less than 0, can it? why not? The struct field is signed, isn't it? What prevents a consumer from passing in a negative garbage value? 
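The concern above about a signed opcode can be illustrated outside the driver. What follows is only a minimal sketch, not the mlx4 code: the enum, the table contents and the helper name demo_opcode_in_range() are all made up for illustration. It checks both ends of the range before the value is used as an index into a lookup table.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical work-request opcodes; not the real ib_wr_opcode values. */
enum demo_wr_opcode {
	DEMO_WR_RDMA_WRITE,
	DEMO_WR_SEND,
	DEMO_WR_RDMA_READ,
};

/* Hypothetical translation table indexed by opcode (values are invented). */
static const unsigned demo_hw_opcode[] = {
	[DEMO_WR_RDMA_WRITE] = 1,
	[DEMO_WR_SEND]       = 2,
	[DEMO_WR_RDMA_READ]  = 3,
};

/*
 * Return 1 if opcode may be used as an index into demo_hw_opcode[],
 * 0 otherwise.  The opcode is taken as a plain int so that a negative
 * garbage value from a caller is rejected by the first test instead of
 * turning into a huge index.
 */
static int demo_opcode_in_range(int opcode)
{
	if (opcode < 0)
		return 0;
	return (size_t)opcode < ARRAY_SIZE(demo_hw_opcode);
}

/* Small self-check covering both ends of the range. */
static void demo_self_check(void)
{
	assert(!demo_opcode_in_range(-1));
	assert(demo_opcode_in_range(DEMO_WR_RDMA_WRITE));
	assert(demo_opcode_in_range(DEMO_WR_RDMA_READ));
	assert(!demo_opcode_in_range(DEMO_WR_RDMA_READ + 1));
}
```

Because the upper bound comes from the table size, a newly added table entry is covered without touching the check. A range check alone still accepts an in-range opcode whose table slot was never filled in.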
> but note below that in mlx4_ib_opcodeIB_WR_RDMA_READ_WITH_INV is > missing, so shouldn't this be: > if (wr->opcode > IB_WR_FAST_REG_MR) { > err = -EINVAL; > goto out; > } I don't understand the difference between explicitly saying IB_WR_FAST_REG_MR and using ARRAY_SIZE(), except that if we add new opcodes then the ARRAY_SIZE() way works without having to remember it. It is true that if someone passes in IB_WR_RDMA_READ_WITH_INV, then the mlx4 code won't return an error for that unimplemented operation, but I don't see how your change fixes that (or indeed fixes anything). - R. From ido at uchicago.edu Thu Apr 23 16:42:58 2009 From: ido at uchicago.edu (Ido Rosen) Date: Thu, 23 Apr 2009 19:42:58 -0400 Subject: [ofa-general] test Message-ID: <592164a70904231642g1f853e74h38b4ca6422d28506@mail.gmail.com> just making sure this still works. From jgunthorpe at obsidianresearch.com Thu Apr 23 16:48:39 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 23 Apr 2009 17:48:39 -0600 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules? In-Reply-To: References: Message-ID: <20090423234839.GE4431@obsidianresearch.com> AFAIK, Ubuntu does not do any work on their IB drivers, so the driver is stock 2.6.27. In principle OFED is supposed to start with an upstream kernel and backport those drivers to various distributions. OFED 1.3 was using 2.6.24, OFED 1.4 is apparently using 2.6.27. So it should be similar to OFED 1.4 Though bear in mind OFED still patches things with stuff that is not yet accepted upstream so there will be some differences. It should be compatible with the OFED 1.4 userspace. On Thu, Apr 23, 2009 at 03:25:50PM -0600, Chris Worley wrote: > Given the deafening silent "NO", might I suggest that OFED build > procedures include in a generated RPM: > > 1) The OFED release/version number, and > 2) The ofed.conf used in the install.pl procedure, much like Linux's > /proc/config.gz (but it need not be a /proc file). 
> > Thanks, > > Chris > On Wed, Apr 22, 2009 at 11:36 AM, Chris Worley wrote: > > Using an Ubuntu 8.10 distro w/ a 2.6.27-11 kernel, I'm wondering: from > > what OFED version were their built-in IB modules derived? > > > > The only version message I see is from the MLX4 driver: > > > > mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April 4, 2008) > > > > Would that imply 1.3? > > > > Thanks, > > > > Chris From Zhen.Liang at Sun.COM Thu Apr 23 20:06:28 2009 From: Zhen.Liang at Sun.COM (Liang Zhen) Date: Fri, 24 Apr 2009 11:06:28 +0800 Subject: [ofa-general] Possible bug of inkernel OFED of RHEL5.3? Message-ID: <49F12CB4.6030302@sun.com> Hi there, I've posted this in rhel5-list, but I'm not sure whether it's the right place so I post it here again... We got this assertion while running inkernel OFED of RHEL5.3: Apr 15 08:06:24 cl8-0 kernel: RTNL: assertion failed at net/core/fib_rules.c (388) Apr 15 08:06:24 cl8-0 kernel: Apr 15 08:06:24 cl8-0 kernel: Call Trace: Apr 15 08:06:24 cl8-0 kernel: [] fib_rules_event+0x3d/0xff Apr 15 08:06:24 cl8-0 kernel: [] notifier_call_chain+0x20/0x32 Apr 15 08:06:24 cl8-0 kernel: [] dev_set_mtu+0x5a/0x60 Apr 15 08:06:24 cl8-0 kernel: [] :ib_ipoib:set_mode+0x94/0x134 Apr 15 08:06:24 cl8-0 kernel: [] sysfs_write_file+0xb9/0xe8 Apr 15 08:06:24 cl8-0 kernel: [] vfs_write+0xce/0x174 Apr 15 08:06:24 cl8-0 kernel: [] sys_write+0x45/0x6e Apr 15 08:06:24 cl8-0 kernel: [] system_call+0x7e/0x83 Apr 15 08:06:24 cl8-0 kernel: Apr 15 08:06:24 cl8-0 kernel: RTNL: assertion failed at net/ipv4/devinet.c (986) Apr 15 08:06:24 cl8-0 kernel: Apr 15 08:06:24 cl8-0 kernel: Call Trace: Apr 15 08:06:24 cl8-0 kernel: [] inetdev_event+0x48/0x282 Apr 15 08:06:24 cl8-0 kernel: [] notifier_call_chain+0x20/0x32 Apr 15 08:06:24 cl8-0 kernel: [] dev_set_mtu+0x5a/0x60 Apr 15 08:06:24 cl8-0 kernel: [] :ib_ipoib:set_mode+0x94/0x134 Apr 15 08:06:24 cl8-0 kernel: [] sysfs_write_file+0xb9/0xe8 Apr 15 08:06:24 cl8-0 kernel: [] vfs_write+0xce/0x174 Apr 15 08:06:24 cl8-0 
kernel: [] sys_write+0x45/0x6e Apr 15 08:06:24 cl8-0 kernel: [] system_call+0x7e/0x83 Apr 15 08:06:24 cl8-0 kernel: When looking into code I found: sysfs_write_file()->flush_write_buffer()->store()->ipoib_cm.c::set_mode()->dev_set_mtu()->raw_notifier_call_chain->notifier_call_chain()->fib_rules_event()->ASSERT_RTNL(). So, ipoib_cm called dev_set_mtu without rtnl_lock, but dev_set_mtu will assert caller already has rtnl_lock. I think we may need this patch, could somebody confirm this? Thanks Liang --- drivers/infiniband/ulp/ipoib/ipoib_cm.c 2009-04-16 12:49:04.000000000 -0400 +++ drivers/infiniband/ulp/ipoib/ipoib_cm.c 2009-04-16 12:48:52.000000000 -0400 @@ -1481,7 +1481,9 @@ static ssize_t set_mode(struct class_dev if (ipoib_cm_max_mtu(dev) > priv->mcast_mtu) ipoib_warn(priv, "mtu > %d will cause multicast packet drops.\n", priv->mcast_mtu); + rtnl_lock(); dev_set_mtu(dev, ipoib_cm_max_mtu(dev)); + rtnl_unlock(); ipoib_flush_paths(dev); return count; From nicolas.morey-chaisemartin at ext.bull.net Thu Apr 23 23:58:24 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey-Chaisemartin) Date: Fri, 24 Apr 2009 08:58:24 +0200 Subject: [ofa-general] [PATCH/Resend] Fixed capability mask problem in ibstat introduec by commit 722b6c6428c9e4921a81f4a6db2838bcee660bb7 Message-ID: <49F16310.1080902@ext.bull.net> Signed-off-by: Nicolas Morey-Chaisemartin --- I don't know if compilation on WinOF is still working with this patch as I have no way to test it but it fixes the problem for Linux. If it doesn't work anymore, ntohll result should be shift of 32 bits right (>>32) before being cast to unsigned. 
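The truncation behind this fix can be reproduced in isolation. The sketch below is illustrative only and is not taken from ibstat or libibumad: it assumes the capability mask is a 32-bit value held in network byte order and that the host is little-endian. swap32() and swap64() are hypothetical stand-ins for ntohl() and ntohll(), so the result does not depend on the machine it is built on.

```c
#include <assert.h>
#include <stdint.h>

/* Unconditional 32-bit byte swap: what ntohl() does on a little-endian host. */
static uint32_t swap32(uint32_t x)
{
	return ((x & 0x000000ffu) << 24) | ((x & 0x0000ff00u) << 8) |
	       ((x & 0x00ff0000u) >> 8)  | ((x & 0xff000000u) >> 24);
}

/* Unconditional 64-bit byte swap: what ntohll() does on a little-endian host. */
static uint64_t swap64(uint64_t x)
{
	return ((uint64_t)swap32((uint32_t)x) << 32) | swap32((uint32_t)(x >> 32));
}

/* Widen to 64 bits, swap, truncate: the swapped bytes end up in the
 * upper half, so the truncated result is always 0. */
static uint32_t capmask_swap64_truncated(uint32_t be_mask)
{
	return (uint32_t)swap64((uint64_t)be_mask);
}

/* Same as above, but shifted right by 32 before the cast. */
static uint32_t capmask_swap64_shifted(uint32_t be_mask)
{
	return (uint32_t)(swap64((uint64_t)be_mask) >> 32);
}

/* 32-bit swap applied directly to the 32-bit value. */
static uint32_t capmask_swap32(uint32_t be_mask)
{
	return swap32(be_mask);
}

/* Small self-check using an arbitrary example mask. */
static void capmask_self_check(void)
{
	uint32_t be_mask = swap32(0x02510a68u);	/* example value, byte-swapped */

	assert(capmask_swap64_truncated(be_mask) == 0);
	assert(capmask_swap64_shifted(be_mask) == 0x02510a68u);
	assert(capmask_swap32(be_mask) == 0x02510a68u);
}
```

Under these assumptions the 32-bit swap and the shifted 64-bit swap give the same value, while the plain cast of the 64-bit result loses it.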
infiniband-diags/src/ibstat.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/infiniband-diags/src/ibstat.c b/infiniband-diags/src/ibstat.c index 7985be1..99af9a8 100644 --- a/infiniband-diags/src/ibstat.c +++ b/infiniband-diags/src/ibstat.c @@ -111,7 +111,7 @@ port_dump(umad_port_t *port, int alone) printf("%sBase lid: %d\n", pre, port->base_lid); printf("%sLMC: %d\n", pre, port->lmc); printf("%sSM lid: %d\n", pre, port->sm_lid); - printf("%sCapability mask: 0x%08x\n", pre, (unsigned)ntohll(port->capmask)); + printf("%sCapability mask: 0x%08x\n", pre, (unsigned)(ntohl((uint32_t)(port->capmask)))); printf("%sPort GUID: 0x%016llx\n", pre, (long long unsigned)ntohll(port->port_guid)); return 0; } -- 1.6.2.GIT From vlad at lists.openfabrics.org Fri Apr 24 03:23:16 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Fri, 24 Apr 2009 03:23:16 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090424-0200 daily build status Message-ID: <20090424102316.7C22AE611A5@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with 
linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From yosefe at voltaire.com Fri Apr 24 03:45:17 2009 From: yosefe at voltaire.com (Yossi Etigin) Date: Fri, 24 Apr 2009 13:45:17 +0300 Subject: [ofa-general] Possible bug of inkernel OFED of RHEL5.3? In-Reply-To: <49F12CB4.6030302@sun.com> References: <49F12CB4.6030302@sun.com> Message-ID: <49F1983D.8080204@voltaire.com> Liang Zhen wrote: > Hi there, > I've posted this in rhel5-list, but I'm not sure whether it's the right > place so I post it here again... > > We got this assertion while running inkernel OFED of RHEL5.3: > Thanks, this is already fixed in OFED 1.4 pretty much the same way. 
--Yossi From celine.bourde at ext.bull.net Fri Apr 24 04:13:18 2009 From: celine.bourde at ext.bull.net (Celine Bourde) Date: Fri, 24 Apr 2009 13:13:18 +0200 Subject: [Fwd: Re: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition]] In-Reply-To: <49F07710.3070002@opengridcomputing.com> References: <49F02280.7010005@ext.bull.net> <49F07710.3070002@opengridcomputing.com> Message-ID: <49F19ECE.9080007@ext.bull.net> Hi Steve, This email summarizes the situation: Standard mount -> OK --------------------- [root at twind ~]# mount -o rw 192.168.0.215:/vol0 /mnt/ Command works fine. rdma mount -> KO ----------------- [root at twind ~]# mount -o rdma,port=2050 192.168.0.215:/vol0 /mnt/ Command blocks ! I should perform Ctr+C to kill process. or [root at twind ofa_kernel-1.4.1]# strace mount.nfs 192.168.0.215:/vol0 /mnt/ -o rdma,port=2050 [..] fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 connect(3, {sa_family=AF_INET, sin_port=htons(610), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 fcntl(3, F_SETFL, O_RDWR) = 0 sendto(3, "-3\245\357\0\0\0\0\0\0\0\2\0\1\206\270\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0"..., 40, 0, {sa_family=AF_INET, sin_port=htons(610), sin_addr=inet_addr("127.0.0.1")}, 16) = 40 poll([{fd=3, events=POLLIN}], 1, 3000) = 1 ([{fd=3, revents=POLLIN}]) recvfrom(3, "-3\245\357\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 8800, MSG_DONTWAIT, {sa_family=AF_INET, sin_port=htons(610), sin_addr=inet_addr("127.0.0.1")}, [16]) = 24 close(3) = 0 mount("192.168.0.215:/vol0", "/mnt", "nfs", 0, "rdma,port=2050,addr=192.168.0.215" ..same problem [root at twind tmp]# dmesg rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16 rpcrdma: connection to 192.168.0.215:2050 closed (-103) rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16 rpcrdma: connection to 192.168.0.215:2050 closed (-103) rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16 rpcrdma: connection to 192.168.0.215:2050 closed (-103) rpcrdma: connection to 
192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16 rpcrdma: connection to 192.168.0.215:2050 closed (-103) Rdma cm tests ------------- * With ib_rdma_bw tool : [root at twing ~]# ib_rdma_bw -c 4960: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=1 | 4960: Local address: LID 0000, QPN 000000, PSN 0x24cafe RKey 0x18002400 VAddr 0x007fd3a03da000 4960: Remote address: LID 0000, QPN 000000, PSN 0x5f7a53, RKey 0x20002700 VAddr 0x007fbac1525000 [root at twind ofa_kernel-1.4.1]# ib_rdma_bw -c 192.168.0.215 31739: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=1 | 31739: Local address: LID 0000, QPN 000000, PSN 0x5f7a53 RKey 0x20002700 VAddr 0x007fbac1525000 31739: Remote address: LID 0000, QPN 000000, PSN 0x24cafe, RKey 0x18002400 VAddr 0x007fd3a03da000 Conflicting CPU frequency values detected: 2667.000000 != 2000.000000 31739: Bandwidth peak (#0 to #569): 0 MB/sec 31739: Bandwidth average: 0 MB/sec 31739: Service Demand peak (#0 to #569): 1949 cycles/KB 31739: Service Demand Avg : 1949 cycles/KB * With rping tool : [root at twing ~]# rping -s server DISCONNECT EVENT... 
wait for RDMA_READ_ADV state 9 cq completion failed status 5 [root at twind ofa_kernel-1.4.1]# rping -Vv -C14 -c -a 192.168.0.215 ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA ping data: rdma-ping-10: KLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA ping data: rdma-ping-11: LMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzAB ping data: rdma-ping-12: MNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABC ping data: rdma-ping-13: NOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCD cq completion failed status 5 client DISCONNECT EVENT... My configuration : ------------------- OFED-1.4.1-rc3 modules (ip_ipoib, mlx4_ib, rdma_cm, etc.) [root at twing ~]# cat /proc/fs/nfsd/portlist rdma 2050 tcp 2049 udp 2049 [root at twind tmp]# mount.nfs -V mount.nfs (linux nfs-utils 1.1.6) [root at twind tmp]# rpm -qf /usr/bin/rping librdmacm-utils-1.0.8-1.ofed1.4.1.rc3 [root at twind tmp]# rpm -qf /usr/bin/ib_rdma_bw perftest-1.2-1.ofed1.4.1.rc3 [root at twind tmp]# uname -ar Linux twind 2.6.27 #2 SMP Thu Apr 9 18:38:19 CEST 2009 x86_64 x86_64 x86_64 GNU/Linux [root at twind tmp]# cat /etc/redhat-release Red Hat Enterprise Linux Server release 5.3 Beta (Tikanga) Celine. 
Steve Wise wrote: > Celine Bourde wrote: > >> Hi, >> >> I've updated nfs-utils package: >> >> [root at my_host ~]# mount.nfs -V >> >> mount.nfs (linux nfs-utils 1.1.6) >> >> >>> [root at my_host ~]# strace mount.nfs 192.168.0.215:/vol0 /mnt/ -o >>> rdma,port=2050 >>> >>> Does it work without rdma? >>> >> The problem is exactly the same without rdma: >> >> [root at my_host ~]# strace mount.nfs 192.168.0.215:/vol0 /mnt/ -o >> rw,port=2050 >> >> [..] >> > You cannot use port 2050 for tcp mounts. So remove the 'port=2050' and > it will attempt a tcp mount to port 2049. > > Steve. > > > From song.xian-guang at hotmail.com Fri Apr 24 04:36:05 2009 From: song.xian-guang at hotmail.com (SongXian-Guang) Date: Fri, 24 Apr 2009 19:36:05 +0800 Subject: [ofa-general] How can I get IPoIB interface ibX's GUID under linux? Message-ID: Hi folks, I am new to infiniband, so forgive me for asking some shallow questions. I would very appreciate your help if you can give me some tips, thanks in advance. 
I don't have any HCA card available, so all my questions are based on imagination ^_^ After loading the ipoib driver (from OFED), I assume ib0, ib1 and ib2 appear in the system. My question is: how can I know which ibX interface corresponds to which HCA port? How can I get the mapping between ibX and port GUID? Will the command "ip show dev ibX" help? Can I get the port's GUID from its output? regards, beta. From brian at sun.com Fri Apr 24 04:52:37 2009 From: brian at sun.com (Brian J. Murrell) Date: Fri, 24 Apr 2009 07:52:37 -0400 Subject: [ofa-general] Possible bug of inkernel OFED of RHEL5.3? In-Reply-To: <49F1983D.8080204@voltaire.com> References: <49F12CB4.6030302@sun.com> <49F1983D.8080204@voltaire.com> Message-ID: <1240573957.17704.1609.camel@pc.interlinx.bc.ca> On Fri, 2009-04-24 at 13:45 +0300, Yossi Etigin wrote: > Liang Zhen wrote: > > Hi there, > > I've posted this in rhel5-list, but I'm not sure whether it's the right > > place so I post it here again... > > > > We got this assertion while running inkernel OFED of RHEL5.3: > > > > Thanks, > this is already fixed in OFED 1.4 pretty much the same way. So, is this a confirmation that the OFED in question (in the RHEL 5.3 kernel) is in fact broken? Is the breakage serious enough that one should want to avoid using that version? b. From yosefe at voltaire.com Fri Apr 24 05:30:36 2009 From: yosefe at voltaire.com (Yossi Etigin) Date: Fri, 24 Apr 2009 15:30:36 +0300 Subject: [ofa-general] Possible bug of inkernel OFED of RHEL5.3? 
In-Reply-To: <1240573957.17704.1609.camel@pc.interlinx.bc.ca> References: <49F12CB4.6030302@sun.com> <49F1983D.8080204@voltaire.com> <1240573957.17704.1609.camel@pc.interlinx.bc.ca> Message-ID: <49F1B0EC.4040906@voltaire.com> Brian J. Murrell wrote: > So, is this a confirmation that the OFED in question (in the RHEL 5.3 > kernel) is in fact broken? Is the breakage serious enough that one > should want to avoid using that version? > > b. > Liang's report is enough to show that it indeed has a bug. Looks like the only affected case is setting an IPoIB interface to connected mode via sysfs. This bug can be avoided if you avoid using connected mode and make sure all network/openib start scripts do not use connected mode. If you want to use connected mode, you'll probably want to upgrade the OFED version. --Yossi From bart.vanassche at gmail.com Fri Apr 24 05:36:34 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Fri, 24 Apr 2009 14:36:34 +0200 Subject: [ofa-general] Possible bug of inkernel OFED of RHEL5.3? In-Reply-To: <49F1B0EC.4040906@voltaire.com> References: <49F12CB4.6030302@sun.com> <49F1983D.8080204@voltaire.com> <1240573957.17704.1609.camel@pc.interlinx.bc.ca> <49F1B0EC.4040906@voltaire.com> Message-ID: On Fri, Apr 24, 2009 at 2:30 PM, Yossi Etigin wrote: > Brian J. Murrell wrote: >> So, is this a confirmation that the OFED in question (in the RHEL 5.3 >> kernel) is in fact broken? Is the breakage serious enough that one >> should want to avoid using that version? > > Liang's report is enough to show that it indeed has a bug. > > Looks like the only affected case is setting an IPoIB interface to connected > mode via sysfs. This bug can be avoided if you avoid using connected mode and make > sure all network/openib start scripts do not use connected mode. > > If you want to use connected mode, you'll probably want to upgrade the OFED version. Replacing RHEL InfiniBand kernel drivers by OFED voids official RHEL support. 
Shouldn't Red Hat fix this issue?

Bart.

From gmpc at sanger.ac.uk Fri Apr 24 05:39:30 2009
From: gmpc at sanger.ac.uk (Guy Coates)
Date: Fri, 24 Apr 2009 13:39:30 +0100
Subject: [ofa-general] How can I get IPoIB interface ibX's GUID under linux?
In-Reply-To:
References:
Message-ID: <49F1B302.1070704@sanger.ac.uk>

You can get the address out of /sys/class/net/ibX/address

Cheers,

Guy

--
Dr. Guy Coates, Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 496802

From Zhen.Liang at Sun.COM Fri Apr 24 05:54:37 2009
From: Zhen.Liang at Sun.COM (Liang Zhen)
Date: Fri, 24 Apr 2009 20:54:37 +0800
Subject: [ofa-general] Possible bug of inkernel OFED of RHEL5.3?
In-Reply-To:
References: <49F12CB4.6030302@sun.com> <49F1983D.8080204@voltaire.com> <1240573957.17704.1609.camel@pc.interlinx.bc.ca> <49F1B0EC.4040906@voltaire.com>
Message-ID: <49F1B68D.6010008@sun.com>

Bart Van Assche wrote:
>
> Replacing RHEL InfiniBand kernel drivers by OFED voids official RHEL
> support. Shouldn't Red Hat fix this issue ?
>
> Bart.
>
Hope they will, I've already posted it in rhel-list

Regards
Liang

From swise at opengridcomputing.com Fri Apr 24 07:40:44 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 24 Apr 2009 09:40:44 -0500
Subject: [Fwd: Re: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition]]
In-Reply-To: <49F19ECE.9080007@ext.bull.net>
References: <49F02280.7010005@ext.bull.net> <49F07710.3070002@opengridcomputing.com> <49F19ECE.9080007@ext.bull.net>
Message-ID: <49F1CF6C.3090703@opengridcomputing.com>

Hey Celine,

Thanks for gathering all this info! So the rdma connections work fine with everything _but_ nfsrdma. And errno 103 (ECONNABORTED) indicates the connection was aborted, maybe by the server (since no failures are logged by the client).

More below:

Celine Bourde wrote:
> Hi Steve,
>
> This email summarizes the situation:
>
> Standard mount -> OK
> ---------------------
>
> [root at twind ~]# mount -o rw 192.168.0.215:/vol0 /mnt/
> Command works fine.
>
> rdma mount -> KO
> -----------------
>
> [root at twind ~]# mount -o rdma,port=2050 192.168.0.215:/vol0 /mnt/
> Command blocks! I have to press Ctrl+C to kill the process.
>
> or
>
> [root at twind ofa_kernel-1.4.1]# strace mount.nfs 192.168.0.215:/vol0
> /mnt/ -o rdma,port=2050
> [..]
> fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> connect(3, {sa_family=AF_INET, sin_port=htons(610),
> sin_addr=inet_addr("127.0.0.1")}, 16) = 0
> fcntl(3, F_SETFL, O_RDWR) = 0
> sendto(3,
> "-3\245\357\0\0\0\0\0\0\0\2\0\1\206\270\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 40, 0, {sa_family=AF_INET, sin_port=htons(610),
> sin_addr=inet_addr("127.0.0.1")}, 16) = 40
> poll([{fd=3, events=POLLIN}], 1, 3000) = 1 ([{fd=3, revents=POLLIN}])
> recvfrom(3, "-3\245\357\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0",
> 8800, MSG_DONTWAIT, {sa_family=AF_INET, sin_port=htons(610),
> sin_addr=inet_addr("127.0.0.1")}, [16]) = 24
> close(3) = 0
> mount("192.168.0.215:/vol0", "/mnt", "nfs", 0,
> "rdma,port=2050,addr=192.168.0.215"
> ..same problem
>
> [root at twind tmp]# dmesg
> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32
> ird 16
> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32
> ird 16
> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32
> ird 16
> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32
> ird 16
> rpcrdma: connection to 192.168.0.215:2050 closed (-103)
>
>

Is there anything logged on the server side?

Also, can you try this again, but on both systems do this before attempting the mount:

echo 32768 > /proc/sys/sunrpc/rpc_debug

This will enable all the rpc trace points and add a bunch of logging to /var/log/messages. Maybe that will show us something. I think the server is aborting the connection for some reason.

Steve.
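
As a side note on the "closed (-103)" lines in the dmesg output above: kernel messages of this kind usually carry a negated errno value, and on Linux errno 103 is ECONNABORTED. What follows is a minimal sketch in C of decoding such a status, assuming Linux errno numbering. The helper names are made up for illustration and are not part of rpcrdma or any OFED library.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Kernel log lines such as "closed (-103)" report a negated errno.
 * Return the positive errno value for either sign convention. */
int status_to_errno(int kernel_status)
{
    return kernel_status < 0 ? -kernel_status : kernel_status;
}

/* Return 1 when the status means ECONNABORTED on this platform.
 * (ECONNABORTED is 103 on Linux; other systems may number it differently.) */
int is_connection_aborted(int kernel_status)
{
    return status_to_errno(kernel_status) == ECONNABORTED;
}

/* Write "errno <n>: <text>" into buf, using the C library's own text. */
void describe_status(int kernel_status, char *buf, size_t len)
{
    int err = status_to_errno(kernel_status);

    assert(buf != NULL && len > 0);
    snprintf(buf, len, "errno %d: %s", err, strerror(err));
}
```

Calling describe_status(-103, buf, sizeof buf) and printing buf gives the C library's description for that value, which is a quick way to confirm what a numeric status in the log refers to on the machine at hand.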
From faisal.latif at intel.com Fri Apr 24 09:06:58 2009 From: faisal.latif at intel.com (Faisal Latif) Date: Fri, 24 Apr 2009 11:06:58 -0500 Subject: [ofa-general] [PATCH] RDMA/nes: fix error handling Message-ID: <20090424160658.GA17724@flatif-MOBL> If reg_phys_mem() fails, we need to free memory allocated for MPA frame with private data before returning error. Also moving nes_add_ref() after the reg_phys_mem() is successful. Signed-off-by: Faisal Latif --- drivers/infiniband/hw/nes/nes_cm.c | 5 ++++- 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c index bd49230..13d72b5 100644 --- a/drivers/infiniband/hw/nes/nes_cm.c +++ b/drivers/infiniband/hw/nes/nes_cm.c @@ -2709,7 +2709,6 @@ int nes_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) /* associate the node with the QP */ nesqp->cm_node = (void *)cm_node; cm_node->nesqp = nesqp; - nes_add_ref(&nesqp->ibqp); nes_debug(NES_DBG_CM, "QP%u, cm_node=%p, jiffies = %lu listener = %p\n", nesqp->hwqp.qp_id, cm_node, jiffies, cm_node->listener); @@ -2762,6 +2761,9 @@ int nes_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) nes_debug(NES_DBG_CM, "Unable to register memory region" "for lSMM for cm_node = %p \n", cm_node); + pci_free_consistent(nesdev->pcidev, + nesqp->private_data_len+sizeof(struct ietf_mpa_frame), + nesqp->ietf_frame, nesqp->ietf_frame_pbase); return -ENOMEM; } @@ -2878,6 +2880,7 @@ int nes_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) /* notify OF layer that accept event was successful */ cm_id->add_ref(cm_id); + nes_add_ref(&nesqp->ibqp); cm_event.event = IW_CM_EVENT_ESTABLISHED; cm_event.status = IW_CM_EVENT_STATUS_ACCEPTED; -- 1.5.3.3 From sean.hefty at intel.com Fri Apr 24 10:27:24 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Fri, 24 Apr 2009 10:27:24 -0700 Subject: [ofa-general] [PATCH/Resend] Fixed capability mask problem in ibstat introduec by commit 
722b6c6428c9e4921a81f4a6db2838bcee660bb7
In-Reply-To: <49F16310.1080902@ext.bull.net>
References: <49F16310.1080902@ext.bull.net>
Message-ID: <132D7B1EACCC462387A1C7FB9EAC4F2D@amr.corp.intel.com>

>diff --git a/infiniband-diags/src/ibstat.c b/infiniband-diags/src/ibstat.c
>index 7985be1..99af9a8 100644
>--- a/infiniband-diags/src/ibstat.c
>+++ b/infiniband-diags/src/ibstat.c
>@@ -111,7 +111,7 @@ port_dump(umad_port_t *port, int alone)
> printf("%sBase lid: %d\n", pre, port->base_lid);
> printf("%sLMC: %d\n", pre, port->lmc);
> printf("%sSM lid: %d\n", pre, port->sm_lid);
>- printf("%sCapability mask: 0x%08x\n", pre, (unsigned)ntohll(port->capmask));
>+ printf("%sCapability mask: 0x%08x\n", pre, (unsigned)(ntohl((uint32_t)(port->capmask))));

Casting from 64-bit to 32-bit, then byte swapping doesn't look right. I think the problem may be in libibumad, umad.c, line 166:

if (sys_read_uint64(port_dir, SYS_PORT_CAPMASK, &port->capmask) < 0)
 goto clean;

port->capmask = htonl(port->capmask);

capmask is read as a 64-bit value, but only a 32-bit swap is used. (libibumad is not shared between Linux and Windows, so this problem doesn't show up on Windows.)

- Sean

From sashak at voltaire.com Fri Apr 24 10:47:37 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 24 Apr 2009 20:47:37 +0300
Subject: [ofa-general] Re: [PATCH v3 1/3] Create a new library libibnetdisc
In-Reply-To: <20090423070829.GB8281@sk>
References: <20090403154251.dec181f2.weiny2@llnl.gov> <20090423070829.GB8281@sk>
Message-ID: <20090424174737.GC5465@sk>

On 10:08 Thu 23 Apr , Sasha Khapyorsky wrote:
>
> Applied. Thanks.

I almost pushed this up, but at the last minute I found an issue (fortunately it crashed on one of my machines). I will comment on the original patch.
Sasha From vuhuong at mellanox.com Fri Apr 24 10:54:55 2009 From: vuhuong at mellanox.com (Vu Pham) Date: Fri, 24 Apr 2009 10:54:55 -0700 Subject: [Fwd: Re: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition]] In-Reply-To: <49F1CF6C.3090703@opengridcomputing.com> References: <49F02280.7010005@ext.bull.net> <49F07710.3070002@opengridcomputing.com> <49F19ECE.9080007@ext.bull.net> <49F1CF6C.3090703@opengridcomputing.com> Message-ID: <49F1FCEF.3030305@mellanox.com> Hi Celine, What HCA do you have on your system? Is it ConnectX? If yes, what is its firmware version? -vu > Hey Celine, > > Thanks for gathering all this info! So the rdma connections work fine > with everything _but_ nfsrdma. And errno 103 indicates the connection > was aborted, maybe by the server (since no failures are logged by the > client). > > > More below: > > > Celine Bourde wrote: >> Hi Steve, >> >> This email summarizes the situation: >> >> Standard mount -> OK >> --------------------- >> >> [root at twind ~]# mount -o rw 192.168.0.215:/vol0 /mnt/ >> Command works fine. >> >> rdma mount -> KO >> ----------------- >> >> [root at twind ~]# mount -o rdma,port=2050 192.168.0.215:/vol0 /mnt/ >> Command blocks ! I should perform Ctr+C to kill process. >> >> or >> >> [root at twind ofa_kernel-1.4.1]# strace mount.nfs 192.168.0.215:/vol0 >> /mnt/ -o rdma,port=2050 >> [..] 
> > > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From sashak at voltaire.com Fri Apr 24 10:53:25 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Fri, 24 Apr 2009 20:53:25 +0300 Subject: [ofa-general] Re: [PATCH v3 1/3] Create a new library libibnetdisc In-Reply-To: <20090403154251.dec181f2.weiny2@llnl.gov> References: <20090403154251.dec181f2.weiny2@llnl.gov> Message-ID: <20090424175325.GD5465@sk> On 15:42 Fri 03 Apr , Ira Weiny wrote: > diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h > new file mode 100644 > index 0000000..a882994 > --- /dev/null > +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h > @@ -0,0 +1,188 @@ > +/* > + * Copyright (c) 2008 Lawrence Livermore National Lab. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. 
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + */
> +
> +#ifndef _IBNETDISC_H_
> +#define _IBNETDISC_H_
> +
> +#include
> +#include
> +#include
> +
> +struct ib_fabric; /* forward declare */
> +struct chassis; /* forward declare */
> +struct port; /* forward declare */
> +
> +/** =========================================================================
> + * Node
> + */
> +typedef struct node {
> + struct node *next; /* all node list in fabric */
> + struct ib_fabric *fabric; /* the fabric node belongs to */
> +
> + ib_portid_t path_portid; /* path from "from_node" */
> + int dist; /* num of hops from "from_node" */
> + int smalid;
> + int smalmc;
> +
> + /* quick cache of switchinfo below */
> + int smaenhsp0;
> + /* use libibmad decoder functions for switchinfo */
> + //WHY does this not work???
> + //uint8_t switchinfo[sizeof (ib_switch_info_t)];

This is the right question: sizeof(ib_switch_info_t) < 64.

> + uint8_t switchinfo[64];
> +
> + /* quick cache of info below */
> + uint64_t guid;
> + int type;
> + int numports;
> + /* use libibmad decoder functions for info */
> + uint8_t info[sizeof(ib_node_info_t)];

The same problem exists above, here, and in some other places. These buffers are used as rcvdata with smp_query_via(), which assumes SMP MADs and always copies 64 bytes of data there. So when the actual buffer is smaller, bad things may happen.
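
To make the sizing concern concrete, here is a minimal sketch in C, assuming only what is stated above: the query always writes 64 bytes into the receive buffer. Everything in it is hypothetical. fake_smp_query() stands in for smp_query_via(), SMP_DATA_SIZE mirrors the 64-byte SMP payload, and small_attr is an invented 40-byte attribute. The sketch exercises only the correctly sized case, since writing 64 bytes into a smaller buffer is undefined behaviour.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SMP_DATA_SIZE 64 /* assumed SMP payload size, per the thread */

/* Hypothetical stand-in for a query that always fills SMP_DATA_SIZE bytes,
 * regardless of how large the decoded attribute really is. */
void fake_smp_query(void *rcvbuf, uint8_t fill)
{
    assert(rcvbuf != NULL);
    memset(rcvbuf, fill, SMP_DATA_SIZE);
}

/* Invented attribute layout that is smaller than the SMP payload. */
struct small_attr {
    uint8_t raw[40];
};

struct cached_node {
    /* Sized for the payload, not for sizeof(struct small_attr). */
    uint8_t info[SMP_DATA_SIZE];
    /* With a 40-byte info[] the query would overwrite this field. */
    uint32_t guard;
};

/* Compile-time check: the array size is negative, and the build fails,
 * if info[] is ever made smaller than the payload. */
typedef char info_holds_full_payload[
    sizeof(((struct cached_node *)0)->info) >= SMP_DATA_SIZE ? 1 : -1];

/* Run the query into a correctly sized buffer and return the guard value. */
uint32_t query_and_read_guard(void)
{
    struct cached_node n;

    n.guard = 0xdeadbeefu;
    fake_smp_query(n.info, 0xab);
    return n.guard; /* still 0xdeadbeef: the copy stayed inside info[] */
}
```

The compile-time check is the part worth carrying over to real code: it turns a buffer that is too small for the fixed-size copy into a build error instead of a crash on some machines.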
I'm fixing this with such addition: diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h index a882994..bc108ab 100644 --- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h @@ -57,18 +57,16 @@ typedef struct node { /* quick cache of switchinfo below */ int smaenhsp0; /* use libibmad decoder functions for switchinfo */ - //WHY does this not work??? - //uint8_t switchinfo[sizeof (ib_switch_info_t)]; - uint8_t switchinfo[64]; + uint8_t switchinfo[IB_SMP_DATA_SIZE]; /* quick cache of info below */ uint64_t guid; int type; int numports; /* use libibmad decoder functions for info */ - uint8_t info[sizeof(ib_node_info_t)]; + uint8_t info[IB_SMP_DATA_SIZE]; - char nodedesc[IB_NODE_DESCRIPTION_SIZE]; + char nodedesc[IB_SMP_DATA_SIZE]; struct port **ports; /* in order array of port pointers */ /* the size of this array is info.numports + 1 */ @@ -96,7 +94,7 @@ typedef struct port { uint16_t base_lid; uint8_t lmc; /* use libibmad decoder functions for info */ - uint8_t info[sizeof(ib_port_info_t)]; + uint8_t info[IB_SMP_DATA_SIZE]; } ibnd_port_t; diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index 479bae7..3fd3b76 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -231,7 +231,7 @@ ibnd_find_node_guid(ibnd_fabric_t *fabric, uint64_t guid) ibnd_node_t * ibnd_update_node(ibnd_node_t *node) { - char portinfo_port0[sizeof (ib_port_info_t)]; + char portinfo_port0[IB_SMP_DATA_SIZE]; void *nd = node->nodedesc; int p = 0; struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(node->fabric); Sasha From vuhuong at mellanox.com Fri Apr 24 11:40:32 2009 From: vuhuong at mellanox.com (Vu Pham) Date: Fri, 24 Apr 2009 11:40:32 -0700 Subject: [Fwd: Re: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition]] 
In-Reply-To: <49F1FCEF.3030305@mellanox.com>
References: <49F02280.7010005@ext.bull.net> <49F07710.3070002@opengridcomputing.com> <49F19ECE.9080007@ext.bull.net> <49F1CF6C.3090703@opengridcomputing.com> <49F1FCEF.3030305@mellanox.com>
Message-ID: <49F207A0.6090507@mellanox.com>

Celine,

I'm seeing mlx4 in the log, so it is ConnectX. nfsrdma does not work with any official ConnectX firmware release 2.6.0 because of problems with fast-register work requests between nfsrdma and the firmware. We are currently debugging and fixing those problems.

Do you have direct contact with a Mellanox field application engineer? If so, please contact him/her. If not, I can send you a contact on a private channel.

thanks,
-vu

> Hi Celine,
>
> What HCA do you have on your system? Is it ConnectX? If yes, what is
> its firmware version?
>
> -vu
>
>> Hey Celine,
>>
>> Thanks for gathering all this info! So the rdma connections work
>> fine with everything _but_ nfsrdma. And errno 103 indicates the
>> connection was aborted, maybe by the server (since no failures are
>> logged by the client).
>>
>>
>> More below:
>>
>>
>> Celine Bourde wrote:
>>> Hi Steve,
>>>
>>> This email summarizes the situation:
>>>
>>> Standard mount -> OK
>>> ---------------------
>>>
>>> [root at twind ~]# mount -o rw 192.168.0.215:/vol0 /mnt/
>>> Command works fine.
>>>
>>> rdma mount -> KO
>>> -----------------
>>>
>>> [root at twind ~]# mount -o rdma,port=2050 192.168.0.215:/vol0 /mnt/
>>> Command blocks ! I should perform Ctr+C to kill process.
>>>
>>> or
>>>
>>> [root at twind ofa_kernel-1.4.1]# strace mount.nfs 192.168.0.215:/vol0
>>> /mnt/ -o rdma,port=2050
>>> [..]
From arlin.r.davis at intel.com Fri Apr 24 14:58:48 2009
From: arlin.r.davis at intel.com (Arlin Davis)
Date: Fri, 24 Apr 2009 14:58:48 -0700
Subject: [ofa-general] IPoIB performance numbers?
Message-ID:

Does anyone have IPoIB performance numbers comparing connected versus unconnected (datagram) mode?

Thanks,

-arlin

From sashak at voltaire.com Sat Apr 25 03:07:10 2009
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 25 Apr 2009 13:07:10 +0300
Subject: [ofa-general] Re: [PATCH 4/4] ib-mgmt/ibn3 branch: libibnetdisc add windows support
In-Reply-To:
References: <5A8239DBACE14A9281405317269D03A3@amr.corp.intel.com>
Message-ID: <20090425100710.GA28604@sk>

Hi Sean,

On 12:06 Tue 21 Apr , Sean Hefty wrote:
>
> diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
> index a882994..370ae31 100644
> --- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
> +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h
> @@ -37,6 +37,7 @@
> #include
> #include
> #include
> +#include

Why is this inclusion needed? mad_osd.h is included via mad.h.
Sasha From vlad at lists.openfabrics.org Sat Apr 25 03:22:23 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sat, 25 Apr 2009 03:22:23 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090425-0200 daily build status Message-ID: <20090425102223.CDC12E61290@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.23 Passed on 
ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From sashak at voltaire.com Sat Apr 25 03:32:17 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 25 Apr 2009 13:32:17 +0300 Subject: [ofa-general] Re: [PATCH v3 3/3] Convert ibnetdiscover to use new ibnetdisc library. In-Reply-To: <20090423100206.c2621310.weiny2@llnl.gov> References: <20090403154301.f656e7a4.weiny2@llnl.gov> <20090423082535.GD8281@sk> <20090423100206.c2621310.weiny2@llnl.gov> Message-ID: <20090425103216.GB28604@sk> On 10:02 Thu 23 Apr , Ira Weiny wrote: > > Somewhere along the line I broke this and then this got put into the > ibnetdiscover patch. This should not even have been here. Anyway, LDFLAGS is > required for the -L I believe? 'info automake' says (Top -> Programs -> A Program): `PROG_LDADD' is inappropriate for passing program-specific linker flags (except for `-l', `-L', `-dlopen' and `-dlpreopen'). So, use the `PROG_LDFLAGS' variable for this purpose. So '-L' is exception suitable for LDADD. Sasha From sashak at voltaire.com Sat Apr 25 07:42:24 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 25 Apr 2009 17:42:24 +0300 Subject: [ofa-general] [PATCH] ibnetdiscover: fix types to avoid portability castings In-Reply-To: References: <5A8239DBACE14A9281405317269D03A3@amr.corp.intel.com> Message-ID: <20090425144224.GC28604@sk> We did this before, but somehow it was lost in libibnetdisc patches. 
Signed-off-by: Sasha Khapyorsky --- .../libibnetdisc/include/infiniband/ibnetdisc.h | 4 ++-- infiniband-diags/src/ibnetdiscover.c | 6 +++--- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h index 8324ca9..4fe0f21 100644 --- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h @@ -104,10 +104,10 @@ typedef struct port { typedef struct chassis { struct chassis *next; uint64_t chassisguid; - int chassisnum; + unsigned char chassisnum; /* generic grouping by SystemImageGUID */ - int nodecount; + unsigned char nodecount; ibnd_node_t *nodes; /* specific to voltaire type nodes */ diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index 2ca696e..e874fe4 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -205,12 +205,12 @@ out_ids(ibnd_node_t *node, int group, char *chname) } uint64_t -out_chassis(ibnd_fabric_t *fabric, int chassisnum) +out_chassis(ibnd_fabric_t *fabric, unsigned char chassisnum) { uint64_t guid; - fprintf(f, "\nChassis %d", chassisnum); - guid = ibnd_get_chassis_guid(fabric, (unsigned char) chassisnum); + fprintf(f, "\nChassis %u", chassisnum); + guid = ibnd_get_chassis_guid(fabric, chassisnum); if (guid) fprintf(f, " (guid 0x%" PRIx64 ")", guid); fprintf(f, "\n"); -- 1.6.1.2.319.gbd9e From sashak at voltaire.com Sat Apr 25 07:43:03 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 25 Apr 2009 17:43:03 +0300 Subject: [ofa-general] Re: [PATCH 1/4] ib-mgmt/ibn3 branch: diags updated for continued windows support In-Reply-To: <5A8239DBACE14A9281405317269D03A3@amr.corp.intel.com> References: <5A8239DBACE14A9281405317269D03A3@amr.corp.intel.com> Message-ID: <20090425144303.GD28604@sk> On 12:02 Tue 21 Apr , Sean Hefty wrote: > Signed-off-by: Sean Hefty All 
applied. Thanks. Sasha From sashak at voltaire.com Sat Apr 25 08:54:41 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 25 Apr 2009 18:54:41 +0300 Subject: [ofa-general] Re: [PATCH 8/8] Convert ibqueryerrors.pl to C and use new ibnetdisc library. In-Reply-To: <20090423133120.acf0af63.weiny2@llnl.gov> References: <20090423133120.acf0af63.weiny2@llnl.gov> Message-ID: <20090425155441.GE28604@sk> On 13:31 Thu 23 Apr , Ira Weiny wrote: > diff --git a/infiniband-diags/src/ibqueryerrors.c b/infiniband-diags/src/ibqueryerrors.c > new file mode 100644 > index 0000000..9d96190 > --- /dev/null > +++ b/infiniband-diags/src/ibqueryerrors.c > @@ -0,0 +1,469 @@ > +/* > + * Copyright (c) 2004-2007 Voltaire Inc. All rights reserved. > + * Copyright (c) 2007 Xsigo Systems Inc. All rights reserved. > + * Copyright (c) 2008 Lawrence Livermore National Lab. All rights reserved. > + * > + * This software is available to you under a choice of one of two > + * licenses. You may choose to be licensed under the terms of the GNU > + * General Public License (GPL) Version 2, available from the file > + * COPYING in the main directory of this source tree, or the > + * OpenIB.org BSD license below: > + * > + * Redistribution and use in source and binary forms, with or > + * without modification, are permitted provided that the following > + * conditions are met: > + * > + * - Redistributions of source code must retain the above > + * copyright notice, this list of conditions and the following > + * disclaimer. > + * > + * - Redistributions in binary form must reproduce the above > + * copyright notice, this list of conditions and the following > + * disclaimer in the documentation and/or other materials > + * provided with the distribution. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + */ > + > +#if HAVE_CONFIG_H > +# include > +#endif /* HAVE_CONFIG_H */ > + > +#define _GNU_SOURCE > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include AFAIR WinOF doesn't like such inclusion and uses #include instead. I'm changing. > +#include > +#include > + > +#include "ibdiag_common.h" > + > +char *argv0 = "ibqueryerrors"; argv0 variable is not needed if you are using ibdiag_common stuff. Removing. > +static FILE *f; I don't see where 'f' is used (except 'f = stdout;' below). Removing. [snip...] > +static int process_opt(void *context, int ch, char *optarg) > +{ > + switch (ch) { > + case 's': > + calculate_suppressed_fields(optarg); > + break; > + case 'c': > + /* Right now this is the only "common" error */ > + add_suppressed(IB_PC_ERR_SWITCH_REL_F); > + break; > + case 1: > + node_name_map_file = strdup(optarg); > + break; > + case 2: > + data_counters++; > + break; > + case 3: > + all_nodes++; > + break; > + case 'S': > + switch_guid_str = strdup(optarg); Why should optarg be strdup()ed? 
> + switch_guid = (uint64_t)strtoull(switch_guid_str, 0, 0); > + break; > + case 'D': > + dr_path = strdup(optarg); > + break; > + case 'r': > + port_config++; > + break; > + case 'R': /* nop */ > + break; > + default: > + return -1; > + } > + > + return 0; > +} > + > +int > +main(int argc, char **argv) > +{ > + int rc = 0; > + ibnd_fabric_t *fabric = NULL; > + > + int mgmt_classes[4] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS, IB_PERFORMANCE_CLASS}; > + > + const struct ibdiag_opt opts[] = { > + { "suppress", 's', 1, "", "suppress errors listed" }, > + { "suppress-common", 'c', 0, NULL, "suppress some of the common counters" }, > + { "node-name-map", 1, 1, "", "node name map file" }, > + { "switch", 'S', 1, "", "query only (hex format)"}, > + { "Direct", 'D', 1, "", "query only switch specified by "}, > + { "report-port", 'r', 0, NULL, "report port configuration information"}, > + { "GNDN", 'R', 0, NULL, "(This option is obsolete and does nothing)"}, > + { "data", 2, 0, NULL, "include the data counters in the output"}, > + { "all", 3, 0, NULL, "output all nodes (not just switches)"}, > + { 0 } > + }; > + char usage_args[] = ""; > + > + ibdiag_process_opts(argc, argv, "sDLG", "snSrR", opts, process_opt, > + usage_args, NULL); > + > + f = stdout; > + > + argc -= optind; > + argv += optind; > + > + if (ibverbose) > + ibnd_debug(1); > + > + ibmad_port = mad_rpc_open_port(ibd_ca, ibd_ca_port, mgmt_classes, 4); > + if (!ibmad_port) > + IBERROR("Failed to open port; %s:%d\n", ibd_ca, ibd_ca_port); > + > + node_name_map = open_node_name_map(node_name_map_file); > + > + if (switch_guid) { > + /* limit the scan the fabric around the target */ > + ib_portid_t portid = {0}; > + > + if (ib_resolve_portid_str_via(&portid, switch_guid_str, IB_DEST_GUID, > + ibd_sm_id, ibmad_port) < 0) { > + fprintf(stderr, "can't resolve destination port %s %p\n", > + switch_guid_str, ibd_sm_id); > + rc = 1; > + goto close_port; > + } > + > + if ((fabric = ibnd_discover_fabric(ibmad_port, 
ibd_timeout, &portid, 1)) == NULL) { > + fprintf(stderr, "discover failed\n"); > + rc = 1; > + goto close_port; > + } > + } else { > + if ((fabric = ibnd_discover_fabric(ibmad_port, ibd_timeout, NULL, -1)) == NULL) { > + fprintf(stderr, "discover failed\n"); > + rc = 1; > + goto close_port; > + } Above you are using IBERROR(), here it is fprintf(stderr, ...). Could it be consistent? (If yes, it can be a subsequent patch.) > + } > + > + report_suppressed(); > + > + if (switch_guid) { > + ibnd_node_t *node = ibnd_find_node_guid(fabric, switch_guid); > + print_node(node, NULL); > + } else if (dr_path) { > + ibnd_node_t *node = ibnd_find_node_dr(fabric, dr_path); > + print_node(node, NULL); When a GUID or DR path is specified we don't need to discover the whole fabric, but can try to resolve the LID using the SA or by querying PortInfo. Although when a GUID is specified and the SA is not responsive there is probably no other choice than discovery. Sasha From sashak at voltaire.com Sat Apr 25 10:03:58 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 25 Apr 2009 20:03:58 +0300 Subject: [ofa-general] [PATCH v3 1/3] Create a new library libibnetdisc In-Reply-To: <28769EB1C4FB4999975354DD93FC2107@amr.corp.intel.com> References: <20090403154251.dec181f2.weiny2@llnl.gov> <20090423070210.GA8281@sk> <20090423134055.164ab69f.weiny2@llnl.gov> <28769EB1C4FB4999975354DD93FC2107@amr.corp.intel.com> Message-ID: <20090425170358.GF28604@sk> On 14:29 Thu 23 Apr , Sean Hefty wrote: > > Yes - that's the one it picks up. Adding a wrapper makes sense to me. (I don't > think that declaring a variable as extern is sufficient to share it across > library boundaries in windows.) Should it just be defined with MAD_EXPORT (on Windows)?
Sasha From sashak at voltaire.com Sat Apr 25 10:10:36 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 25 Apr 2009 20:10:36 +0300 Subject: [ofa-general] [PATCH v3 1/3] Create a new library libibnetdisc In-Reply-To: <20090423150943.c512fecb.weiny2@llnl.gov> References: <20090403154251.dec181f2.weiny2@llnl.gov> <20090423070210.GA8281@sk> <20090423134055.164ab69f.weiny2@llnl.gov> <28769EB1C4FB4999975354DD93FC2107@amr.corp.intel.com> <20090423150943.c512fecb.weiny2@llnl.gov> Message-ID: <20090425171036.GG28604@sk> On 15:09 Thu 23 Apr , Ira Weiny wrote: > > diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c > index 410e2dd..cee4c95 100644 > --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c > +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c > @@ -59,6 +59,7 @@ > > static int timeout_ms = 2000; > static int show_progress = 0; > +static int ibnd_debug_flg = 0; > > void > decode_port_info(ibnd_port_t *port) > @@ -638,11 +639,13 @@ void > ibnd_debug(int i) > { > if (i) { > - ibdebug++; > + ibnd_debug_flg = 1; > + madrpc_show_debug(1); ibdebug can be incremented couple of times - debug prints will refer its value, not just yes or no. Sasha From sashak at voltaire.com Sat Apr 25 10:13:35 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 25 Apr 2009 20:13:35 +0300 Subject: [ofa-general] [PATCH] ibdiag_common: remove duplicated ibdebug variable In-Reply-To: <20090423134055.164ab69f.weiny2@llnl.gov> References: <20090403154251.dec181f2.weiny2@llnl.gov> <20090423070210.GA8281@sk> <20090423134055.164ab69f.weiny2@llnl.gov> Message-ID: <20090425171335.GH28604@sk> ibdebug is defined already in libibmad. Remove duplication in ibdiag_common.c. 
Signed-off-by: Sasha Khapyorsky --- infiniband-diags/include/ibdiag_common.h | 1 - infiniband-diags/src/ibdiag_common.c | 1 - 2 files changed, 0 insertions(+), 2 deletions(-) diff --git a/infiniband-diags/include/ibdiag_common.h b/infiniband-diags/include/ibdiag_common.h index 52fd147..5c74b07 100644 --- a/infiniband-diags/include/ibdiag_common.h +++ b/infiniband-diags/include/ibdiag_common.h @@ -37,7 +37,6 @@ #include -extern int ibdebug; extern int ibverbose; extern char *ibd_ca; extern int ibd_ca_port; diff --git a/infiniband-diags/src/ibdiag_common.c b/infiniband-diags/src/ibdiag_common.c index c0421f6..4ffa3f0 100644 --- a/infiniband-diags/src/ibdiag_common.c +++ b/infiniband-diags/src/ibdiag_common.c @@ -53,7 +53,6 @@ #include #include -int ibdebug; int ibverbose; char *ibd_ca; int ibd_ca_port; -- 1.6.1.2.319.gbd9e From mike.marty at gmail.com Sat Apr 25 10:27:23 2009 From: mike.marty at gmail.com (Mike Marty) Date: Sat, 25 Apr 2009 12:27:23 -0500 Subject: [ofa-general] madvise() MADV_DONTNEED to IB memory region Message-ID: <229af89c0904251027x3e3ae0e8uc44070bb081a9e7a@mail.gmail.com> Can calling madvise() with the MADV_DONTNEED advice result in pinned IB memory (ibv_reg_mr) being unpinned? From sashak at voltaire.com Sat Apr 25 10:57:10 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 25 Apr 2009 20:57:10 +0300 Subject: [ofa-general] Re: [PATCH 0/5] Follow on patch series to libibnetdisc including converting ibqueryerrors.pl In-Reply-To: <20090422185441.6f8601dc.weiny2@llnl.gov> References: <20090422185441.6f8601dc.weiny2@llnl.gov> Message-ID: <20090425175710.GI28604@sk> Hi, On 18:54 Wed 22 Apr , Ira Weiny wrote: > > When do you plan to merge pq/ibn3? I merged all libibnetdiscover related patch series with noted fixes into new branch pq/ibn4 and pushed it out. Please verify it once again (including win compatibility) before merging upstream. 
I also noticed that ibnetdiscover now changes the order of switches and ports (it is now from high to low). That may not be a bad thing for debugging scripts which use this output, but for human readability I think the reverse order (from low to high) is preferable. Sasha From sashak at voltaire.com Sat Apr 25 11:24:33 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 25 Apr 2009 21:24:33 +0300 Subject: [ofa-general] Re: [PATCH] libibmad: added support for handling of BM (Baseboard management) MADs - FIXED without rmpp In-Reply-To: <200903191625.24559.itaib@mellanox.com> References: <200903191625.24559.itaib@mellanox.com> Message-ID: <20090425182433.GJ28604@sk> On 16:25 Thu 19 Mar , Itai Baz wrote: > This patch adds support for handling of BM (Baseboard management) MADs. > > I checked Hal's comment regarding RMPP, indeed there is no need for it for BM, so I have removed rmpp, and i'm using now mad_rpc > > > > Signed-off-by: Itai Baz Applied. Thanks. Sasha From sashak at voltaire.com Sat Apr 25 12:01:13 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sat, 25 Apr 2009 22:01:13 +0300 Subject: [ofa-general] Re: [PATCH] libibmad: added support for handling of BM (Baseboard management) MADs - FIXED without rmpp In-Reply-To: <200903191625.24559.itaib@mellanox.com> References: <200903191625.24559.itaib@mellanox.com> Message-ID: <20090425190113.GK28604@sk> On 16:25 Thu 19 Mar , Itai Baz wrote: > This patch adds support for handling of BM (Baseboard management) MADs. > +#if HAVE_CONFIG_H > +# include > +#endif /* HAVE_CONFIG_H */ > + > +#include > +#include > +#include > +#include > +#include > +#include When committing I removed the unneeded include files to prevent portability issues.
Sasha From sashak at voltaire.com Sat Apr 25 14:02:55 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 26 Apr 2009 00:02:55 +0300 Subject: [ofa-general] [PATCH/Resend] Fixed capability mask problem in ibstat introduec by commit 722b6c6428c9e4921a81f4a6db2838bcee660bb7 In-Reply-To: <132D7B1EACCC462387A1C7FB9EAC4F2D@amr.corp.intel.com> References: <49F16310.1080902@ext.bull.net> <132D7B1EACCC462387A1C7FB9EAC4F2D@amr.corp.intel.com> Message-ID: <20090425210255.GL28604@sk> On 10:27 Fri 24 Apr , Sean Hefty wrote: > > I think the problem may be in libibumad, umad.c, line 166: > > if (sys_read_uint64(port_dir, SYS_PORT_CAPMASK, &port->capmask) < 0) > goto clean; > > port->capmask = htonl(port->capmask); Yes, the problem is likely here. OTOH I cannot understand why port->capmask is defined as uint64_t and not as a 32-bit type. The kernel uses a 32-bit value and it is shown in this file as 0x%0x. What about converting the type of port->capmask to uint32_t? Sasha From sashak at voltaire.com Sat Apr 25 14:06:14 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 26 Apr 2009 00:06:14 +0300 Subject: [ofa-general] Re: [PATCH] opensm/doc/performance-manager-HOWTO.txt: Indicate (previously implied) master state In-Reply-To: <20090421111300.GA483@comcast.net> References: <20090421111300.GA483@comcast.net> Message-ID: <20090425210614.GM28604@sk> On 07:13 Tue 21 Apr , Hal Rosenstock wrote: > > Also, fix some typos > > Signed-off-by: Hal Rosenstock Applied. Thanks.
Sasha From sashak at voltaire.com Sat Apr 25 14:10:01 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 26 Apr 2009 00:10:01 +0300 Subject: [ofa-general] Re: [PATCH] opensm/osm_perfmgr.c: Add assert in sweep_hop_1 In-Reply-To: <20090421110305.GA409@comcast.net> References: <20090421110305.GA409@comcast.net> Message-ID: <20090425211001.GN28604@sk> On 07:03 Tue 21 Apr , Hal Rosenstock wrote: > > as found in osm_state_mgr.c:state_mgr_sweep_hop_1 > > Signed-off-by: Hal Rosenstock > --- > diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c > index 8d5ed97..20ee57d 100644 > --- a/opensm/opensm/osm_perfmgr.c > +++ b/opensm/opensm/osm_perfmgr.c > @@ -568,6 +568,8 @@ static int sweep_hop_1(osm_sm_t * sm) > } > > p_node = p_port->p_node; > + CL_ASSERT(p_node); > + The port is created from a node reference in the first place, so this check is not needed. Actually, instead of copying discovery-related code from osm_state_mgr we need to share it (once it was impossible due to the crazy state machine there; now that it has been reworked, I believe we can reuse most of the code). Sasha From monis at Voltaire.COM Sun Apr 26 02:45:36 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Sun, 26 Apr 2009 12:45:36 +0300 Subject: [ofa-general] RE: [PATCH v2] rdma_cm: Add debugfs entries to monitor rdma_cm connections In-Reply-To: <90A07B5E8FAD49C187EAE583A976A3CD@amr.corp.intel.com> References: <49F05AAE.4020606@Voltaire.COM> <90A07B5E8FAD49C187EAE583A976A3CD@amr.corp.intel.com> Message-ID: <49F42D40.5000200@Voltaire.COM> Thanks Sean. I think this takes care of the issues you brought up. ------------------------------------ Create a virtual file under debugfs for each cma device and use it to print information about each rdma_id that is attached to this device. Here is an example of 'cat /sys/kernel/debug/rdma_cm/mthca0_rdma_id'. This example is for a host that runs an rping server (when a remote client is connected to it) and an rping client to a remote server.
TYPE DEVICE PORT NET_DEV SRC_ADDR DST_ADDR SPACE STATE QP_NUM mthca0 0 0.0.0.0:7174 TCP LISTEN 0 IB mthca0 1 ib0 192.30.3.249:46079 192.30.3.248:7174 TCP CONNECT 132102 IB mthca0 1 ib0 192.30.3.249:7174 192.30.3.248:42561 TCP CONNECT 132103 Signed-off-by: Moni Shoua -- drivers/infiniband/core/cma.c | 183 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 183 insertions(+) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 2a2e508..ce393e7 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -51,6 +51,9 @@ #include #include +#include +#include + MODULE_AUTHOR("Sean Hefty"); MODULE_DESCRIPTION("Generic RDMA CM Agent"); MODULE_LICENSE("Dual BSD/GPL"); @@ -59,6 +62,8 @@ MODULE_LICENSE("Dual BSD/GPL"); #define CMA_MAX_CM_RETRIES 15 #define CMA_CM_MRA_SETTING (IB_CM_MRA_FLAG_DELAY | 24) +static struct dentry *cma_root_dentry; + static void cma_add_one(struct ib_device *device); static void cma_remove_one(struct ib_device *device); @@ -86,6 +91,7 @@ struct cma_device { struct completion comp; atomic_t refcount; struct list_head id_list; + struct dentry *rdma_id_dentry; }; enum cma_state { @@ -102,6 +108,47 @@ enum cma_state { CMA_DESTROYING }; +static const char *format_cma_state(enum cma_state s) +{ + switch (s) { + case CMA_IDLE: return "IDLE"; + case CMA_ADDR_QUERY: return "ADDR_QUERY"; + case CMA_ADDR_RESOLVED: return "ADDR_RESOLVED"; + case CMA_ROUTE_QUERY: return "ROUTE_QUERY"; + case CMA_ROUTE_RESOLVED: return "ROUTE_RESOLVED"; + case CMA_CONNECT: return "CONNECT"; + case CMA_DISCONNECT: return "DISCONNECT"; + case CMA_ADDR_BOUND: return "ADDR_BOUND"; + case CMA_LISTEN: return "LISTEN"; + case CMA_DEVICE_REMOVAL: return "DEVICE_REMOVAL"; + case CMA_DESTROYING: return "DESTROYING"; + } + return ""; +} + +static const char *format_port_space(enum rdma_port_space ps) +{ + switch (ps) { + case RDMA_PS_SDP: return "SDP"; + case RDMA_PS_IPOIB: return "IPOIB"; + case RDMA_PS_TCP: return "TCP"; + case 
RDMA_PS_UDP: return "UDP"; + case RDMA_PS_SCTP: return "SCTP"; + } + return ""; +} + +static const char *format_node_type(enum rdma_node_type nt) +{ + if (nt) { + switch (rdma_node_get_transport(nt)) { + case RDMA_TRANSPORT_IB: return "IB"; + case RDMA_TRANSPORT_IWARP: return "IW"; + } + } + return ""; +} + struct rdma_bind_list { struct idr *ps; struct hlist_head owners; @@ -2850,6 +2897,131 @@ static struct notifier_block cma_nb = { .notifier_call = cma_netdev_callback }; +static void *cma_rdma_id_seq_start(struct seq_file *file, loff_t *pos) +{ + struct cma_device *cma_dev = file->private; + void *ret; + + mutex_lock(&lock); + if (*pos == 0) + return SEQ_START_TOKEN; + ret = seq_list_start_head(&cma_dev->id_list, *pos); + return ret; +} + +static void *cma_rdma_id_seq_next(struct seq_file *file, void *v, loff_t *pos) +{ + void *ret; + struct cma_device *cma_dev = file->private; + if (v == SEQ_START_TOKEN) { + ++*pos; + if (!list_empty(&cma_dev->id_list)) + ret = cma_dev->id_list.next; + else + ret = NULL; + } else { + ret = seq_list_next(v, &cma_dev->id_list, pos); + } + return ret; +} + +static void cma_rdma_id_seq_stop(struct seq_file *file, void *iter_ptr) +{ + mutex_unlock(&lock); +} + +static void format_addr(struct sockaddr *sa, char* buf) +{ + switch (sa->sa_family) { + case AF_INET: { + struct sockaddr_in *sin = (struct sockaddr_in *)sa; + sprintf(buf, "%pI4:%u", &sin->sin_addr.s_addr, + be16_to_cpu(cma_port(sa))); + break; + } + case AF_INET6: { + struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)sa; + sprintf(buf, "%pI6:%u", &sin6->sin6_addr, + be16_to_cpu(cma_port(sa))); + break; + } + default: + buf[0] = 0; + } +} + +static int cma_rdma_id_seq_show(struct seq_file *file, void *v) +{ + struct rdma_id_private *id_priv; + char local_addr[64], remote_addr[64]; + + if (!v) + return 0; + if (v == SEQ_START_TOKEN) { + seq_printf(file, + "%-5s %-8s %-5s %-8s %-52s %-52s %-6s %-15s %-8s \n", + "TYPE", "DEVICE", "PORT", "NET_DEV", "SRC_ADDR", "DST_ADDR", 
"SPACE", "STATE", "QP_NUM"); + } else { + id_priv = list_entry(v, struct rdma_id_private, list); + format_addr((struct sockaddr *)&id_priv->id.route.addr.src_addr, + local_addr); + format_addr((struct sockaddr *)&id_priv->id.route.addr.dst_addr, + remote_addr); + + seq_printf(file, + "%-5s %-8s %-5d %-8s %-52s %-52s %-6s %-15s %-8d \n", + format_node_type(id_priv->id.route.addr.dev_addr.dev_type), + (id_priv->id.device) ? id_priv->id.device->name : "", + id_priv->id.port_num, + (id_priv->id.route.addr.dev_addr.src_dev) ? id_priv->id.route.addr.dev_addr.src_dev->name : "", + local_addr, remote_addr, + format_port_space(id_priv->id.ps), + format_cma_state(id_priv->state), + id_priv->qp_num); + } + return 0; +} + +static const struct seq_operations cma_rdma_id_seq_ops = { + .start = cma_rdma_id_seq_start, + .next = cma_rdma_id_seq_next, + .stop = cma_rdma_id_seq_stop, + .show = cma_rdma_id_seq_show, +}; + +static int cma_rdma_id_open(struct inode *inode, struct file *file) +{ + struct seq_file *seq; + int ret; + + ret = seq_open(file, &cma_rdma_id_seq_ops); + if (ret) + return ret; + + seq = file->private_data; + seq->private = inode->i_private; + + return 0; +} + +static const struct file_operations cma_rdma_id_fops = { + .owner = THIS_MODULE, + .open = cma_rdma_id_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release +}; + +void cma_create_debug_files(struct cma_device *cma_dev) +{ + char name[IB_DEVICE_NAME_MAX + sizeof "_rdma_id"]; + snprintf(name, sizeof name, "%s_rdma_id", cma_dev->device->name); + cma_dev->rdma_id_dentry = debugfs_create_file(name, S_IFREG | S_IRUGO, + cma_root_dentry, cma_dev, &cma_rdma_id_fops); + if (!cma_dev->rdma_id_dentry) + printk(KERN_WARNING "RDMA CMA: failed to create debugfs file %s\n", name); +} + static void cma_add_one(struct ib_device *device) { struct cma_device *cma_dev; @@ -2871,6 +3043,7 @@ static void cma_add_one(struct ib_device *device) list_for_each_entry(id_priv, &listen_any_list, list) 
cma_listen_on_dev(id_priv, cma_dev); mutex_unlock(&lock); + cma_create_debug_files(cma_dev); } static int cma_remove_id_dev(struct rdma_id_private *id_priv) @@ -2905,6 +3078,8 @@ static void cma_process_remove(struct cma_device *cma_dev) int ret; mutex_lock(&lock); + if (cma_dev->rdma_id_dentry) + debugfs_remove(cma_dev->rdma_id_dentry); while (!list_empty(&cma_dev->id_list)) { id_priv = list_entry(cma_dev->id_list.next, struct rdma_id_private, list); @@ -2940,6 +3115,7 @@ static void cma_remove_one(struct ib_device *device) mutex_unlock(&lock); cma_process_remove(cma_dev); + kfree(cma_dev); } @@ -2947,6 +3123,12 @@ static int cma_init(void) { int ret, low, high, remaining; + cma_root_dentry = debugfs_create_dir("rdma_cm", NULL); + if (!cma_root_dentry) { + printk(KERN_ERR "RDMA CMA: failed to create debugfs dir\n"); + return -ENOMEM; + } + get_random_bytes(&next_port, sizeof next_port); inet_get_local_port_range(&low, &high); remaining = (high - low) + 1; @@ -2984,6 +3166,7 @@ static void cma_cleanup(void) idr_destroy(&tcp_ps); idr_destroy(&udp_ps); idr_destroy(&ipoib_ps); + debugfs_remove(cma_root_dentry); } module_init(cma_init); From vlad at lists.openfabrics.org Sun Apr 26 03:27:43 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Sun, 26 Apr 2009 03:27:43 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090426-0200 daily build status Message-ID: <20090426102743.8DE97E61388@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with 
linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From sashak at voltaire.com Sun Apr 26 03:43:57 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 26 Apr 2009 13:43:57 +0300 Subject: [ofa-general] Re: [PATCH v3 1/2] opensm: setup routing engine when in use and delete when fail In-Reply-To: <49C24AB4.9060505@gmail.com> References: <49A6B618.1090300@gmail.com> <49A6B6EB.80700@gmail.com> <20090312132137.GB8818@sashak.voltaire.com> <49B92C27.7060904@gmail.com> <20090312160528.GW8818@sashak.voltaire.com> <49BE0CC4.6030600@gmail.com> 
<20090317133548.GL12557@sashak.voltaire.com> <49C24AB4.9060505@gmail.com> Message-ID: <20090426104357.GA23250@sk> Hi Eli, On 15:37 Thu 19 Mar , Eli Dorfman (Voltaire) wrote: > setup routing engine when in use and delete when fail > > setup routing engine and allocate resources before use. > delete resources when routing algorithm fails. > this will save allocation for routing algorithms that are not used. > > Signed-off-by: Eli Dorfman > --- > opensm/include/opensm/osm_opensm.h | 6 ++++++ > opensm/opensm/osm_opensm.c | 10 ++-------- > opensm/opensm/osm_ucast_mgr.c | 17 +++++++++++++++++ > 3 files changed, 25 insertions(+), 8 deletions(-) > > diff --git a/opensm/include/opensm/osm_opensm.h b/opensm/include/opensm/osm_opensm.h > index c121be4..8d1b276 100644 > --- a/opensm/include/opensm/osm_opensm.h > +++ b/opensm/include/opensm/osm_opensm.h > @@ -109,6 +109,9 @@ typedef enum _osm_routing_engine_type { > } osm_routing_engine_type_t; > /***********/ > > +struct osm_routing_engine; > +struct osm_opensm; > + > /****s* OpenSM: OpenSM/osm_routing_engine > * NAME > * struct osm_routing_engine > @@ -122,6 +125,7 @@ typedef enum _osm_routing_engine_type { > struct osm_routing_engine { > const char *name; > void *context; > + int (*setup) (struct osm_routing_engine *re, struct osm_opensm *p_osm); > int (*build_lid_matrices) (void *context); > int (*ucast_build_fwd_tables) (void *context); > void (*ucast_dump_tables) (void *context); > @@ -523,5 +527,7 @@ extern volatile unsigned int osm_exit_flag; > * Set to one to cause all threads to leave > *********/ > > +void osm_update_routing_engines(osm_opensm_t *osm, const char *engine_names); > + This function is not implemented in this patch. Please move it to related patch. 
> END_C_DECLS > #endif /* _OSM_OPENSM_H_ */ > diff --git a/opensm/opensm/osm_opensm.c b/opensm/opensm/osm_opensm.c > index 50d1349..9739122 100644 > --- a/opensm/opensm/osm_opensm.c > +++ b/opensm/opensm/osm_opensm.c > @@ -169,14 +169,7 @@ static void setup_routing_engine(osm_opensm_t *osm, const char *name) > memset(re, 0, sizeof(struct osm_routing_engine)); > > re->name = m->name; > - if (m->setup(re, osm)) { > - OSM_LOG(&osm->log, OSM_LOG_VERBOSE, > - "setup of routing" > - " engine \'%s\' failed\n", name); > - return; > - } > - OSM_LOG(&osm->log, OSM_LOG_DEBUG, > - "\'%s\' routing engine set up\n", re->name); > + re->setup = m->setup; Ok, only 'setup' callback is initialized here. That is fine. But later in destroy_routing_engines() for all routing engines delete() method is called unconditionally, which obviously should crash OpenSM. It is still be broken IMO. Sasha From dorfman.eli at gmail.com Sun Apr 26 04:01:08 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Sun, 26 Apr 2009 14:01:08 +0300 Subject: [ofa-general] Re: [PATCH v3 1/2] opensm: setup routing engine when in use and delete when fail In-Reply-To: <20090426104357.GA23250@sk> References: <49A6B618.1090300@gmail.com> <49A6B6EB.80700@gmail.com> <20090312132137.GB8818@sashak.voltaire.com> <49B92C27.7060904@gmail.com> <20090312160528.GW8818@sashak.voltaire.com> <49BE0CC4.6030600@gmail.com> <20090317133548.GL12557@sashak.voltaire.com> <49C24AB4.9060505@gmail.com> <20090426104357.GA23250@sk> Message-ID: <49F43EF4.5070305@gmail.com> Sasha Khapyorsky wrote: > Hi Eli, > > On 15:37 Thu 19 Mar , Eli Dorfman (Voltaire) wrote: >> setup routing engine when in use and delete when fail >> >> setup routing engine and allocate resources before use. >> delete resources when routing algorithm fails. >> this will save allocation for routing algorithms that are not used. 
>> >> Signed-off-by: Eli Dorfman >> --- >> opensm/include/opensm/osm_opensm.h | 6 ++++++ >> opensm/opensm/osm_opensm.c | 10 ++-------- >> opensm/opensm/osm_ucast_mgr.c | 17 +++++++++++++++++ >> 3 files changed, 25 insertions(+), 8 deletions(-) >> >> diff --git a/opensm/include/opensm/osm_opensm.h b/opensm/include/opensm/osm_opensm.h >> index c121be4..8d1b276 100644 >> --- a/opensm/include/opensm/osm_opensm.h >> +++ b/opensm/include/opensm/osm_opensm.h >> @@ -109,6 +109,9 @@ typedef enum _osm_routing_engine_type { >> } osm_routing_engine_type_t; >> /***********/ >> >> +struct osm_routing_engine; >> +struct osm_opensm; >> + >> /****s* OpenSM: OpenSM/osm_routing_engine >> * NAME >> * struct osm_routing_engine >> @@ -122,6 +125,7 @@ typedef enum _osm_routing_engine_type { >> struct osm_routing_engine { >> const char *name; >> void *context; >> + int (*setup) (struct osm_routing_engine *re, struct osm_opensm *p_osm); >> int (*build_lid_matrices) (void *context); >> int (*ucast_build_fwd_tables) (void *context); >> void (*ucast_dump_tables) (void *context); >> @@ -523,5 +527,7 @@ extern volatile unsigned int osm_exit_flag; >> * Set to one to cause all threads to leave >> *********/ >> >> +void osm_update_routing_engines(osm_opensm_t *osm, const char *engine_names); >> + > > This function is not implemented in this patch. Please move it to > related patch. 
> >> END_C_DECLS >> #endif /* _OSM_OPENSM_H_ */ >> diff --git a/opensm/opensm/osm_opensm.c b/opensm/opensm/osm_opensm.c >> index 50d1349..9739122 100644 >> --- a/opensm/opensm/osm_opensm.c >> +++ b/opensm/opensm/osm_opensm.c >> @@ -169,14 +169,7 @@ static void setup_routing_engine(osm_opensm_t *osm, const char *name) >> memset(re, 0, sizeof(struct osm_routing_engine)); >> >> re->name = m->name; >> - if (m->setup(re, osm)) { >> - OSM_LOG(&osm->log, OSM_LOG_VERBOSE, >> - "setup of routing" >> - " engine \'%s\' failed\n", name); >> - return; >> - } >> - OSM_LOG(&osm->log, OSM_LOG_DEBUG, >> - "\'%s\' routing engine set up\n", re->name); >> + re->setup = m->setup; > > Ok, only 'setup' callback is initialized here. That is fine. But later > in destroy_routing_engines() for all routing engines delete() method is > called unconditionally, which obviously should crash OpenSM. > > It is still be broken IMO. No. routing_engine->setup is called by osm_ucast_mgr_process() and if routing algorithm fails then delete is called (and is already set). I already tested this patch and opensm does not crash. Eli From jackm at dev.mellanox.co.il Sun Apr 26 04:31:18 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Sun, 26 Apr 2009 14:31:18 +0300 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules? In-Reply-To: <20090423234839.GE4431@obsidianresearch.com> References: <20090423234839.GE4431@obsidianresearch.com> Message-ID: <200904261431.18620.jackm@dev.mellanox.co.il> On Friday 24 April 2009 02:48, Jason Gunthorpe wrote: > AFAIK, Ubuntu does not do any work on their IB drivers, so the driver > is stock 2.6.27. > > In principle OFED is supposed to start with an upstream kernel and > backport those drivers to various distributions. OFED 1.3 was using > 2.6.24, OFED 1.4 is apparently using 2.6.27. 
> > So it should be similar to OFED 1.4 > > Though bear in mind OFED still patches things with stuff that is not > yet accepted upstream so there will be some differences. > > It should be compatible with the OFED 1.4 userspace. > Beware -- you should not use OFED userspace with a non-ofed kernel for ConnectX HCAs. The OFED 1.4 ConnectX driver includes the XRC (Extended RC) patches -- which grab 23 (the MSB) of the QP number to indicate an XRC SRQ in CQEs. Non-OFED kernels do not reserve bit 23 for this usage, so you will experience incompatibility problems. In general, you should not use OFED userspace libraries with non-OFED kernel distributions. - Jack From sashak at voltaire.com Sun Apr 26 04:27:31 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 26 Apr 2009 14:27:31 +0300 Subject: [ofa-general] Re: [PATCH v3 1/2] opensm: setup routing engine when in use and delete when fail In-Reply-To: <49F43EF4.5070305@gmail.com> References: <49A6B618.1090300@gmail.com> <49A6B6EB.80700@gmail.com> <20090312132137.GB8818@sashak.voltaire.com> <49B92C27.7060904@gmail.com> <20090312160528.GW8818@sashak.voltaire.com> <49BE0CC4.6030600@gmail.com> <20090317133548.GL12557@sashak.voltaire.com> <49C24AB4.9060505@gmail.com> <20090426104357.GA23250@sk> <49F43EF4.5070305@gmail.com> Message-ID: <20090426112731.GB23250@sk> On 14:01 Sun 26 Apr , Eli Dorfman (Voltaire) wrote: > > No. > routing_engine->setup is called by osm_ucast_mgr_process() > and if routing algorithm fails then delete is called (and is already set). Right. And after all on OpenSM exit destroy_routing_engines() is called where all routing engines (including already destroyed and yet not created) are destroyed using delete() method unconditionally. 
Sasha From dorfman.eli at gmail.com Sun Apr 26 04:48:41 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Sun, 26 Apr 2009 14:48:41 +0300 Subject: [ofa-general] Re: [PATCH v3 1/2] opensm: setup routing engine when in use and delete when fail In-Reply-To: <20090426112731.GB23250@sk> References: <49A6B618.1090300@gmail.com> <49A6B6EB.80700@gmail.com> <20090312132137.GB8818@sashak.voltaire.com> <49B92C27.7060904@gmail.com> <20090312160528.GW8818@sashak.voltaire.com> <49BE0CC4.6030600@gmail.com> <20090317133548.GL12557@sashak.voltaire.com> <49C24AB4.9060505@gmail.com> <20090426104357.GA23250@sk> <49F43EF4.5070305@gmail.com> <20090426112731.GB23250@sk> Message-ID: <49F44A19.1060407@gmail.com> Sasha Khapyorsky wrote: > On 14:01 Sun 26 Apr , Eli Dorfman (Voltaire) wrote: >> No. >> routing_engine->setup is called by osm_ucast_mgr_process() >> and if routing algorithm fails then delete is called (and is already set). > > Right. And after all on OpenSM exit destroy_routing_engines() is called > where all routing engines (including already destroyed and yet not > created) are destroyed using delete() method unconditionally. > delete() is called conditionally from destroy_routing_engines(osm_opensm_t *osm) if (r->delete) r->delete(r->context); Also all re(s) are cleared when created so delete is NULL if setup() was not called. So I don't see any problem here. Eli From bart.vanassche at gmail.com Sun Apr 26 05:25:44 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Sun, 26 Apr 2009 14:25:44 +0200 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules? 
In-Reply-To: <200904261431.18620.jackm@dev.mellanox.co.il> References: <20090423234839.GE4431@obsidianresearch.com> <200904261431.18620.jackm@dev.mellanox.co.il> Message-ID: On Sun, Apr 26, 2009 at 1:31 PM, Jack Morgenstein wrote: > On Friday 24 April 2009 02:48, Jason Gunthorpe wrote: >> AFAIK, Ubuntu does not do any work on their IB drivers, so the driver >> is stock 2.6.27. >> >> In principle OFED is supposed to start with an upstream kernel and >> backport those drivers to various distributions. OFED 1.3 was using >> 2.6.24, OFED 1.4 is apparently using 2.6.27. >> >> So it should be similar to OFED 1.4 >> >> Though bear in mind OFED still patches things with stuff that is not >> yet accepted upstream so there will be some differences. >> >> It should be compatible with the OFED 1.4 userspace. >> > Beware -- you should not use OFED userspace with a non-ofed kernel for ConnectX HCAs. > The OFED 1.4 ConnectX driver includes the XRC (Extended RC) patches -- which grab > 23 (the MSB) of the QP number to indicate an XRC SRQ in CQEs.  Non-OFED kernels do not > reserve bit 23 for this usage, so you will experience incompatibility problems. > > In general, you should not use OFED userspace libraries with non-OFED kernel distributions. For mainstream Linux kernel development it is considered crucial to keep the interface between user space and kernel backwards compatible. Does the above imply that the OFED kernel components do not follow this rule ? Bart. 
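[Illustration of the bit 23 point quoted above] A minimal sketch, assuming only what Jack's message states: QP numbers are 24 bits wide, and the OFED 1.4 XRC patches use bit 23 of the QP number field in a CQE as an "XRC SRQ" flag. The macro and function names below are hypothetical and are not taken from the mlx4 driver or libmlx4; the sketch only shows why two sides that disagree on the meaning of bit 23 can decode the same field differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names, for illustration only (not from mlx4 or libmlx4). */
#define QPN_FIELD_MASK 0x00ffffffu /* QP numbers are 24 bits wide */
#define XRC_SRQ_FLAG   (1u << 23)  /* MSB of the 24-bit field */

/* Stock interpretation: all 24 bits of the field are the QP number. */
static uint32_t qpn_plain(uint32_t cqe_qpn_field)
{
	assert((cqe_qpn_field & ~QPN_FIELD_MASK) == 0);
	return cqe_qpn_field;
}

/* XRC-aware interpretation: bit 23 is a flag, not part of the number. */
static int field_is_xrc_srq(uint32_t cqe_qpn_field)
{
	assert((cqe_qpn_field & ~QPN_FIELD_MASK) == 0);
	return (cqe_qpn_field & XRC_SRQ_FLAG) != 0;
}

/* XRC-aware interpretation: only the low 23 bits identify the queue. */
static uint32_t qpn_xrc_aware(uint32_t cqe_qpn_field)
{
	assert((cqe_qpn_field & ~QPN_FIELD_MASK) == 0);
	return cqe_qpn_field & ~XRC_SRQ_FLAG;
}
```

Under these assumptions, a side that does not reserve bit 23 may hand out a QP number such as 0x800042; the XRC-aware side would then read the same field as "XRC SRQ 0x42". As long as bit 23 is clear, both interpretations agree.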
From hnrose at comcast.net Sun Apr 26 05:30:09 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Sun, 26 Apr 2009 08:30:09 -0400 Subject: [ofa-general] [PATCH] opensm/PerfMgr: Change redir_tbl_size to num_ports for better clarity Message-ID: <20090426123009.GA25119@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/include/opensm/osm_perfmgr.h b/opensm/include/opensm/osm_perfmgr.h index 45dec54..16a59ef 100644 --- a/opensm/include/opensm/osm_perfmgr.h +++ b/opensm/include/opensm/osm_perfmgr.h @@ -103,7 +103,7 @@ typedef struct _monitored_node { uint64_t guid; boolean_t esp0; char *name; - uint32_t redir_tbl_size; + uint32_t num_ports; redir_t redir_port[1]; /* redirection on a per port basis */ } __monitored_node_t; diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c index 8d5ed97..7c24819 100644 --- a/opensm/opensm/osm_perfmgr.c +++ b/opensm/opensm/osm_perfmgr.c @@ -218,13 +218,13 @@ static void perfmgr_mad_send_err_callback(void *bind_context, /* First, find the node in the monitored map */ cl_plock_acquire(pm->lock); /* Now, validate port number */ - if (port >= p_mon_node->redir_tbl_size) { + if (port >= p_mon_node->num_ports) { cl_plock_release(pm->lock); OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C16: " "Invalid port num %u for %s (GUID 0x%016" PRIx64 ") num ports %u\n", port, p_mon_node->name, p_mon_node->guid, - p_mon_node->redir_tbl_size); + p_mon_node->num_ports); goto Exit; } /* Clear redirection info */ @@ -309,8 +309,7 @@ static ib_net32_t get_qp(__monitored_node_t * mon_node, uint8_t port) { ib_net32_t qp = cl_ntoh32(1); - if (mon_node && mon_node->redir_tbl_size && - port < mon_node->redir_tbl_size && + if (mon_node && mon_node->num_ports && port < mon_node->num_ports && mon_node->redir_port[port].redir_lid && mon_node->redir_port[port].redir_qp) qp = mon_node->redir_port[port].redir_qp; @@ -325,8 +324,7 @@ static ib_net32_t get_qp(__monitored_node_t * mon_node, uint8_t port) static ib_net16_t get_lid(osm_node_t * 
p_node, uint8_t port, __monitored_node_t * mon_node) { - if (mon_node && mon_node->redir_tbl_size && - port < mon_node->redir_tbl_size && + if (mon_node && mon_node->num_ports && port < mon_node->num_ports && mon_node->redir_port[port].redir_lid) return mon_node->redir_port[port].redir_lid; @@ -422,15 +420,16 @@ static void __collect_guids(cl_map_item_t * p_map_item, void *context) uint64_t node_guid = cl_ntoh64(node->node_info.node_guid); osm_perfmgr_t *pm = (osm_perfmgr_t *) context; __monitored_node_t *mon_node = NULL; - uint32_t size; + uint32_t num_ports; OSM_LOG_ENTER(pm->log); if (cl_qmap_get(&pm->monitored_map, node_guid) == cl_qmap_end(&pm->monitored_map)) { /* if not already in our map add it */ - size = osm_node_get_num_physp(node); - mon_node = malloc(sizeof(*mon_node) + sizeof(redir_t) * size); + num_ports = osm_node_get_num_physp(node); + mon_node = malloc(sizeof(*mon_node) + + sizeof(redir_t) * num_ports); if (!mon_node) { OSM_LOG(pm->log, OSM_LOG_ERROR, "PerfMgr: ERR 4C06: " "malloc failed: not handling node %s" @@ -438,10 +437,11 @@ static void __collect_guids(cl_map_item_t * p_map_item, void *context) node_guid); goto Exit; } - memset(mon_node, 0, sizeof(*mon_node) + sizeof(redir_t) * size); + memset(mon_node, 0, + sizeof(*mon_node) + sizeof(redir_t) * num_ports); mon_node->guid = node_guid; mon_node->name = strdup(node->print_desc); - mon_node->redir_tbl_size = size; + mon_node->num_ports = num_ports; /* check for enhanced switch port 0 */ mon_node->esp0 = (node->sw && ib_switch_info_is_enhanced_port0(&node->sw-> @@ -1119,12 +1119,12 @@ static void pc_rcv_process(void *context, void *data) /* LID redirection support (easier than GID redirection) */ cl_plock_acquire(pm->lock); /* Now, validate port number */ - if (port >= p_mon_node->redir_tbl_size) { + if (port >= p_mon_node->num_ports) { cl_plock_release(pm->lock); OSM_LOG(pm->log, OSM_LOG_ERROR, "ERR 4C13: " "Invalid port num %d for GUID 0x%016" PRIx64 " num ports %d\n", port, node_guid, - 
p_mon_node->redir_tbl_size); + p_mon_node->num_ports); goto Exit; } p_mon_node->redir_port[port].redir_lid = cpi->redir_lid; From sashak at voltaire.com Sun Apr 26 05:58:27 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 26 Apr 2009 15:58:27 +0300 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules? In-Reply-To: <200904261431.18620.jackm@dev.mellanox.co.il> References: <20090423234839.GE4431@obsidianresearch.com> <200904261431.18620.jackm@dev.mellanox.co.il> Message-ID: <20090426125827.GA6513@sk> On 14:31 Sun 26 Apr , Jack Morgenstein wrote: > > > > It should be compatible with the OFED 1.4 userspace. > > > Beware -- you should not use OFED userspace with a non-ofed kernel for ConnectX HCAs. I don't think this affects all userspace packages (personally I never use OFED kernels). > In general, you should not use OFED userspace libraries with non-OFED kernel distributions. In general, such a requirement seems fundamentally bad to me. OFED's goal is to provide support for IB and iWARP, not to develop its own Linux kernel.
Sasha From sashak at voltaire.com Sun Apr 26 06:14:15 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 26 Apr 2009 16:14:15 +0300 Subject: [ofa-general] Re: [PATCH v3 1/2] opensm: setup routing engine when in use and delete when fail In-Reply-To: <49F44A19.1060407@gmail.com> References: <20090312132137.GB8818@sashak.voltaire.com> <49B92C27.7060904@gmail.com> <20090312160528.GW8818@sashak.voltaire.com> <49BE0CC4.6030600@gmail.com> <20090317133548.GL12557@sashak.voltaire.com> <49C24AB4.9060505@gmail.com> <20090426104357.GA23250@sk> <49F43EF4.5070305@gmail.com> <20090426112731.GB23250@sk> <49F44A19.1060407@gmail.com> Message-ID: <20090426131415.GC6513@sk> On 14:48 Sun 26 Apr , Eli Dorfman (Voltaire) wrote: > > delete() is called conditionally from destroy_routing_engines(osm_opensm_t *osm) > > if (r->delete) > r->delete(r->context); > > Also all re(s) are cleared when created so delete is NULL if setup() was not called. Ok, you are partially right (I forgot that delete() is initialized only in setup() phase), but what will happen with RE where setup() and delete() were already called? Assumption that delete() will clear RE again is wrong - it just destroys its internal data. > So I don't see any problem here. Try with LASH or UPDN.... (ftree will not fail just occasionally - it returns when context is NULL). Sasha From sashak at voltaire.com Sun Apr 26 07:17:40 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 26 Apr 2009 17:17:40 +0300 Subject: [ofa-general] Re: [PATCH] libibmad: Add decode support for SwitchInfo OptimizedSLtoVLMappingProgramming In-Reply-To: <20090414135419.GA27549@comcast.net> References: <20090414135419.GA27549@comcast.net> Message-ID: <20090426141740.GE6513@sk> On 09:54 Tue 14 Apr , Hal Rosenstock wrote: > > Signed-off-by: Hal Rosenstock Applied. Thanks. 
Sasha From sashak at voltaire.com Sun Apr 26 07:19:25 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 26 Apr 2009 17:19:25 +0300 Subject: [ofa-general] Re: [PATCH] libibmad: bump library interface version In-Reply-To: <20090311144403.5524d85c.weiny2@llnl.gov> References: <20090311144403.5524d85c.weiny2@llnl.gov> Message-ID: <20090426141925.GF6513@sk> On 14:44 Wed 11 Mar , Ira Weiny wrote: > > From: Ira Weiny > Date: Wed, 11 Mar 2009 10:44:28 -0700 > Subject: [PATCH] libibmad: bump library interface version > > There has been enough interface changes to warrant a new version. > > Signed-off-by: Ira Weiny Applied. Thanks. Sasha From kliteyn at dev.mellanox.co.il Sun Apr 26 07:44:10 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Sun, 26 Apr 2009 17:44:10 +0300 Subject: [ofa-general] [PATCH] ibutils: git-log calls have been changed to git log as git-xxx syntax is not working with latest git releases In-Reply-To: <20090422052845.GA13093@obsidianresearch.com> References: <49ED7AAE.3010707@ext.bull.net> <20090422052845.GA13093@obsidianresearch.com> Message-ID: <49F4733A.3090603@dev.mellanox.co.il> Jason, Nicolas, Thanks a lot. Jason Gunthorpe wrote: > On Tue, Apr 21, 2009 at 09:50:06AM +0200, Nicolas Morey-Chaisemartin wrote: > >> Signed-off-by: Nicolas Morey-Chaisemartin >> >> ibdiag/src/Makefile.am | 2 +- >> ibdm/ibdm/Makefile.am | 2 +- >> ibis/src/Makefile.am | 2 +- >> ibmgtsim/src/Makefile.am | 2 +- >> 4 files changed, 4 insertions(+), 4 deletions(-) >> >> diff --git a/ibdiag/src/Makefile.am b/ibdiag/src/Makefile.am >> index def8b0a..7158bbd 100644 >> +++ b/ibdiag/src/Makefile.am >> @@ -42,7 +42,7 @@ GIT=$(shell which git) >> >> git_version.tcl : @MAINTAINER_MODE_TRUE@ FORCE >> if test x$(GIT) != x ; then \ >> - gitver=`cd $(srcdir) ; git-log | head -1 | cut -f2 -d\ `; \ >> + gitver=`cd $(srcdir) ; git log | head -1 | cut -f2 -d\ `; \ >> changes=`cd $(srcdir) ; git diff . 
| grep ^diff | wc -l`; \ > > Gah, that is an awful choice of command for this purpose anyhow. All > of those should just be: > > git rev-parse --verify HEAD I'm using this suggestion. Something like this: diff --git a/ibdiag/src/Makefile.am b/ibdiag/src/Makefile.am index def8b0a..d32d914 100644 --- a/ibdiag/src/Makefile.am +++ b/ibdiag/src/Makefile.am @@ -42,7 +42,7 @@ GIT=$(shell which git) git_version.tcl : @MAINTAINER_MODE_TRUE@ FORCE if test x$(GIT) != x ; then \ - gitver=`cd $(srcdir) ; git-log | head -1 | cut -f2 -d\ `; \ + gitver=`cd $(srcdir) ; rev-parse --verify HEAD`; \ changes=`cd $(srcdir) ; git diff . | grep ^diff | wc -l`; \ else \ gitver=undefined; changes=0; \ diff --git a/ibdm/ibdm/Makefile.am b/ibdm/ibdm/Makefile.am index b0958fc..1c57b3b 100644 --- a/ibdm/ibdm/Makefile.am +++ b/ibdm/ibdm/Makefile.am @@ -96,7 +96,7 @@ GIT=$(shell which git) $(srcdir)/git_version.h: @MAINTAINER_MODE_TRUE@ FORCE if test x$(GIT) != x ; then \ - gitver=`cd $(srcdir) ; git-log | head -1 | cut -f2 -d\ `; \ + gitver=`cd $(srcdir) ; git rev-parse --verify HEAD`; \ changes=`cd $(srcdir) ; git diff . | grep ^diff | wc -l`; \ else \ gitver=undefined; changes=0; \ diff --git a/ibis/src/Makefile.am b/ibis/src/Makefile.am index 7f415f0..b535297 100644 --- a/ibis/src/Makefile.am +++ b/ibis/src/Makefile.am @@ -98,7 +98,7 @@ GIT=$(shell which git) $(srcdir)/git_version.h: @MAINTAINER_MODE_TRUE@ FORCE if test x$(GIT) != x ; then \ - gitver=`cd $(srcdir) ; git-log | head -1 | cut -f2 -d\ `; \ + gitver=`cd $(srcdir) ; git rev-parse --verify HEAD`; \ changes=`cd $(srcdir) ; git diff . 
| grep ^diff | wc -l`; \ else \ gitver=undefined; changes=0; \ diff --git a/ibmgtsim/src/Makefile.am b/ibmgtsim/src/Makefile.am index 6585a11..f23f2d8 100644 --- a/ibmgtsim/src/Makefile.am +++ b/ibmgtsim/src/Makefile.am @@ -95,7 +95,7 @@ GIT=$(shell which git) $(srcdir)/git_version.h: @MAINTAINER_MODE_TRUE@ FORCE if test x$(GIT) != x ; then \ - gitver=`cd $(srcdir) ; git-log | head -1 | cut -f2 -d\ `; \ + gitver=`cd $(srcdir) ; git rev-parse --verify HEAD`; \ changes=`cd $(srcdir) ; git diff . | grep ^diff | wc -l`; \ else \ gitver=undefined; changes=0; \ Will push it shortly. -- Yevgeny > Which gives the same output, dramatically faster. > > Jason From sashak at voltaire.com Sun Apr 26 07:42:55 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 26 Apr 2009 17:42:55 +0300 Subject: [ofa-general] [ANNOUNCE] management tarballs release Message-ID: <20090426144255.GG6513@sk> Hi, There is a new release of the management (OpenSM and infiniband diagnostics) tarballs available in: http://www.openfabrics.org/downloads/management/ md5sum: 8d88ad53f0adeb9a5be24754d7c3058c libibumad-1.3.2.tar.gz b3af39af187b0d7da13f15fd2fc7987d libibmad-1.3.2.tar.gz 8925a54defa3e87573a6d127b8790f7f opensm-3.3.2.tar.gz e10289e4b428abfc1001d65eb34cfa9c infiniband-diags-1.5.2.tar.gz All component versions are from recent master branch. Full change log is below. Sasha Dale Purdy (1): opensm: Implement weighted routing David A. 
McMillen (1): infiniband-diags/ibnetdiscover: Apply --node-name-map remapping to names printed by --ports option Eli Dorfman (Voltaire) (5): send trap144 when local priority is higher than master priority fix local port smlid opensm: set IS_SM bit during opensm init opensm/osm_opensm.c: add newline to log message ib_types.h: fix commit 103891092f5f6f0b2cf56555e19fdf008f164c41 Hal Rosenstock (71): opensm/osm_perfmgr.c: In osm_perfmgr_shutdown, add missing cl_disp_unregister opensm/osm_perfmgr.c: Improve assert in osm_pc_rcv_process opensm: Return error status when cl_disp_register fails opensm/ib_types.h: Add attribute ID for PortCountersExtended libibumad/umad.c: Cosmetic changes opensm/osm_port.h: Fix a commentary typo opensm/osm_state_mgr.c: Cosmetic commentary change opensm/osm_trap_rcv.c: Cosmetic changes opensm/libvendor/osm_vendor_ibumad.c: Commentary changes opensm/libvendor/osm_vendor_ibumad.c: In clear_madw, fix tid endian in message opensm/osm_req.c: In osm_send_trap144, set producer type according to node type opensm/infiniband-diags: Changes for C rather than C++ style comments opensm/PerfMgr: A few more esp0 changes opensm/osm_req.c: Shouldn't reveal port's MKey on Trap method opensm/osm_inform.c: In __osm_send_report, make sure p_report_madw valid before using Add pkey table support to osm_get_all_port_attr opensm: Handle trap repress on trap 144 generation libvendor/osm_vendor_mlx_dispatcher.c: Eliminate no longer needed osmv_mad_is_response libibmad/register.c: Cosmetic formatting change opensm/osm_trap_rcv.c: Remove extraneous comment infiniband-diags/perfquery.8: Update man page for PortXmit/RcvDataSL infiniband-diags/perfquery.c: Fix some memory leaks on exit libibmad/mad.h: Cosmetic formatting changes infiniband-diags/vendstat.c: Add missing mad_rpc_close_port call infiniband-diags/perfquery.8: Extended counters are now -x rather than -e opensm/osm_req.c: Update log message based on commit 3551389dcb7353ffd51c66e6ad518648bc1dd19e 
opensm/osm_req.c: Update send_trap144() log message libibmad/libibmad.map: Eliminate perf_classportinfo_query_via opensm: Add common ib_gid_is_notzero routine opensm: Utilize ib_gid_is_notzero routine infiniband-diags/perfquery.c: Label PortXmit/RcvDataSL counters in headings libibmad/sa.c: No need to specify NumbPath field in Get request of SA PathRecord infiniband-diags/perfquery.8: Fix typo in short option for PortXmitDataSL counters opensm/include/ib_types.h: Fix some typos opensm: Some cosmetic changes opensm: Remove __osm_ prefixes libibmad/sa.c: Cosmetic formatting changes libibmad: Add PortSelect and CounterSelect fields for PortXmit/RcvDataSL libibmad/dump.c: Cosmetic formatting changes opensm/iba/ib_types.h: Add PortXmit/RcvDataSL PerfMgt attributes opensm/osm_sminfo_rcv.c: Minor simplification opensm/osm_sm_state_mgr.c: Remove unneeded return statement opensm: Remove some __ prefixes infiniband-diags/vendstat: Update man page and examples for PortXmit/RcvDataSL counter support opensm/partition-config.txt: Update for defmember feature opensm/include/ib_types.h: Add ib_switch_info_get_state_opt_sl2vlmapping routine opensm/osm_qos.c: Cosmetic formatting changes opensm/osm_qos.c: Cosmetic formatting changes opensm/osm_slvl_map_rcv.c: Cosmetic formatting changes opensm/osm_link_mgr.c: Remove extraneous parentheses libibmad/fields.c: Display CounterSelect2 in hex rather than decimal opensm/osm_ucast_mgr.c: Cosmetic formatting change libibmad/rpc.c: wrap ERRS macro. 
opensm/osm_helper.c: Add more info for traps 144 and 256-259 in osm_dump_notice infiniband-diags/ibsendtrap.c: Local link integrity is an "urgent" trap opensm/osm_sa.c: Cosmetic change to a few log messages opensm: Add Dell to known vendor list opensm: Improve some snprintf uses opensm/iba/ib_types.h: Add MaxCreditHint and LinkRoundTripLatency to PortInfo attribute opensm/osm_helper.c: Add support for MaxCreditHint and LinkRoundTripLatency to osm_dump_port_info infiniband-diags/man/vendstat.8: Fix PortXmit/RcvDataSL examples infiniband-diags/ibsendtrap.c: Set producer type according to node type opensm/osm_helper.c: Convert remaining helper routines for GID printing format opensm: Some cosmetic formatting changes infiniband-diags/man/vendstat.8: Indicate IS4 config group config not persistent across IS4 reboot opensm/osm_perfmgr.c: Eliminate duplicated error number opensm/man/opensm.8.in: Add mention of backing documentation for QoS policy file and performance manager opensm/include/opensm/osm_pkey.h: Fix commentary typo opensm/osm_pkey_mgr.c: Fix pkey endian in log message opensm/doc/performance-manager-HOWTO.txt: Indicate (previously implied) master state libibmad: Add decode support for SwitchInfo OptimizedSLtoVLMappingProgramming Ira Weiny (14): libibmad: Clean up "new" interface infiniband-diags: Convert ibaddr to "new" ibmad interface infiniband-diags: Convert ibportstate to "new" ibmad interface infiniband-diags: Convert ibroute to "new" ibmad interface infiniband-diags: Convert ibsendtrap to "new" ibmad interface infiniband-diags: Convert ibtracert to "new" ibmad interface infiniband-diags: Convert ibsysstat to "new" ibmad interface infiniband-diags: Convert mcm_rereg_test to "new" ibmad interface infiniband-diags: Convert perfquery, saquery, sminfo, smpquery, and vendstat to "new" ibmad interface infiniband-diags: convert ibnetdiscover to "new" ibmad interface Fix further bugs around console closure and clean up code. 
Fix ibidsverify.pl to use the correct cache file change missed LID conversion functions from hex to uint libibmad: bump library interface version Itai Baz (3): ib_types.h: Adding BKEY violation trap (259) libibmad/serv.c: Fixed respond function to return proper result code libibmad: added support for handling of BM (Baseboard management) MADs - FIXED without rmpp Julia Volynsky (1): Added send trap for trap 129 (local link integrity) Line.Holen at Sun.COM (3): opensm/osm_link_mgr.c initialize SMSL opensm/osm_link_mgr.c Remove __osm_ prefix opensm/osm_link_mgr.c: indentation fixes Nicolas Morey Chaisemartin (9): opensm: Added io_guid_file and max_reverse_hops options opensm/osm_ucast_ftree.c: Added possible reverse hops for Ftree algorithm. Added documentation for io_guid_file and max_reverse_hop feature opensm/osm_ucast_ftree.c: Removed useless initialisation on switch indexes opensm/osm_switch.h : Fixed wrong comment about return value of osm_switch_set_hops opensm/console: Fixed osm_console poll to handle POLLHUP opensm/osm_console_io.h: Modify osm_console_exit so only the connection is killed, not the socket Fixed cio_close use when ENABLE_OSM_CONSOLE_SOCKET is not set opensm/osm_ucat_ftree.c Enhance min hops counters usage Or Gerlitz (7): generic libibmad perf query/reset api libimad implementation of PortXmtDataSL and PortRcvDataSL perfquery PortXmtDataSL/PortRcvDataSL support infiniband-diags: update configure.in check for libibmad API fix offset used for parsing of XmtDataSL & RcvDataSL infiniband-diags/perfquery: add srcport param ib-diag/vendstat: counter-group-info & config-counter-group vendor mads Sasha Khapyorsky (38): opensm/main.c: cosmetic opensm: indentation fixes opensm/osm_console.c: kill warning: defined but not used opensm/osm_ucast_mgr.c: code simplifications opensm/osm_lid_mgr: use single array for used_lids opensm: initialize all switch ports opensm: remove unneeded anymore physp initializations opensm: PortInfo requests for discovered 
switches opensm: remove casting of ib_smp_get_payload_ptr() osmtest: remove useless prototypes opensm/osm_console_io.c: remove 'osm_' prefix from static function names opensm: fix build warning with --disable-console-socket libibmad: cleanup deprecated function use opensm/console: move cio_open() function opensm/osm_console_io.c: move cio_close() function libibmad/rpc: fix class registration bug libibmad: per port timeout and retires setup libibmad: add mad_rpc_class_agent() call libibmad: deprecate old API calls libibmad/rpc: fix _do_madrpc() parameter value opensm/osm_ucast_ftree: indentation fixes opensm: some init functions simplification opensm/osm_sa_link_record.c: improve get_base_lid() opensm: kill __osm_ prefixes in static functions opensm/osm_sw_info_rcv.c: consolidate flows infiniband-diags/vendstat.c: code moving opensm: fix indentations with osm_indent. opensm: clean OSM_CDECL macro infiniband-diags/ibsendtrap: code consolidation opensm/osm_helper.c: return then log is inactive opensm/osm_helper.c: consolidate dr path printing code opensm/osm_helper.c: use single buffer in osm_dump_dr_smp() infiniband-diags/perltidy.sh: option to format a single file ibdiag_common: remove duplicated ibdebug variable infiniabnd-diags/bm.c: check mad_rpc() status libibmad: indentation fixes opensm: Release Notes update management: bump package versions Yevgeny Kliteynik (5): opensm/osm_subnet.c: fixing compiler warnings opensm/osm_ucast_ftree.c: remove __osm_ftree prefix in static functions opensm/osm_ucast_ftree.c: some refactoring opensm/osm_ucast_ftree.c: fixing bug in indexing opensm/osm_ucast_ftree.c: lids are always handled in host order From sashak at voltaire.com Sun Apr 26 07:58:40 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Sun, 26 Apr 2009 17:58:40 +0300 Subject: [ofa-general] [PATCH] ibutils: git-log calls have been changed to git log as git-xxx syntax is not working with latest git releases In-Reply-To: <49F4733A.3090603@dev.mellanox.co.il> 
References: <49ED7AAE.3010707@ext.bull.net> <20090422052845.GA13093@obsidianresearch.com> <49F4733A.3090603@dev.mellanox.co.il> Message-ID: <20090426145840.GH6513@sk> On 17:44 Sun 26 Apr , Yevgeny Kliteynik wrote: > > git_version.tcl : @MAINTAINER_MODE_TRUE@ FORCE > if test x$(GIT) != x ; then \ > - gitver=`cd $(srcdir) ; git-log | head -1 | cut -f2 -d\ `; \ > + gitver=`cd $(srcdir) ; rev-parse --verify HEAD`; \ > changes=`cd $(srcdir) ; git diff . | grep ^diff | wc -l`; \ BTW here you are diffing against local index and not against git tree in repository. So if somebody does changes and run 'git add changed_file' you will not see any differences using just 'git diff .'. Actually return status of: git diff --quiet HEAD , or git diff-index --quiet HEAD should be enough for local changes (any kind) detection. Sasha From bart.vanassche at gmail.com Sun Apr 26 08:26:34 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Sun, 26 Apr 2009 17:26:34 +0200 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules? In-Reply-To: <200904261431.18620.jackm@dev.mellanox.co.il> References: <20090423234839.GE4431@obsidianresearch.com> <200904261431.18620.jackm@dev.mellanox.co.il> Message-ID: On Sun, Apr 26, 2009 at 1:31 PM, Jack Morgenstein wrote: > In general, you should not use OFED userspace libraries with non-OFED kernel distributions. But that's exactly what most Linux distributions do. While I have only verified for Ubuntu 8.10 that it combines the mainstream kernel with OFED userspace, this is probably what all non-enterprise Linux distributions do. Bart. From jgunthorpe at obsidianresearch.com Sun Apr 26 11:01:57 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Sun, 26 Apr 2009 12:01:57 -0600 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules? 
In-Reply-To: <200904261431.18620.jackm@dev.mellanox.co.il> References: <20090423234839.GE4431@obsidianresearch.com> <200904261431.18620.jackm@dev.mellanox.co.il> Message-ID: <20090426180157.GA29727@obsidianresearch.com> On Sun, Apr 26, 2009 at 02:31:18PM +0300, Jack Morgenstein wrote: > On Friday 24 April 2009 02:48, Jason Gunthorpe wrote: > > AFAIK, Ubuntu does not do any work on their IB drivers, so the driver > > is stock 2.6.27. > > > > In principle OFED is supposed to start with an upstream kernel and > > backport those drivers to various distributions. OFED 1.3 was using > > 2.6.24, OFED 1.4 is apparently using 2.6.27. > > > > So it should be similar to OFED 1.4 > > > > Though bear in mind OFED still patches things with stuff that is not > > yet accepted upstream so there will be some differences. > > > > It should be compatible with the OFED 1.4 userspace. > > > Beware -- you should not use OFED userspace with a non-ofed kernel for ConnectX HCAs. > The OFED 1.4 ConnectX driver includes the XRC (Extended RC) patches -- which grab > bit 23 (the MSB) of the QP number to indicate an XRC SRQ in CQEs. Non-OFED kernels do not > reserve bit 23 for this usage, so you will experience incompatibility problems. > > In general, you should not use OFED userspace libraries with non-OFED kernel distributions. That is hugely unfriendly and not really 'the linux way'..
Jason From devel-ofed at morey-chaisemartin.com Sun Apr 26 12:31:22 2009 From: devel-ofed at morey-chaisemartin.com (Nicolas Morey-Chaisemartin) Date: Sun, 26 Apr 2009 21:31:22 +0200 Subject: [ofa-general] [PATCH] ibutils: git-log calls have been changed to git log as git-xxx syntax is not working with latest git releases In-Reply-To: <49F4733A.3090603@dev.mellanox.co.il> References: <49ED7AAE.3010707@ext.bull.net> <20090422052845.GA13093@obsidianresearch.com> <49F4733A.3090603@dev.mellanox.co.il> Message-ID: <49F4B68A.9030000@morey-chaisemartin.com> Le 26/04/2009 16:44, Yevgeny Kliteynik a écrit : > diff --git a/ibdiag/src/Makefile.am b/ibdiag/src/Makefile.am > index def8b0a..d32d914 100644 > --- a/ibdiag/src/Makefile.am > +++ b/ibdiag/src/Makefile.am > @@ -42,7 +42,7 @@ GIT=$(shell which git) > > git_version.tcl : @MAINTAINER_MODE_TRUE@ FORCE > if test x$(GIT) != x ; then \ > - gitver=`cd $(srcdir) ; git-log | head -1 | cut -f2 -d\ `; \ > + gitver=`cd $(srcdir) ; rev-parse --verify HEAD`; \ > changes=`cd $(srcdir) ; git diff . 
| grep ^diff | wc -l`; \ > else \ > gitver=undefined; changes=0; \ I think you forgot a git in front of rev-parse here Nicolas From kliteyn at dev.mellanox.co.il Sun Apr 26 14:31:35 2009 From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik) Date: Mon, 27 Apr 2009 00:31:35 +0300 Subject: [ofa-general] [PATCH] ibutils: git-log calls have been changed to git log as git-xxx syntax is not working with latest git releases In-Reply-To: <49F4B68A.9030000@morey-chaisemartin.com> References: <49ED7AAE.3010707@ext.bull.net> <20090422052845.GA13093@obsidianresearch.com> <49F4733A.3090603@dev.mellanox.co.il> <49F4B68A.9030000@morey-chaisemartin.com> Message-ID: <49F4D2B7.2050503@dev.mellanox.co.il> Nicolas Morey-Chaisemartin wrote: > Le 26/04/2009 16:44, Yevgeny Kliteynik a écrit : >> diff --git a/ibdiag/src/Makefile.am b/ibdiag/src/Makefile.am >> index def8b0a..d32d914 100644 >> --- a/ibdiag/src/Makefile.am >> +++ b/ibdiag/src/Makefile.am >> @@ -42,7 +42,7 @@ GIT=$(shell which git) >> >> git_version.tcl : @MAINTAINER_MODE_TRUE@ FORCE >> if test x$(GIT) != x ; then \ >> - gitver=`cd $(srcdir) ; git-log | head -1 | cut -f2 -d\ `; \ >> + gitver=`cd $(srcdir) ; rev-parse --verify HEAD`; \ >> changes=`cd $(srcdir) ; git diff . | grep ^diff | wc -l`; \ >> else \ >> gitver=undefined; changes=0; \ > > I think you forgot a git in front of rev-parse here > Indeed, thanks. Fixed. -- Yevgeny > Nicolas > From jackm at dev.mellanox.co.il Sun Apr 26 23:18:34 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 27 Apr 2009 09:18:34 +0300 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules? 
In-Reply-To: <20090426180157.GA29727@obsidianresearch.com> References: <200904261431.18620.jackm@dev.mellanox.co.il> <20090426180157.GA29727@obsidianresearch.com> Message-ID: <200904270918.35281.jackm@dev.mellanox.co.il> On Sunday 26 April 2009 21:01, Jason Gunthorpe wrote: > > In general, you should not use OFED userspace libraries with non-OFED kernel distributions. > > That is hugely unfriendly and not really 'the linux way'.. > > Jason > I know. I did A LOT of work to avoid incompatibilities. This particular incompatibility went unnoticed until someone encountered it. In general, though, the OFED distributions are tightly integrated packages which have undergone extensive QA on many Linux distributions. Taking pieces of them and mixing them with non-OFED packages is not recommended, as no QA is done on such mixtures. - Jack From jackm at dev.mellanox.co.il Sun Apr 26 23:27:40 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Mon, 27 Apr 2009 09:27:40 +0300 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules? In-Reply-To: <20090426125827.GA6513@sk> References: <200904261431.18620.jackm@dev.mellanox.co.il> <20090426125827.GA6513@sk> Message-ID: <200904270927.40449.jackm@dev.mellanox.co.il> On Sunday 26 April 2009 15:58, Sasha Khapyorsky wrote: > On 14:31 Sun 26 Apr , Jack Morgenstein wrote: > > > > > > It should be compatible with the OFED 1.4 userspace. > > > > > Beware -- you should not use OFED userspace with a non-ofed kernel for ConnectX HCAs. > > I don't think that it is affected all userspace packages (personally I'm > never using OFED kernels). > > > In general, you should not use OFED userspace libraries with non-OFED kernel distributions. > > In general such requirement seems fundamentally bad for me. OFED goal is > to provide support for IB and iWARP, and not to develop its own linux > kernel. > > Sasha > The OFED distributions may contain features that the mainstream kernels and libraries do not support. 
These features frequently require changes in the InfiniBand kernel modules. Such changes are in the form of kernel patches which are applied to the base mainstream kernel on which the OFED release is based. A lag between the mainstream kernel and the OFED kernel is unavoidable, since the new features are first released in the OFED distributions -- and later, gradually (and hopefully), these features make their way into the upstream kernel. The integrated OFED package undergoes extensive QA. There is no QA performed on ofed/non-ofed mixtures. If you are using a non-OFED kernel, you should also be using non-OFED userspace libraries. If you need features that are only in OFED, you should be using the entire package and not pieces of it. - Jack From nicolas.morey-chaisemartin at ext.bull.net Mon Apr 27 00:35:47 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey-Chaisemartin) Date: Mon, 27 Apr 2009 09:35:47 +0200 Subject: [ofa-general] [PATCH] libibverbs: Fixed verbs_man_page.patch Message-ID: <49F56053.1010705@ext.bull.net> This fixes fixes/verbs_man_page.patch so it can be applied on HEAD.
Signed-off-by: Nicolas Morey-Chaisemartin --- fixes/verbs_man_page.patch | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/fixes/verbs_man_page.patch b/fixes/verbs_man_page.patch index b69acc6..f60188a 100644 --- a/fixes/verbs_man_page.patch +++ b/fixes/verbs_man_page.patch @@ -13,10 +13,10 @@ diff --git a/Makefile.am b/Makefile.am index 705b184..45914d3 100644 --- a/Makefile.am +++ b/Makefile.am -@@ -52,7 +52,7 @@ man_MANS = man/ibv_asyncwatch.1 man/ibv_devices.1 man/ibv_devinfo.1 \ - man/ibv_post_srq_recv.3 man/ibv_query_device.3 man/ibv_query_gid.3 \ - man/ibv_query_pkey.3 man/ibv_query_port.3 man/ibv_query_qp.3 \ - man/ibv_query_srq.3 man/ibv_rate_to_mult.3 man/ibv_reg_mr.3 \ +@@ -53,7 +53,7 @@ man_MANS = man/ibv_asyncwatch.1 man/ibv_devices.1 man/ibv_devinfo.1 \ + man/ibv_post_srq_recv.3 man/ibv_query_device.3 man/ibv_query_gid.3 \ + man/ibv_query_pkey.3 man/ibv_query_port.3 man/ibv_query_qp.3 \ + man/ibv_query_srq.3 man/ibv_rate_to_mult.3 man/ibv_reg_mr.3 \ - man/ibv_req_notify_cq.3 man/ibv_resize_cq.3 + man/ibv_req_notify_cq.3 man/ibv_resize_cq.3 man/verbs.7 -- 1.6.2.GIT From dorfman.eli at gmail.com Mon Apr 27 01:36:24 2009 From: dorfman.eli at gmail.com (Eli Dorfman (Voltaire)) Date: Mon, 27 Apr 2009 11:36:24 +0300 Subject: [ofa-general] Re: [PATCH v3 1/2] opensm: setup routing engine when in use and delete when fail In-Reply-To: <20090426131415.GC6513@sk> References: <20090312132137.GB8818@sashak.voltaire.com> <49B92C27.7060904@gmail.com> <20090312160528.GW8818@sashak.voltaire.com> <49BE0CC4.6030600@gmail.com> <20090317133548.GL12557@sashak.voltaire.com> <49C24AB4.9060505@gmail.com> <20090426104357.GA23250@sk> <49F43EF4.5070305@gmail.com> <20090426112731.GB23250@sk> <49F44A19.1060407@gmail.com> <20090426131415.GC6513@sk> Message-ID: <49F56E88.5090006@gmail.com> Sasha Khapyorsky wrote: > On 14:48 Sun 26 Apr , Eli Dorfman (Voltaire) wrote: >> delete() is called conditionally from destroy_routing_engines(osm_opensm_t *osm) >> >> if 
(r->delete) >> r->delete(r->context); >> >> Also all re(s) are cleared when created so delete is NULL if setup() was not called. > > Ok, you are partially right (I forgot that delete() is initialized only > in setup() phase), but what will happen with RE where setup() and > delete() were already called? Assumption that delete() will clear RE > again is wrong - it just destroys its internal data. Ok, so this may crash when opensm goes down. This can be fixed by: /* context is set to NULL after previous delete */ if (r->delete && r->context) r->delete(r->context); > >> So I don't see any problem here. > > Try with LASH or UPDN.... (ftree will not fail just occasionally - it > returns when context is NULL). > From vlad at lists.openfabrics.org Mon Apr 27 03:21:30 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Mon, 27 Apr 2009 03:21:30 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090427-0200 daily build status Message-ID: <20090427102131.3E2A8E610C5@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 
with linux-2.6.22
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.22.5-31-default
Passed on x86_64 with linux-2.6.9-42.ELsmp
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.9-67.ELsmp
Passed on x86_64 with linux-2.6.9-78.ELsmp
Passed on x86_64 with linux-2.6.9-55.ELsmp
Passed on ia64 with linux-2.6.16
Passed on ia64 with linux-2.6.16.21-0.8-default
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.17
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.25
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ppc64 with linux-2.6.16
Passed on ppc64 with linux-2.6.17
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.18-8.el5
Passed on ppc64 with linux-2.6.19

Failed:

From monis at Voltaire.COM Mon Apr 27 03:46:24 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Mon, 27 Apr 2009 13:46:24 +0300
Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules?
In-Reply-To: <200904270918.35281.jackm@dev.mellanox.co.il>
References: <200904261431.18620.jackm@dev.mellanox.co.il> <20090426180157.GA29727@obsidianresearch.com> <200904270918.35281.jackm@dev.mellanox.co.il>
Message-ID: <49F58D00.30908@Voltaire.COM>

Jack Morgenstein wrote:
> On Sunday 26 April 2009 21:01, Jason Gunthorpe wrote:
>>> In general, you should not use OFED userspace libraries with non-OFED kernel distributions.
>> That is hugely unfriendly and not really 'the linux way'..
>>
>> Jason
>>
> I know. I did A LOT of work to avoid incompatibilities. This particular incompatibility went unnoticed until someone encountered it.
>
> In general, though, the OFED distributions are tightly integrated packages which have undergone extensive QA on many Linux distributions.
> Taking pieces of them and mixing them with non-OFED packages is not recommended, as no QA is done on such mixtures.
>
> - Jack
> _______________________________________________
> general mailing list
> general at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>

So, is there an easy way for upstream kernel users who want user space functionality?

From nicolas.morey-chaisemartin at ext.bull.net Mon Apr 27 03:51:29 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey-Chaisemartin)
Date: Mon, 27 Apr 2009 12:51:29 +0200
Subject: [ofa-general] [PATCH] ibsim: Fixed custom release in SPEC file
Message-ID: <49F58E31.3020005@ext.bull.net>

Removed a space which makes rpmbuild fail when _dist and CUSTOM_RELEASE are set:
error: line 15: Tag takes single token only: Release: ofed1.4.1 .fc11
This is due to Release: %rel%{?dist} and %rel having a trailing whitespace.

Signed-off-by: Nicolas Morey-Chaisemartin
---
 ibsim.spec.in | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/ibsim.spec.in b/ibsim.spec.in
index b787248..d6ec898 100644
--- a/ibsim.spec.in
+++ b/ibsim.spec.in
@@ -1,6 +1,6 @@
 %define RELEASE @RELEASE@
-%define rel %{?CUSTOM_RELEASE} %{!?CUSTOM_RELEASE:%RELEASE}
+%define rel %{?CUSTOM_RELEASE}%{!?CUSTOM_RELEASE:%RELEASE}
 Summary: InfiniBand fabric simulator for management
 Name: ibsim
--
1.6.2.GIT

From jigar.halani at wipro.com Mon Apr 27 03:51:23 2009
From: jigar.halani at wipro.com (jigar.halani at wipro.com)
Date: Mon, 27 Apr 2009 16:21:23 +0530
Subject: [ofa-general] OFED for Solaris
Message-ID: <31BCB8E2EBCE02479FF08001B81F16C5025403AD@blr-mrd-msg.wipro.com>

Hi all,

I am trying to install OFED drivers on Solaris 10.
I saw the release notes, but Solaris is not in the list at all :(

Could anyone please let me know how to install the same on Solaris? Solaris also comes with pre-installed packages, which are drivers for the HCA, but I am not able to get connectivity between the servers and the switch.

I would appreciate an early response.

--
Thanks and regards,
Jigar Halani

Please do not print this email unless it is absolutely necessary.

The information contained in this electronic message and any attachments to this message are intended for the exclusive use of the addressee(s) and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately and destroy all copies of this message and any attachments.

WARNING: Computer viruses can be transmitted via email. The recipient should check this email and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email.

www.wipro.com

From nicolas.morey-chaisemartin at ext.bull.net Mon Apr 27 03:59:05 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey-Chaisemartin)
Date: Mon, 27 Apr 2009 12:59:05 +0200
Subject: [ofa-general] [PATCH] management: Fixed custom_release in SPEC files
Message-ID: <49F58FF9.8070608@ext.bull.net>

Removed a space which makes rpmbuild fail when _dist and CUSTOM_RELEASE are set:
error: line 15: Tag takes single token only: Release: ofed1.4.1 .fc11
This is due to Release: %rel%{?dist} and %rel having a trailing whitespace.
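
The failure described above can be illustrated with a toy model of the macro expansion. This is only a sketch under stated assumptions: `expand_rel` and `release_tag` are hypothetical helpers written for this note, not part of rpm or rpmbuild, and they mimic just the two conditional forms that appear in the spec files (`%{?NAME}` expands to the value when the macro is defined, `%{!?NAME:text}` expands to the text when it is not). The values `CUSTOM_RELEASE=ofed1.4.1` and `dist=.fc11` are taken from the error message; `RELEASE=1` is an arbitrary placeholder.

```python
def expand_rel(template, macros):
    """Expand the body of `%define rel ...` for a given set of macros.

    Toy model only: it understands %{?CUSTOM_RELEASE} and
    %{!?CUSTOM_RELEASE:%RELEASE}; it is not rpm's real macro engine.
    """
    custom = macros.get("CUSTOM_RELEASE")      # None means "not defined"
    release = macros.get("RELEASE", "")
    out = template.replace("%{?CUSTOM_RELEASE}",
                           custom if custom is not None else "")
    out = out.replace("%{!?CUSTOM_RELEASE:%RELEASE}",
                      "" if custom is not None else release)
    return out


def release_tag(rel_template, macros):
    """Value of `Release: %rel%{?dist}` after expansion."""
    return expand_rel(rel_template, macros) + macros.get("dist", "")


# The two definitions of `rel`, before and after the patch.
OLD = "%{?CUSTOM_RELEASE} %{!?CUSTOM_RELEASE:%RELEASE}"
NEW = "%{?CUSTOM_RELEASE}%{!?CUSTOM_RELEASE:%RELEASE}"

# Hypothetical build settings: CUSTOM_RELEASE and dist both set.
macros = {"CUSTOM_RELEASE": "ofed1.4.1", "RELEASE": "1", "dist": ".fc11"}

for name, template in (("old", OLD), ("new", NEW)):
    tag = release_tag(template, macros)
    print(name, repr(tag), "tokens:", len(tag.split()))
```

With the old definition the space between the two conditionals survives as a trailing space in `rel`, so the Release value splits into two tokens once `%{?dist}` is appended; with the patched definition it stays a single token.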
Signed-off-by: Nicolas Morey-Chaisemartin --- infiniband-diags/infiniband-diags.spec.in | 2 +- libibmad/libibmad.spec.in | 2 +- libibumad/libibumad.spec.in | 2 +- opensm/opensm.spec.in | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/infiniband-diags/infiniband-diags.spec.in b/infiniband-diags/infiniband-diags.spec.in index 3791eb4..4bbd907 100644 --- a/infiniband-diags/infiniband-diags.spec.in +++ b/infiniband-diags/infiniband-diags.spec.in @@ -1,6 +1,6 @@ %define RELEASE @RELEASE@ -%define rel %{?CUSTOM_RELEASE} %{!?CUSTOM_RELEASE:%RELEASE} +%define rel %{?CUSTOM_RELEASE}%{!?CUSTOM_RELEASE:%RELEASE} Summary: OpenFabrics Alliance InfiniBand Diagnostic Tools Name: infiniband-diags diff --git a/libibmad/libibmad.spec.in b/libibmad/libibmad.spec.in index 5fd10f6..1b556aa 100644 --- a/libibmad/libibmad.spec.in +++ b/libibmad/libibmad.spec.in @@ -1,6 +1,6 @@ %define RELEASE @RELEASE@ -%define rel %{?CUSTOM_RELEASE} %{!?CUSTOM_RELEASE:%RELEASE} +%define rel %{?CUSTOM_RELEASE}%{!?CUSTOM_RELEASE:%RELEASE} Summary: OpenFabrics Alliance InfiniBand MAD library Name: libibmad diff --git a/libibumad/libibumad.spec.in b/libibumad/libibumad.spec.in index 7732edd..b01757f 100644 --- a/libibumad/libibumad.spec.in +++ b/libibumad/libibumad.spec.in @@ -1,6 +1,6 @@ %define RELEASE @RELEASE@ -%define rel %{?CUSTOM_RELEASE} %{!?CUSTOM_RELEASE:%RELEASE} +%define rel %{?CUSTOM_RELEASE}%{!?CUSTOM_RELEASE:%RELEASE} Summary: OpenFabrics Alliance InfiniBand umad (user MAD) library Name: libibumad diff --git a/opensm/opensm.spec.in b/opensm/opensm.spec.in index 7b82faf..3c89c34 100644 --- a/opensm/opensm.spec.in +++ b/opensm/opensm.spec.in @@ -1,5 +1,5 @@ %define RELEASE @RELEASE@ -%define rel %{?CUSTOM_RELEASE} %{!?CUSTOM_RELEASE:%RELEASE} +%define rel %{?CUSTOM_RELEASE}%{!?CUSTOM_RELEASE:%RELEASE} %if %{?_with_console_socket:1}%{!?_with_console_socket:0} %define _enable_console_socket --enable-console-socket %endif -- 1.6.2.GIT From bart.vanassche at gmail.com Mon Apr 27 
04:03:17 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Mon, 27 Apr 2009 13:03:17 +0200 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules? In-Reply-To: <200904270927.40449.jackm@dev.mellanox.co.il> References: <200904261431.18620.jackm@dev.mellanox.co.il> <20090426125827.GA6513@sk> <200904270927.40449.jackm@dev.mellanox.co.il> Message-ID: On Mon, Apr 27, 2009 at 8:27 AM, Jack Morgenstein wrote: > The OFED distributions may contain features that the mainstream kernels and libraries do not support. > These features frequently require changes in the Infiniband kernel modules.  Such changes are in the form > of kernel patches which are applied to the base mainstream kernel on which the OFED release is based. > A lag between the mainstream kernel and the OFED kernel is unavoidable, since the new features are first > released in the OFED distributions -- and later, gradually (and hopefully), these features make there way > into the upstream kernel. I don't doubt that there is a good reason why new features go in the OFED distribution first and later in the mainstream Linux kernel. But it's not clear to me why this process has been chosen. There is wide agreement in the Linux kernel community that new kernel code should go first in the mainstream Linux kernel and from there to the various Linux distributions, and not the other way around. This is called the "upstream first" policy. 
One of the most highly regarded kernel maintainers (James Bottomley) wrote the following about the "upstream first" policy: * Major distributions have agreed not to incorporate features or drivers unless they are on “upstream track” for the vanilla Linux Kernel - Obviously there’s some flexibility in interpretation of this for their best customers * Primary reason is that it keeps the distribution kernel code and the vanilla kernel code as close as possible, so - Maintenance is reduced: the distro can file a bug with the upstream maintainer if there’s a problem. - Testing is enhanced: users of all distributions are testing the same code - Code Review burden is greatly reduced: Can rely on upstream maintainers to review and accept. More information about the "upstream first" policy can be found here: * James Bottomley, Hacking the Linux Kernel for Fun and Profit, 5 April 2008, http://www.flourishconf.com/flourish2008/images/downloads/flourish2008-jamesbottomley-hackingthelinuxkernel.pdf. * Jonathan Corbet, A Guide to the Linux Kernel Development Process, 2008, http://lwn.net/talks/lfeu2008/devproc/index.html. Bart. From celine.bourde at ext.bull.net Mon Apr 27 03:56:46 2009 From: celine.bourde at ext.bull.net (Celine Bourde) Date: Mon, 27 Apr 2009 12:56:46 +0200 Subject: [Fwd: Re: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition]] In-Reply-To: <49F207A0.6090507@mellanox.com> References: <49F02280.7010005@ext.bull.net> <49F07710.3070002@opengridcomputing.com> <49F19ECE.9080007@ext.bull.net> <49F1CF6C.3090703@opengridcomputing.com> <49F1FCEF.3030305@mellanox.com> <49F207A0.6090507@mellanox.com> Message-ID: <49F58F6E.7080400@ext.bull.net> Thanks for the explanation. Let me know if you have additional information. We have a contact at Mellanox. I will contact him. Thanks, Céline. Vu Pham wrote: > Celine, > > I'm seeing mlx4 in the log so it is connectX. 
> > nfsrdma does not work with any official connectX' fw release 2.6.0 > because of fast registering work request problems between nfsrdma and > the firmware. > > We are currently debugging/fixing those problems. > > Do you have direct contact with Mellanox field application engineer? > Please contact him/her. > If not I can send you a contact on private channel. > > thanks, > -vu > >> Hi Celine, >> >> What HCA do you have on your system? Is it ConnectX? If yes, what is >> its firmware version? >> >> -vu >> >>> Hey Celine, >>> >>> Thanks for gathering all this info! So the rdma connections work >>> fine with everything _but_ nfsrdma. And errno 103 indicates the >>> connection was aborted, maybe by the server (since no failures are >>> logged by the client). >>> >>> >>> More below: >>> >>> >>> Celine Bourde wrote: >>>> Hi Steve, >>>> >>>> This email summarizes the situation: >>>> >>>> Standard mount -> OK >>>> --------------------- >>>> >>>> [root at twind ~]# mount -o rw 192.168.0.215:/vol0 /mnt/ >>>> Command works fine. >>>> >>>> rdma mount -> KO >>>> ----------------- >>>> >>>> [root at twind ~]# mount -o rdma,port=2050 192.168.0.215:/vol0 /mnt/ >>>> Command blocks ! I should perform Ctr+C to kill process. >>>> >>>> or >>>> >>>> [root at twind ofa_kernel-1.4.1]# strace mount.nfs 192.168.0.215:/vol0 >>>> /mnt/ -o rdma,port=2050 >>>> [..] 
>>>> fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 >>>> connect(3, {sa_family=AF_INET, sin_port=htons(610), >>>> sin_addr=inet_addr("127.0.0.1")}, 16) = 0 >>>> fcntl(3, F_SETFL, O_RDWR) = 0 >>>> sendto(3, >>>> "-3\245\357\0\0\0\0\0\0\0\2\0\1\206\270\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0"..., >>>> 40, 0, {sa_family=AF_INET, sin_port=htons(610), >>>> sin_addr=inet_addr("127.0.0.1")}, 16) = 40 >>>> poll([{fd=3, events=POLLIN}], 1, 3000) = 1 ([{fd=3, revents=POLLIN}]) >>>> recvfrom(3, "-3\245\357\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", >>>> 8800, MSG_DONTWAIT, {sa_family=AF_INET, sin_port=htons(610), >>>> sin_addr=inet_addr("127.0.0.1")}, [16]) = 24 >>>> close(3) = 0 >>>> mount("192.168.0.215:/vol0", "/mnt", "nfs", 0, >>>> "rdma,port=2050,addr=192.168.0.215" >>>> ..same problem >>>> >>>> [root at twind tmp]# dmesg >>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots >>>> 32 ird 16 >>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots >>>> 32 ird 16 >>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots >>>> 32 ird 16 >>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots >>>> 32 ird 16 >>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >>>> >>>> >>> >>> Is there anything logged on the server side? >>> >>> Also, can you try this again, but on both systems do this before >>> attempting the mount: >>> >>> echo 32768 > /proc/sys/sunrpc/rpc_debug >>> >>> This will enable all the rpc trace points and add a bunch of logging >>> to /var/log/messages. >>> Maybe that will show us something. It think the server is aborting >>> the connection for some reason. >>> >>> Steve. 
>>> >>> >>> >>> >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit >>> http://openib.org/mailman/listinfo/openib-general >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > From hnrose at comcast.net Mon Apr 27 04:06:19 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Mon, 27 Apr 2009 07:06:19 -0400 Subject: [ofa-general] [PATCH] libibmad: Add support for SA PathRecord SL field Message-ID: <20090427110619.GA22089@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index 0e47ccf..c74cb1d 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -500,6 +500,7 @@ enum MAD_FIELDS { IB_SA_PR_DLID_F, IB_SA_PR_SLID_F, IB_SA_PR_NPATH_F, + IB_SA_PR_SL_F, /* * MC Member rec diff --git a/libibmad/src/fields.c b/libibmad/src/fields.c index c24bc12..81693a2 100644 --- a/libibmad/src/fields.c +++ b/libibmad/src/fields.c @@ -305,6 +305,7 @@ static const ib_field_t ib_mad_f[] = { {BITSOFFS(320, 16), "PathRecDLid", mad_dump_uint}, {BITSOFFS(336, 16), "PathRecSLid", mad_dump_uint}, {BITSOFFS(393, 7), "PathRecNumPath", mad_dump_uint}, + {BITSOFFS(428, 4), "PathRecSL", mad_dump_uint}, /* * MC Member rec diff --git a/libibmad/src/resolve.c b/libibmad/src/resolve.c index 691bdc3..f17da11 100644 --- a/libibmad/src/resolve.c +++ 
b/libibmad/src/resolve.c @@ -59,6 +59,7 @@ int ib_resolve_smlid_via(ib_portid_t * sm_id, int timeout, return -1; mad_decode_field(portinfo, IB_PORT_SMLID_F, &lid); + mad_decode_field(portinfo, IB_PORT_SMSL_F, &sm_id->sl); return ib_portid_set(sm_id, lid, 0, 0); } @@ -74,12 +75,23 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, { ib_portid_t sm_portid; char buf[IB_SA_DATA_SIZE] = { 0 }; + ib_portid_t self = { 0 }; + uint64_t selfguid; + ibmad_gid_t selfgid; + uint8_t nodeinfo[64]; if (!sm_id) { sm_id = &sm_portid; if (ib_resolve_smlid_via(sm_id, timeout, srcport) < 0) return -1; } + + if (!smp_query_via(nodeinfo, &self, IB_ATTR_NODE_INFO, 0, 0, srcport)) + return -1; + mad_decode_field(nodeinfo, IB_NODE_PORT_GUID_F, &selfguid); + mad_set_field64(selfgid, 0, IB_GID_PREFIX_F, IB_DEFAULT_SUBN_PREFIX); + mad_set_field64(selfgid, 0, IB_GID_GUID_F, selfguid); + if (*(uint64_t *) & portid->gid == 0) mad_set_field64(portid->gid, 0, IB_GID_PREFIX_F, IB_DEFAULT_SUBN_PREFIX); @@ -87,10 +99,11 @@ int ib_resolve_guid_via(ib_portid_t * portid, uint64_t * guid, mad_set_field64(portid->gid, 0, IB_GID_GUID_F, *guid); if ((portid->lid = - ib_path_query_via(srcport, portid->gid, portid->gid, sm_id, + ib_path_query_via(srcport, selfgid, portid->gid, sm_id, buf)) < 0) return -1; + mad_decode_field(buf, IB_SA_PR_SL_F, &portid->sl); return 0; } @@ -167,6 +180,7 @@ int ib_resolve_self_via(ib_portid_t * portid, int *portnum, ibmad_gid_t * gid, return -1; mad_decode_field(portinfo, IB_PORT_LID_F, &portid->lid); + mad_decode_field(portinfo, IB_PORT_SMSL_F, &portid->sl); mad_decode_field(portinfo, IB_PORT_GID_PREFIX_F, &prefix); mad_decode_field(nodeinfo, IB_NODE_PORT_GUID_F, &guid); diff --git a/libibmad/src/rpc.c b/libibmad/src/rpc.c index 07b623d..21fcc9a 100644 --- a/libibmad/src/rpc.c +++ b/libibmad/src/rpc.c @@ -187,7 +187,7 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, ib_portid_t * dport, void *payload, void *rcvdata) { int status, len; - uint8_t 
sndbuf[1024], rcvbuf[1024], *mad; + uint8_t sndbuf[1024], rcvbuf[1024], *mad, mgmtclass; int timeout, retries; len = 0; @@ -209,7 +209,18 @@ void *mad_rpc(const struct ibmad_port *port, ib_rpc_t * rpc, mad = umad_get_mad(rcvbuf); - if ((status = mad_get_field(mad, 0, IB_DRSMP_STATUS_F)) != 0) { + status = mad_get_field(mad, 0, IB_MAD_STATUS_F); + mgmtclass = mad_get_field(mad, 0, IB_MAD_MGMTCLASS_F); + if (mgmtclass == IB_SMI_DIRECT_CLASS) + status &= 0x7fff; + else if (mgmtclass != IB_SMI_CLASS) { + if (status & 2) { + ERRS("MAD redirection not supported; dport (%s)", + portid2str(dport)); + return 0; + } + } + if (status) { ERRS("MAD completed with error status 0x%x; dport (%s)", status, portid2str(dport)); return 0; @@ -254,8 +265,12 @@ void *mad_rpc_rmpp(const struct ibmad_port *port, ib_rpc_t * rpc, mad = umad_get_mad(rcvbuf); if ((status = mad_get_field(mad, 0, IB_MAD_STATUS_F)) != 0) { - ERRS("MAD completed with error status 0x%x; dport (%s)", - status, portid2str(dport)); + if (status & 2) + ERRS("MAD redirection not supported; dport (%s)", + portid2str(dport)); + else + ERRS("MAD completed with error status 0x%x; dport (%s)", + status, portid2str(dport)); return 0; } From hnrose at comcast.net Mon Apr 27 04:08:32 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Mon, 27 Apr 2009 07:08:32 -0400 Subject: [ofa-general] [PATCH] opensm: Changes to spec and make files for updated release notes Message-ID: <20090427110832.GA22098@comcast.net> Signed-off-by: Hal Rosenstock --- diff --git a/opensm/Makefile.am b/opensm/Makefile.am index 75b6dc5..88ff9da 100644 --- a/opensm/Makefile.am +++ b/opensm/Makefile.am @@ -16,7 +16,7 @@ man_MANS = man/opensm.8 man/osmtest.8 various_scripts = $(wildcard scripts/*) docs = doc/performance-manager-HOWTO.txt doc/QoS_management_in_OpenSM.txt \ - doc/opensm_release_notes-3.2.txt + doc/opensm_release_notes-3.3.txt EXTRA_DIST = autogen.sh opensm.spec $(various_scripts) $(man_MANS) $(docs) diff --git a/opensm/opensm.spec.in 
b/opensm/opensm.spec.in
index 7b82faf..6c05d50 100644
--- a/opensm/opensm.spec.in
+++ b/opensm/opensm.spec.in
@@ -124,7 +124,7 @@ fi
 %{_sbindir}/opensm
 %{_sbindir}/osmtest
 %{_mandir}/man8/*
-%doc AUTHORS COPYING README doc/performance-manager-HOWTO.txt doc/QoS_management_in_OpenSM.txt doc/opensm_release_notes-3.2.txt
+%doc AUTHORS COPYING README doc/performance-manager-HOWTO.txt doc/QoS_management_in_OpenSM.txt doc/opensm_release_notes-3.3.txt
 %{_sysconfdir}/init.d/opensmd
 %{_sbindir}/sldd.sh
 %config(noreplace) %{_sysconfdir}/logrotate.d/opensm

From ogerlitz at Voltaire.com Mon Apr 27 05:19:56 2009
From: ogerlitz at Voltaire.com (Or Gerlitz)
Date: Mon, 27 Apr 2009 15:19:56 +0300
Subject: [ofa-general] Re: [PATCH v2] rdma_cm: Add debugfs entries to monitor rdma_cm connections
In-Reply-To: <49F42D40.5000200@Voltaire.COM>
References: <49F05AAE.4020606@Voltaire.COM> <90A07B5E8FAD49C187EAE583A976A3CD@amr.corp.intel.com> <49F42D40.5000200@Voltaire.COM>
Message-ID: <49F5A2EC.3050807@Voltaire.com>

Moni Shoua wrote:
> Create a virtual file under debugfs for each cma device and use it to print
> information about each rdma_id that is attached to this device.

If you create a virtual file for each device, where are you going to print listener IDs which aren't bound to any specific device?

> Here is an example of 'cat /sys/kernel/debug/rdma_cm/mthca0_rdma_id'
> TYPE DEVICE PORT NET_DEV SRC_ADDR           DST_ADDR           SPACE STATE   QP_NUM
>      mthca0 0            0.0.0.0:7174                          TCP   LISTEN  0
> IB   mthca0 1    ib0     192.30.3.249:46079 192.30.3.248:7174  TCP   CONNECT 132102
> IB   mthca0 1    ib0     192.30.3.249:7174  192.30.3.248:42561 TCP   CONNECT 132103

First, if by definition this file relates to device mthca0, why print mthca0 in the output? Second, if a listener is bound to mthca0 then its src address can't be 0.0.0.0 - correct?

I'd like to see us come up with a solution with somewhat less spacing between columns; maybe take a look at the netstat code, e.g. the part that generates output as below.
Applying something like the -p option of netstat would be cool; to that end you can have rdma_cm keep track of the rdma_create_id caller's current->pid

> # netstat -natup
> Active Internet connections (servers and established)
> Proto Recv-Q Send-Q Local Address            Foreign Address            State       PID/Program name
> tcp   0      0      127.0.0.1:2208           0.0.0.0:*                  LISTEN      5979/hpiod
> tcp   0      0      0.0.0.0:802              0.0.0.0:*                  LISTEN      5708/rpc.statd
> tcp   0      0      0.0.0.0:111              0.0.0.0:*                  LISTEN      5667/portmap
> tcp   0      0      0.0.0.0:58385            0.0.0.0:*                  LISTEN      -
> tcp   0      0      127.0.0.1:631            0.0.0.0:*                  LISTEN      6020/cupsd
> tcp   0      0      127.0.0.1:25             0.0.0.0:*                  LISTEN      6045/sendmail: acce
> tcp   0      0      127.0.0.1:2207           0.0.0.0:*                  LISTEN      5984/python
> tcp   0      0      172.30.49.1:47609        172.30.49.2:1000           ESTABLISHED 18567/qperf
> tcp   0      0      172.30.49.1:58935        172.30.49.2:22             ESTABLISHED 18123/ssh
> tcp   0      0      :::1003                  :::*                       LISTEN      18562/qperf
> tcp   0      0      :::22                    :::*                       LISTEN      6004/sshd
> tcp   0      0      ::ffff:172.30.49.1:22    ::ffff:172.25.5.138:60437  ESTABLISHED 17734/0
> tcp   0      0      ::ffff:172.30.49.1:1003  ::ffff:172.30.49.3:45282   ESTABLISHED 18563/qperf
> tcp   0      0      ::ffff:10.10.5.157:22    ::ffff:10.10.0.90:58802    ESTABLISHED 18226/2
> tcp   0      0      ::ffff:172.30.49.1:22    ::ffff:172.25.5.138:56035  ESTABLISHED 11932/1

From ogerlitz at Voltaire.com Mon Apr 27 05:24:27 2009
From: ogerlitz at Voltaire.com (Or Gerlitz)
Date: Mon, 27 Apr 2009 15:24:27 +0300
Subject: [ofa-general] Re: rdma_cm debugfs
In-Reply-To: <49F42D40.5000200@Voltaire.COM>
References: <49F05AAE.4020606@Voltaire.COM> <90A07B5E8FAD49C187EAE583A976A3CD@amr.corp.intel.com> <49F42D40.5000200@Voltaire.COM>
Message-ID: <49F5A3FB.2060504@Voltaire.com>

Moni Shoua wrote:
> Here is an example of 'cat /sys/kernel/debug/rdma_cm/mthca0_rdma_id'.
> TYPE DEVICE PORT NET_DEV SRC_ADDR           DST_ADDR           SPACE STATE   QP_NUM
>      mthca0 0            0.0.0.0:7174                          TCP   LISTEN  0
> IB   mthca0 1    ib0     192.30.3.249:46079 192.30.3.248:7174  TCP   CONNECT 132102
> IB   mthca0 1    ib0     192.30.3.249:7174  192.30.3.248:42561 TCP   CONNECT 132103

Moni,

Are you also planning to print src/dst GUIDs and LIDs along with PKEY and SL?

Also, UFM-agent-wise, things would be easier if the connection information was provided in a more packed (or even binary) manner, since the way it goes now, one would have to write a parser for your output (and in my suggestion one would have to write a parser for the debugfs output, but we can keep this parsing app and not provide it). To better understand what I'm talking about, compare the output of netstat vs. /proc/net/tcp

Or.

From tziporet at dev.mellanox.co.il Mon Apr 27 05:43:34 2009
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 27 Apr 2009 15:43:34 +0300
Subject: [ofa-general] OFED for Solaris
In-Reply-To: <31BCB8E2EBCE02479FF08001B81F16C5025403AD@blr-mrd-msg.wipro.com>
References: <31BCB8E2EBCE02479FF08001B81F16C5025403AD@blr-mrd-msg.wipro.com>
Message-ID: <49F5A876.7080605@mellanox.co.il>

jigar.halani at wipro.com wrote:
>
> I am trying to install OFED drivers on Solaris 10. I saw the release
> notes, but Solaris is not in the list at all :(
>
> Could anyone please let me know how to install the same on Solaris?
> Solaris also comes with pre-installed packages, which are drivers for
> the HCA, but I am not able to get connectivity between servers and switch.
>

Solaris has its own IB SW stack, provided by Sun.
They do support OFED user space verbs as far as I know.
You should contact Sun for more info.

Tziporet

From nicolas.morey-chaisemartin at ext.bull.net Mon Apr 27 05:46:40 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey-Chaisemartin)
Date: Mon, 27 Apr 2009 14:46:40 +0200
Subject: [ofa-general] [PATCH] OpenSM: include/vendor/osm_vendor.h - Replaced #elif with no condition by #else
Message-ID: <49F5A930.1030102@ext.bull.net>

Signed-off-by: Nicolas Morey-Chaisemartin
---
OpenSM build fails on FC11 without this patch (I guess the latest gcc considers this an error)

 opensm/include/vendor/osm_vendor.h | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/include/vendor/osm_vendor.h b/opensm/include/vendor/osm_vendor.h
index 4d0ae4c..3cde781 100644
--- a/opensm/include/vendor/osm_vendor.h
+++ b/opensm/include/vendor/osm_vendor.h
@@ -65,7 +65,7 @@
 #include
 #elif defined( OSM_VENDOR_INTF_AL )
 #include
-#elif
+#else
 #error No MAD Interface selected!
 #error Choose an interface in osm_config.h
 #endif
--
1.6.2.GIT

From jigar.halani at wipro.com Mon Apr 27 05:46:40 2009
From: jigar.halani at wipro.com (jigar.halani at wipro.com)
Date: Mon, 27 Apr 2009 18:16:40 +0530
Subject: [ofa-general] OFED for Solaris
In-Reply-To: <49F5A876.7080605@mellanox.co.il>
References: <31BCB8E2EBCE02479FF08001B81F16C5025403AD@blr-mrd-msg.wipro.com> <49F5A876.7080605@mellanox.co.il>
Message-ID: <31BCB8E2EBCE02479FF08001B81F16C50254045A@blr-mrd-msg.wipro.com>

Hi Tziporet,

Thanks for the answer, but I am not able to find any details on this. The software stack is installed on the OS but it is just not working, or it is not giving me some basic commands at all.

Has anybody else had experience with this?
--
Regards,
Jigar Halani

Spirit of Wipro: Intensity to Win | Act with Sensitivity | Unyielding Integrity

-----Original Message-----
From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il]
Sent: Monday, April 27, 2009 6:14 PM
To: Jigar Halani (WI01 - TIS - Services)
Cc: general at lists.openfabrics.org
Subject: Re: [ofa-general] OFED for Solaris

jigar.halani at wipro.com wrote:
>
> I am trying to install OFED drivers on Solaris 10. I saw the release
> notes, but Solaris is not in the list at all :(
>
> Could anyone please let me know how to install the same on Solaris?
> Solaris also comes with pre-installed packages, which are drivers for
> the HCA, but I am not able to get connectivity between servers and switch.
>

Solaris has its own IB SW stack, provided by Sun.
They do support OFED user space verbs as far as I know.
You should contact Sun for more info.

Tziporet
From tmtalpey at gmail.com Mon Apr 27 05:47:56 2009
From: tmtalpey at gmail.com (Tom Talpey)
Date: Mon, 27 Apr 2009 08:47:56 -0400
Subject: [Fwd: Re: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition]]
In-Reply-To: <49F58F6E.7080400@ext.bull.net>
References: <49F02280.7010005@ext.bull.net> <49F07710.3070002@opengridcomputing.com> <49F19ECE.9080007@ext.bull.net> <49F1CF6C.3090703@opengridcomputing.com> <49F1FCEF.3030305@mellanox.com> <49F207A0.6090507@mellanox.com> <49F58F6E.7080400@ext.bull.net>
Message-ID: <49f5a9c8.0e35640a.61a5.7852@mx.google.com>

At 06:56 AM 4/27/2009, Celine Bourde wrote:
>Thanks for the explanation.
>Let me know if you have additional information.
>
>We have a contact at Mellanox. I will contact him.
>
>Thanks,
>
>Céline.
>
>Vu Pham wrote:
>> Celine,
>>
>> I'm seeing mlx4 in the log so it is connectX.
>>
>> nfsrdma does not work with any official connectX' fw release 2.6.0
>> because of fast registering work request problems between nfsrdma and
>> the firmware.

There is a very simple workaround if you don't have the latest mlx4 firmware.

Just set the client to use the all-physical memory registration mode. This will avoid making unsupported reregistration requests, which the firmware advertised.

Before mounting, enter (as root)

        sysctl -w sunrpc.rdma_memreg_strategy=6

The client should work properly after this.

If you do have access to the fixed firmware, I recommend using the default setting (5) as it provides greater safety on the client.

Tom.

>>
>> We are currently debugging/fixing those problems.
>>
>> Do you have direct contact with Mellanox field application engineer?
>> Please contact him/her.
>> If not I can send you a contact on private channel.
>>
>> thanks,
>> -vu
>>
>>> Hi Celine,
>>>
>>> What HCA do you have on your system? Is it ConnectX? If yes, what is
>>> its firmware version?
>>>
>>> -vu
>>>
>>>> Hey Celine,
>>>>
>>>> Thanks for gathering all this info!
So the rdma connections work >>>> fine with everything _but_ nfsrdma. And errno 103 indicates the >>>> connection was aborted, maybe by the server (since no failures are >>>> logged by the client). >>>> >>>> >>>> More below: >>>> >>>> >>>> Celine Bourde wrote: >>>>> Hi Steve, >>>>> >>>>> This email summarizes the situation: >>>>> >>>>> Standard mount -> OK >>>>> --------------------- >>>>> >>>>> [root at twind ~]# mount -o rw 192.168.0.215:/vol0 /mnt/ >>>>> Command works fine. >>>>> >>>>> rdma mount -> KO >>>>> ----------------- >>>>> >>>>> [root at twind ~]# mount -o rdma,port=2050 192.168.0.215:/vol0 /mnt/ >>>>> Command blocks ! I should perform Ctr+C to kill process. >>>>> >>>>> or >>>>> >>>>> [root at twind ofa_kernel-1.4.1]# strace mount.nfs 192.168.0.215:/vol0 >>>>> /mnt/ -o rdma,port=2050 >>>>> [..] >>>>> fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 >>>>> connect(3, {sa_family=AF_INET, sin_port=htons(610), >>>>> sin_addr=inet_addr("127.0.0.1")}, 16) = 0 >>>>> fcntl(3, F_SETFL, O_RDWR) = 0 >>>>> sendto(3, >>>>> >"-3\245\357\0\0\0\0\0\0\0\2\0\1\206\270\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0"..., >>>>> 40, 0, {sa_family=AF_INET, sin_port=htons(610), >>>>> sin_addr=inet_addr("127.0.0.1")}, 16) = 40 >>>>> poll([{fd=3, events=POLLIN}], 1, 3000) = 1 ([{fd=3, revents=POLLIN}]) >>>>> recvfrom(3, "-3\245\357\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", >>>>> 8800, MSG_DONTWAIT, {sa_family=AF_INET, sin_port=htons(610), >>>>> sin_addr=inet_addr("127.0.0.1")}, [16]) = 24 >>>>> close(3) = 0 >>>>> mount("192.168.0.215:/vol0", "/mnt", "nfs", 0, >>>>> "rdma,port=2050,addr=192.168.0.215" >>>>> ..same problem >>>>> >>>>> [root at twind tmp]# dmesg >>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots >>>>> 32 ird 16 >>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots >>>>> 32 ird 16 >>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >>>>> rpcrdma: connection to 
192.168.0.215:2050 on mlx4_0, memreg 5 slots >>>>> 32 ird 16 >>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots >>>>> 32 ird 16 >>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >>>>> >>>>> >>>> >>>> Is there anything logged on the server side? >>>> >>>> Also, can you try this again, but on both systems do this before >>>> attempting the mount: >>>> >>>> echo 32768 > /proc/sys/sunrpc/rpc_debug >>>> >>>> This will enable all the rpc trace points and add a bunch of logging >>>> to /var/log/messages. >>>> Maybe that will show us something. It think the server is aborting >>>> the connection for some reason. >>>> >>>> Steve. >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> general mailing list >>>> general at lists.openfabrics.org >>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>> >>>> To unsubscribe, please visit >>>> http://openib.org/mailman/listinfo/openib-general >>> >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit >>> http://openib.org/mailman/listinfo/openib-general >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general >> >> > >_______________________________________________ >general mailing list >general at lists.openfabrics.org >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From monis at Voltaire.COM Mon Apr 27 06:10:46 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Mon, 27 Apr 2009 16:10:46 +0300 Subject: [ofa-general] Re: [PATCH v2] 
 rdma_cm: Add debugfs entries to monitor rdma_cm connections
In-Reply-To: <49F5A2EC.3050807@Voltaire.com>
References: <49F05AAE.4020606@Voltaire.COM> <90A07B5E8FAD49C187EAE583A976A3CD@amr.corp.intel.com> <49F42D40.5000200@Voltaire.COM> <49F5A2EC.3050807@Voltaire.com>
Message-ID: <49F5AED6.4070208@Voltaire.COM>

Or Gerlitz wrote:
> Moni Shoua wrote:
>> Create a virtual file under debugfs for each cma device and use it to print
>> information about each rdma_id that is attached to this device.
>
> If you create virtual file for each device, where are you going to print
> listener IDs which aren't bind to any specific device?

A listener that listens on all devices will appear on all devices.

>
>> Here is an example of 'cat /sys/kernel/debug/rdma_cm/mthca0_rdma_id'
>> TYPE DEVICE PORT NET_DEV SRC_ADDR           DST_ADDR           SPACE STATE   QP_NUM
>>      mthca0 0            0.0.0.0:7174                          TCP   LISTEN  0
>> IB   mthca0 1    ib0     192.30.3.249:46079 192.30.3.248:7174  TCP   CONNECT 132102
>> IB   mthca0 1    ib0     192.30.3.249:7174  192.30.3.248:42561 TCP   CONNECT 132103
>
> First, if by definition this file relates to device mthca0, why printing mthca0 in the output?

You are right. It is not necessary. I'll remove it.

> second, if a listener is binded to mthca0 then its src address can't be 0.0.0.0 - correct?
>
> I'd like to see how we come up with a solution with somehow less space-ing between columns,
> maybe take a look at the netperf code, e.g that generates output as below.

I reserved enough space for IPV6 addresses. The output below is good for IPV4 addresses only.
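
One possible middle ground between fixed IPv6-wide columns and tighter netstat-style output is to size each column from the rows actually being printed. The following is only a sketch of that idea, written in Python for brevity rather than kernel C. It is not code from the patch: the helper name `format_rows` is hypothetical, the sample rows are copied from the example in this message, and the DEVICE column is left out on the assumption that it will be removed as agreed above.

```python
# Sketch only: pad each column to its widest cell instead of reserving
# worst-case IPv6 width. `format_rows` is a hypothetical helper name.


def format_rows(header, rows, gap=2):
    """Return text lines with every column padded to its widest cell plus `gap`."""
    table = [header] + list(rows)
    widths = [max(len(str(row[i])) for row in table) for i in range(len(header))]
    lines = []
    for row in table:
        cells = [str(cell).ljust(widths[i] + gap) for i, cell in enumerate(row)]
        lines.append("".join(cells).rstrip())
    return lines


header = ("TYPE", "PORT", "NET_DEV", "SRC_ADDR", "DST_ADDR", "SPACE", "STATE", "QP_NUM")
rows = [
    ("IB", 1, "ib0", "192.30.3.249:46079", "192.30.3.248:7174", "TCP", "CONNECT", 132102),
    ("IB", 1, "ib0", "192.30.3.249:7174", "192.30.3.248:42561", "TCP", "CONNECT", 132103),
]

for line in format_rows(header, rows):
    print(line)
```

The cost of this approach is that all rows must be known before the first one is printed, so a kernel implementation would need two passes over the ID list; whether that is acceptable for a debugfs file is a separate question.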
>
> Applying something like the -p option of netperf would be cool, for that
> end you can have the rdma_cm keep track of rdma_create_id caller's current->pid

PID looks like a nice addition to the information.

>
>> # netstat -natup
>> Active Internet connections (servers and established)
>> Proto Recv-Q Send-Q Local Address            Foreign Address            State       PID/Program name
>> tcp        0      0 127.0.0.1:2208           0.0.0.0:*                  LISTEN      5979/hpiod
>> tcp        0      0 0.0.0.0:802              0.0.0.0:*                  LISTEN      5708/rpc.statd
>> tcp        0      0 0.0.0.0:111              0.0.0.0:*                  LISTEN      5667/portmap
>> tcp        0      0 0.0.0.0:58385            0.0.0.0:*                  LISTEN      -
>> tcp        0      0 127.0.0.1:631            0.0.0.0:*                  LISTEN      6020/cupsd
>> tcp        0      0 127.0.0.1:25             0.0.0.0:*                  LISTEN      6045/sendmail: acce
>> tcp        0      0 127.0.0.1:2207           0.0.0.0:*                  LISTEN      5984/python
>> tcp        0      0 172.30.49.1:47609        172.30.49.2:1000           ESTABLISHED 18567/qperf
>> tcp        0      0 172.30.49.1:58935        172.30.49.2:22             ESTABLISHED 18123/ssh
>> tcp        0      0 :::1003                  :::*                       LISTEN      18562/qperf
>> tcp        0      0 :::22                    :::*                       LISTEN      6004/sshd
>> tcp        0      0 ::ffff:172.30.49.1:22    ::ffff:172.25.5.138:60437  ESTABLISHED 17734/0
>> tcp        0      0 ::ffff:172.30.49.1:1003  ::ffff:172.30.49.3:45282   ESTABLISHED 18563/qperf
>> tcp        0      0 ::ffff:10.10.5.157:22    ::ffff:10.10.0.90:58802    ESTABLISHED 18226/2
>> tcp        0      0 ::ffff:172.30.49.1:22    ::ffff:172.25.5.138:56035  ESTABLISHED 11932/1
>

From ogerlitz at voltaire.com Mon Apr 27 06:15:22 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 27 Apr 2009 16:15:22 +0300
Subject: [ofa-general] Re: [PATCH v2] rdma_cm: Add debugfs entries to monitor rdma_cm connections
In-Reply-To: <49F5AED6.4070208@Voltaire.COM>
References: <49F05AAE.4020606@Voltaire.COM> <90A07B5E8FAD49C187EAE583A976A3CD@amr.corp.intel.com> <49F42D40.5000200@Voltaire.COM> <49F5A2EC.3050807@Voltaire.com> <49F5AED6.4070208@Voltaire.COM>
Message-ID: <49F5AFEA.5090003@voltaire.com>

Moni Shoua wrote:
> A listener that listens on all devices will appear on all devices.

This sounds like something that can cause confusion when someone is looking at multiple devices... but maybe we can live with that?

>> I'd like to see how we come up with a solution with somehow less space-ing between columns,
>> maybe take a look at the netperf code, e.g that generates output as below.

> I reserved enough space for IPV6 addresses. The output below is good for IPV4 addresses only.

I understand that, but let's try to think if/how this can be done better, specifically if we want to add more params such as IB L4/L3/L2 (QPN/PKEY, GUID, LID/SL) info to the output.

Or.

From nicolas.morey-chaisemartin at ext.bull.net Mon Apr 27 06:47:31 2009
From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey-Chaisemartin)
Date: Mon, 27 Apr 2009 15:47:31 +0200
Subject: [ofa-general] [PATCH] ibutils: Fixed dependency of ibdmsh on libibdmcom.la
Message-ID: <49F5B773.5040004@ext.bull.net>

ibdmsh has a dependency on libibdmcom.la which was not declared in Makefile.am.
Compilation order makes it transparent in most cases, but compilation fails when using the -j flag.
This patch fixes the issue.

Signed-off-by: Nicolas Morey-Chaisemartin
---
 ibdm/ibdm/Makefile.am | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/ibdm/ibdm/Makefile.am b/ibdm/ibdm/Makefile.am
index 1c57b3b..ba5789a 100644
--- a/ibdm/ibdm/Makefile.am
+++ b/ibdm/ibdm/Makefile.am
@@ -88,6 +88,7 @@ bin_PROGRAMS = ibdmsh
 ibdmsh_SOURCES = ibdmsh_wrap.cpp
 ibdmsh_LDADD = -libdmcom $(TCL_LIBS)
 ibdmsh_LDFLAGS = -static -Wl,-rpath -Wl,$(TCL_PREFIX)/lib
+ibdmsh_DEPENDENCIES=$(lib_LTLIBRARIES)
 
 $(srcdir)/Fabric.cpp: $(srcdir)/git_version.h
 
--
1.6.2.GIT

From Pramod.Gunjikar at Sun.COM Mon Apr 27 06:58:26 2009
From: Pramod.Gunjikar at Sun.COM (Pramod Gunjikar)
Date: Mon, 27 Apr 2009 19:28:26 +0530
Subject: [ofa-general] OFED for Solaris
In-Reply-To: <31BCB8E2EBCE02479FF08001B81F16C50254045A@blr-mrd-msg.wipro.com>
References: <31BCB8E2EBCE02479FF08001B81F16C5025403AD@blr-mrd-msg.wipro.com> <49F5A876.7080605@mellanox.co.il> <31BCB8E2EBCE02479FF08001B81F16C50254045A@blr-mrd-msg.wipro.com>
Message-ID: <49F5BA02.8000506@sun.com>

Hello Jigar,

An initial version of OFED user library support on Solaris is available on :
http://www.sun.com/download/index.jsp?cat=Hardware%20Drivers&tab=3&subcat=InfiniBand

Please download IB Updates 3 listed in this URL, This link contains both installation instructions and other information you will need to know before running this product.

Let me know if you have any queries or run into issues

Thanks
Pramod

> Hi Tziporet,
>
> Thanks for answer, but m not able to find any details for the same. The
> software stack is installed on the OS but it is just not working, or it
> is not giving me some basic command at all. Have any-body else has
> experience on the same?

From hnrose at comcast.net Mon Apr 27 06:53:30 2009
From: hnrose at comcast.net (Hal Rosenstock)
Date: Mon, 27 Apr 2009 09:53:30 -0400
Subject: [ofa-general] [PATCH] opensm: Add SuperMicro to list of recognized vendors
Message-ID: <20090427135330.GA24559@comcast.net>

Signed-off-by: Hal Rosenstock
---
diff --git a/opensm/include/opensm/osm_base.h b/opensm/include/opensm/osm_base.h
index e973a70..bca1133 100644
--- a/opensm/include/opensm/osm_base.h
+++ b/opensm/include/opensm/osm_base.h
@@ -872,6 +872,7 @@ typedef enum _osm_sm_signal {
 #define OSM_VENDOR_ID_XSIGO 0x001397
 #define OSM_VENDOR_ID_HP2 0x0018FE
 #define OSM_VENDOR_ID_DELL 0x00188B
+#define OSM_VENDOR_ID_SUPERMICRO 0x003048
 
 /**********/
 
diff --git a/opensm/opensm/osm_helper.c b/opensm/opensm/osm_helper.c
index ae5a703..0123edc 100644
--- a/opensm/opensm/osm_helper.c
+++ b/opensm/opensm/osm_helper.c
@@ -2225,6 +2225,7 @@ const char *osm_get_manufacturer_str(IN uint64_t const guid_ho)
 	static const char *leafntwks_str = "3LeafNtwks";
 	static const char *xsigo_str = "Xsigo";
 	static const char *dell_str = "Dell";
+	static const char *supermicro_str = "SuperMicro";
 	static const char *unknown_str = "Unknown";
 
 	switch ((uint32_t) (guid_ho >> (5 * 8))) {
@@ -2278,6 +2279,8 @@ const char *osm_get_manufacturer_str(IN uint64_t const guid_ho)
 		return (xsigo_str);
 	case OSM_VENDOR_ID_DELL:
 		return (dell_str);
+	case OSM_VENDOR_ID_SUPERMICRO:
+		return (supermicro_str);
 	default:
 		return (unknown_str);
 	}

From celine.bourde at ext.bull.net Mon Apr 27 07:05:33 2009
From: celine.bourde at ext.bull.net (Celine Bourde)
Date: Mon, 27 Apr 2009 16:05:33 +0200
Subject: [Fwd: Re: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition]]
In-Reply-To:
 <49f5a9c8.0e35640a.61a5.7852@mx.google.com>
References: <49F02280.7010005@ext.bull.net> <49F07710.3070002@opengridcomputing.com> <49F19ECE.9080007@ext.bull.net> <49F1CF6C.3090703@opengridcomputing.com> <49F1FCEF.3030305@mellanox.com> <49F207A0.6090507@mellanox.com> <49F58F6E.7080400@ext.bull.net> <49f5a9c8.0e35640a.61a5.7852@mx.google.com>
Message-ID: <49F5BBAD.9000900@ext.bull.net>

We have still the same problem, even changing the registration method.

mount doesn't reply and this is the output of dmesg on client:

rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 ird 16
rpcrdma: connection to 192.168.0.215:2050 closed (-103)
rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 ird 16
rpcrdma: connection to 192.168.0.215:2050 closed (-103)
ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:0001, status -22
rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 ird 16
rpcrdma: connection to 192.168.0.215:2050 closed (-103)

I have still another doubt: if the firmware is the problem, why is NFS
RDMA working with a kernel 2.6.27.10 and without OFED 1.4 with these
same cards??

Thanks,

Céline Bourde.

From jackm at dev.mellanox.co.il Mon Apr 27 07:47:43 2009
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Mon, 27 Apr 2009 17:47:43 +0300
Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules?
In-Reply-To: <49F58D00.30908@Voltaire.COM>
References: <200904270918.35281.jackm@dev.mellanox.co.il> <49F58D00.30908@Voltaire.COM>
Message-ID: <200904271747.43905.jackm@dev.mellanox.co.il>

On Monday 27 April 2009 13:46, Moni Shoua wrote:
> So, Is there an easy way for upstream kernel users that want user space functionality?
>
Why can't they just install OFED? This affects ONLY the infiniband modules, and has undergone extensive QA on lots of platforms.

- Jack

From jrlang at uwyo.edu Mon Apr 27 07:46:02 2009
From: jrlang at uwyo.edu (jeffrey Lang)
Date: Mon, 27 Apr 2009 08:46:02 -0600
Subject: [Fwd: Re: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition]]
In-Reply-To: <49F5BBAD.9000900@ext.bull.net>
References: <49F02280.7010005@ext.bull.net> <49F07710.3070002@opengridcomputing.com> <49F19ECE.9080007@ext.bull.net> <49F1CF6C.3090703@opengridcomputing.com> <49F1FCEF.3030305@mellanox.com> <49F207A0.6090507@mellanox.com> <49F58F6E.7080400@ext.bull.net> <49f5a9c8.0e35640a.61a5.7852@mx.google.com> <49F5BBAD.9000900@ext.bull.net>
Message-ID: <49F5C52A.1080107@uwyo.edu>

I recently was having the "ib0: multicast join failed" issue. Once I upgraded the firmware in my switch everything started working again. I would give the firmware upgrade a try.

jeff

From tmtalpey at gmail.com Mon Apr 27 07:50:06 2009
From: tmtalpey at gmail.com (Tom Talpey)
Date: Mon, 27 Apr 2009 10:50:06 -0400
Subject: [Fwd: Re: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition]]
In-Reply-To: <49F5BBAD.9000900@ext.bull.net>
References: <49F02280.7010005@ext.bull.net> <49F07710.3070002@opengridcomputing.com> <49F19ECE.9080007@ext.bull.net> <49F1CF6C.3090703@opengridcomputing.com> <49F1FCEF.3030305@mellanox.com> <49F207A0.6090507@mellanox.com> <49F58F6E.7080400@ext.bull.net> <49f5a9c8.0e35640a.61a5.7852@mx.google.com> <49F5BBAD.9000900@ext.bull.net>
Message-ID: <49f5c62b.1d1e640a.480a.ffffbccf@mx.google.com>

At 10:05 AM 4/27/2009, Celine Bourde wrote:
>We have still the same problem, even changing the registration method.
>
>mount doesn't reply and this is the output of dmesg on client:
>
>rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 ird 16
>rpcrdma: connection to 192.168.0.215:2050 closed (-103)
>rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 ird 16
>rpcrdma: connection to 192.168.0.215:2050 closed (-103)
>ib0: multicast join failed for
>ff12:401b:ffff:0000:0000:0000:0000:0001, status -22
>rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 ird 16
>rpcrdma: connection to 192.168.0.215:2050 closed (-103)

I need to see the log on the server. Errno 103 is ECONNABORTED which
means the connection was closed spontaneously. Let's look for a server
artifact.

>
>I have still another doubt: if the firmware is the problem, why is NFS
>RDMA working with a kernel 2.6.27.10 and without OFED 1.4 with these
>same cards??

There were a number of changes in the 2.6.28 cycle, especially on the server. So it's quite possible that 2.6.27, without the changes, would behave differently.
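
For anyone repeating the errno lookup used earlier in this message, here is a minimal sketch (not from any poster in this thread) that pulls the code out of an rpcrdma "closed (-N)" log line and maps it to its symbolic name. The function name `decode_closed` and the regular expression are illustrative assumptions based only on the log lines quoted here. Errno numbering is platform-specific, so the name returned for 103 is only meaningful when the script runs on Linux, like the hosts in this thread.

```python
import errno
import os
import re

# Matches the client log lines quoted in this thread, for example:
#   rpcrdma: connection to 192.168.0.215:2050 closed (-103)
CLOSED_RE = re.compile(r"rpcrdma: connection to (\S+) closed \((-?\d+)\)")


def decode_closed(line):
    """Return (peer, errno name, description) for a 'closed (-N)' line, else None."""
    match = CLOSED_RE.search(line)
    if match is None:
        return None
    code = abs(int(match.group(2)))  # the kernel logs the errno as a negative value
    name = errno.errorcode.get(code, "UNKNOWN")
    return match.group(1), name, os.strerror(code)


print(decode_closed("rpcrdma: connection to 192.168.0.215:2050 closed (-103)"))
```

The same helper returns None for unrelated lines such as the "ib0: multicast join failed ... status -22" message, which has to be decoded separately.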
Have you tried this with 2.6.29, or with different cards?

Tom.

From monis at Voltaire.COM Mon Apr 27 08:28:00 2009
From: monis at Voltaire.COM (Moni Shoua)
Date: Mon, 27
Apr 2009 18:28:00 +0300 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules? In-Reply-To: <200904271747.43905.jackm@dev.mellanox.co.il> References: <200904270918.35281.jackm@dev.mellanox.co.il> <49F58D00.30908@Voltaire.COM> <200904271747.43905.jackm@dev.mellanox.co.il> Message-ID: <49F5CF00.4030309@Voltaire.COM> Jack Morgenstein wrote: > On Monday 27 April 2009 13:46, Moni Shoua wrote: >> So, Is there an easy way for upstream kernel users that want user space functionality? >> > Why can't they just install OFED? This affects ONLY the infiniband modules, and has undergone > extensive QA on lots of platforms. > > - Jack I agree that the common case is using OFED, but from time to time there is a need to work with kernels that doesn't have backports in OFED. I think that for such cases we need to have an answer. What I'm asking for here is an advice for real cases that I have met. Maybe we can have the same set of US libraries somewhere for download for upstream kernel users What do you think? From worleys at gmail.com Mon Apr 27 08:54:11 2009 From: worleys at gmail.com (Chris Worley) Date: Mon, 27 Apr 2009 09:54:11 -0600 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules? In-Reply-To: References: <200904261431.18620.jackm@dev.mellanox.co.il> <20090426125827.GA6513@sk> <200904270927.40449.jackm@dev.mellanox.co.il> Message-ID: On Mon, Apr 27, 2009 at 5:03 AM, Bart Van Assche wrote: > On Mon, Apr 27, 2009 at 8:27 AM, Jack Morgenstein > wrote: >> The OFED distributions may contain features that the mainstream kernels and libraries do not support. >> These features frequently require changes in the Infiniband kernel modules.  Such changes are in the form >> of kernel patches which are applied to the base mainstream kernel on which the OFED release is based. 
>> A lag between the mainstream kernel and the OFED kernel is unavoidable, since the new features are first >> released in the OFED distributions -- and later, gradually (and hopefully), these features make there way >> into the upstream kernel. > > I don't doubt that there is a good reason why new features go in the > OFED distribution first and later in the mainstream Linux kernel. My opinion is: IB is still just too bleeding edge, even for the vanilla Linux kernel. Maybe "Upstream First" is the measure of IB achieving stability. SRP (specifically the SCST target code) is my first case in using IB where I've not been able to start with the latest OFED (or IBGD) stable release, as OFED is unsupported by the SRP target code, and had to start with a distro's IB version to get a working SRP target (of which Ubuntu 8.10 provided the only stable SRP target distro for my configuration). Chris > But > it's not clear to me why this process has been chosen. There is wide > agreement in the Linux kernel community that new kernel code should go > first in the mainstream Linux kernel and from there to the various > Linux distributions, and not the other way around. This is called the > "upstream first" policy. One of the most highly regarded kernel > maintainers (James Bottomley) wrote the following about the "upstream > first" policy: > > * Major distributions have agreed not to incorporate features or > drivers unless they are on “upstream track” > for the vanilla Linux Kernel >  - Obviously there’s some flexibility in interpretation of this for > their best customers > * Primary reason is that it keeps the distribution kernel code and the > vanilla kernel code as close as possible, so >  - Maintenance is reduced: the distro can file a bug with the > upstream maintainer if there’s a problem. >  - Testing is enhanced: users of all distributions are testing the same code >  - Code Review burden is greatly reduced: Can rely on upstream > maintainers to review and accept. 
> > More information about the "upstream first" policy can be found here: > * James Bottomley, Hacking the Linux Kernel for Fun and Profit, 5 > April 2008, http://www.flourishconf.com/flourish2008/images/downloads/flourish2008-jamesbottomley-hackingthelinuxkernel.pdf. > * Jonathan Corbet, A Guide to the Linux Kernel Development Process, > 2008, http://lwn.net/talks/lfeu2008/devproc/index.html. > > Bart. > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From Brendan.Doyle at Sun.COM Mon Apr 27 08:58:21 2009 From: Brendan.Doyle at Sun.COM (Brendan Doyle) Date: Mon, 27 Apr 2009 16:58:21 +0100 Subject: [ofa-general] OFED for Solaris Message-ID: <49F5D61D.4090105@sun.com> Just some further detail. This is based on OFED 1.3. libibverbs is ported, but librdmacm just has UD and multicast features, RC will be released on the same website soon. The source (Sun mods to OFED code required to make this work are still internal to Sun), but we would like to contribute them to the main OFA base. We are currently moving to OFED 1.5, and also undergoing a Solaris footprint reduction exercise, after which we will have a patch for review. Meantime, binary only support can be obtained from the Sun download website referenced. Stay tuned Brendan > Hello Jigar, An initial version of OFED user library support on > Solaris is available on : > http://www.sun.com/download/index.jsp?cat=Hardware%20Drivers&tab=3&subcat=InfiniBand > Please download IB Updates 3 listed in this URL, This link contains > both installation instructions and other information you will need to > know before running this product. Let me know if you have any queries > or run into issues Thanks Pramod >> > Hi Tziporet, >> > >> > Thanks for answer, but m not able to find any details for the same. 
The >> > software stack is installed on the OS but it is just not working, or it >> > is not giving me some basic command at all. Have any-body else has >> > experience on the same? >> > >> > -- >> > Regards, >> > Jigar Halani >> > >> > Spirit of Wipro: Intensity to Win | Act with Sensitivity | Unyielding >> > Integrity >> > >> > >> > -----Original Message----- >> > From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il] >> > Sent: Monday, April 27, 2009 6:14 PM >> > To: Jigar Halani (WI01 - TIS - Services) >> > Cc: general at lists.openfabrics.org >> > Subject: Re: [ofa-general] OFED for Solaris >> > >> > jigar.halani at wipro.com wrote: >> > >> >>> >> I am trying to install OFED drivers on Solaris 10. I sow the release >>> >> not, but Solaris not in the list at all L >>> >> >>> >> >>> >> >>> >> Could any one please let me know, how to install the same on Solaris? >>> >> >>> >> > >> > >> >>> >> Solaris also comes with pre-installed packages, which is drives for >>> >> HCA, but not able to get the connectivity between servers and switch. >>> >> >>> >> >>> >> >>> >> >>> >> > Solaris has their own IB SW stack - provided by Sun. >> > They do support OFED user space verbs as far as I know >> > You should contact Sun for more info >> > >> > Tziporet >> > >> > >> > Please do not print this email unless it is absolutely necessary. >> > >> > The information contained in this electronic message and any attachments to this message are intended for the exclusive use of the addressee(s) and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately and destroy all copies of this message and any attachments. >> > >> > WARNING: Computer viruses can be transmitted via email. The recipient should check this email and any attachments for the presence of viruses. 
The company accepts no liability for any damage caused by any virus transmitted by this email. >> > >> > www.wipro.com >> > _______________________________________________ >> > general mailing list >> > general at lists.openfabrics.org >> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> > >> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general >> > >> > > From jgunthorpe at obsidianresearch.com Mon Apr 27 09:23:49 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Mon, 27 Apr 2009 10:23:49 -0600 Subject: [ofa-general] Re: [PATCH v2] rdma_cm: Add debugfs entries to monitor rdma_cm connections In-Reply-To: <49F5AFEA.5090003@voltaire.com> References: <49F05AAE.4020606@Voltaire.COM> <90A07B5E8FAD49C187EAE583A976A3CD@amr.corp.intel.com> <49F42D40.5000200@Voltaire.COM> <49F5A2EC.3050807@Voltaire.com> <49F5AED6.4070208@Voltaire.COM> <49F5AFEA.5090003@voltaire.com> Message-ID: <20090427162349.GI4431@obsidianresearch.com> On Mon, Apr 27, 2009 at 04:15:22PM +0300, Or Gerlitz wrote: > Moni Shoua wrote: >> A listener that listens an all will appear on all devices > This sounds like something that can cause confusion when someone is looking > on multiple devices... but maybe we can live with that? > >>> I'd like to see how we come up with a solution with somehow less >>> space-ing between columns, maybe take a look at the netperf code, e.g >>> that generates output as below. >> I reserved enough space for IPV6 addresses. The output below is good for >> IPV4 addresses only. > I understand that, but lets try to think if/how this can be done better, > specifically if we want to add more params such as IB L4/L3/L2 (QPN/PKEY, > GUID, LID/SL) info to the output. Sounds like you should just bite the bullet and implement this with netlink. All of Or's concerns are easially addressed that way. Including all the IB path information, and APM information is definately worthwhile. 
BTW, including a PID is not ideal; you should include enough information to figure out the pid(s) from /proc/xx/fd, and vice versa. Jason From jgunthorpe at obsidianresearch.com Mon Apr 27 09:55:55 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Mon, 27 Apr 2009 10:55:55 -0600 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules? In-Reply-To: <200904270927.40449.jackm@dev.mellanox.co.il> References: <200904261431.18620.jackm@dev.mellanox.co.il> <20090426125827.GA6513@sk> <200904270927.40449.jackm@dev.mellanox.co.il> Message-ID: <20090427165555.GJ4431@obsidianresearch.com> On Mon, Apr 27, 2009 at 09:27:40AM +0300, Jack Morgenstein wrote: > > In general such requirement seems fundamentally bad for me. OFED goal is > > to provide support for IB and iWARP, and not to develop its own linux > > kernel. > The OFED distributions may contain features that the mainstream > kernels and libraries do not support. These features frequently > require changes in the Infiniband kernel modules. Such changes are > in the form of kernel patches which are applied to the base > mainstream kernel on which the OFED release is based. A lag between > the mainstream kernel and the OFED kernel is unavoidable, since the > new features are first released in the OFED distributions -- and > later, gradually (and hopefully), these features make there way into > the upstream kernel. Well, as others have said, not following the upstream-first philosophy is 'not the Linux way' - but fundamentally, the wrong things are being QA'd :( OFED tests the past - backports to old distributions and a random non-upstream collection of patches on top of that. That is fine for end users, but.. It seems almost no testing is done on the future - pristine Linux kernel and pristine user space libraries (and combinations therein).
If the stock Linus kernel and stock IB support libraries don't work - what hope is there to QA the huge matrix that is OFED - especially if you give up on the "include random patches" idea? So we have this treadmill where OFED continues to exist because it is the only thing that works, and we can't be rid of it. But I know this is hashed over every year at Sonoma ... Jason From vuhuong at mellanox.com Mon Apr 27 10:30:32 2009 From: vuhuong at mellanox.com (Vu Pham) Date: Mon, 27 Apr 2009 10:30:32 -0700 Subject: [Fwd: Re: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition]] In-Reply-To: <49f5a9c8.0e35640a.61a5.7852@mx.google.com> References: <49F02280.7010005@ext.bull.net> <49F07710.3070002@opengridcomputing.com> <49F19ECE.9080007@ext.bull.net> <49F1CF6C.3090703@opengridcomputing.com> <49F1FCEF.3030305@mellanox.com> <49F207A0.6090507@mellanox.com> <49F58F6E.7080400@ext.bull.net> <49f5a9c8.0e35640a.61a5.7852@mx.google.com> Message-ID: <49F5EBB8.30808@mellanox.com> Tom Talpey wrote: > At 06:56 AM 4/27/2009, Celine Bourde wrote: > >> Thanks for the explanation. >> Let me know if you have additional information. >> >> We have a contact at Mellanox. I will contact him. >> >> Thanks, >> >> Céline. >> >> Vu Pham wrote: >> >>> Celine, >>> >>> I'm seeing mlx4 in the log so it is connectX. >>> >>> nfsrdma does not work with any official connectX' fw release 2.6.0 >>> because of fast registering work request problems between nfsrdma and >>> the firmware. >>> > > There is a very simple workaround if you don't have the latest mlx4 firmware. > > Just set the client to use the all-physical memory registration mode. This will > avoid making unsupported reregistration requests, which the firmware advertised. > > Before mounting, enter (as root) > > sysctl -w sunrpc.rdma_memreg_strategy=6 > > The client should work properly after this. > > If you do have access to the fixed firmware, I recommend using the default > setting (5) as it provides greater safety on the client.
> > Tom. > > This work around only work for client side. On the server side we don't have option to switch the memory option -vu From bart.vanassche at gmail.com Mon Apr 27 10:31:14 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Mon, 27 Apr 2009 19:31:14 +0200 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules? In-Reply-To: <200904271747.43905.jackm@dev.mellanox.co.il> References: <200904270918.35281.jackm@dev.mellanox.co.il> <49F58D00.30908@Voltaire.COM> <200904271747.43905.jackm@dev.mellanox.co.il> Message-ID: On Mon, Apr 27, 2009 at 4:47 PM, Jack Morgenstein wrote: > On Monday 27 April 2009 13:46, Moni Shoua wrote: >> So, Is there an easy way for upstream kernel users that want user space functionality? >> > Why can't they just install OFED?  This affects ONLY the infiniband modules, and has undergone > extensive QA on lots of platforms. I'm not sure that replacing the distro-provided InfiniBand components by the kernel drivers included in OFED is always a good idea. The mainstream kernel namely contains some InfiniBand patches that are not present in OFED. As an example, commit 233e70f4228e78eb2f80dc6650f65d3ae3dbf17c was applied to Linus' tree on October 19, 2008. This patch simplifies the file API such that it is no longer possible that kernel code causes a memory leak by forgetting to evict a file from fasync lists. I could not find any trace of this patch in the OFED distribution -- not even in OFED-1.4.1-20090427-0600. See also: * http://lkml.org/lkml/2008/10/31/310 * http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.29.y.git;a=commitdiff;h=233e70f4228e78eb2f80dc6650f65d3ae3dbf17c Bart. 
From vuhuong at mellanox.com Mon Apr 27 10:33:55 2009 From: vuhuong at mellanox.com (Vu Pham) Date: Mon, 27 Apr 2009 10:33:55 -0700 Subject: [Fwd: Re: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition]] In-Reply-To: <49F5BBAD.9000900@ext.bull.net> References: <49F02280.7010005@ext.bull.net> <49F07710.3070002@opengridcomputing.com> <49F19ECE.9080007@ext.bull.net> <49F1CF6C.3090703@opengridcomputing.com> <49F1FCEF.3030305@mellanox.com> <49F207A0.6090507@mellanox.com> <49F58F6E.7080400@ext.bull.net> <49f5a9c8.0e35640a.61a5.7852@mx.google.com> <49F5BBAD.9000900@ext.bull.net> Message-ID: <49F5EC83.6050608@mellanox.com> Celine Bourde wrote: > We have still the same problem, even changing the registration method. > > mount doesn't reply and this is the output of dmesg on client: > > rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 > ird 16 > rpcrdma: connection to 192.168.0.215:2050 closed (-103) > rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 > ird 16 > rpcrdma: connection to 192.168.0.215:2050 closed (-103) > ib0: multicast join failed for > ff12:401b:ffff:0000:0000:0000:0000:0001, status -22 > rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 > ird 16 > rpcrdma: connection to 192.168.0.215:2050 closed (-103) > > I have still another doubt: if the firmware is the problem, why is NFS > RDMA working with a kernel 2.6.27.10 and without OFED 1.4 with these > same cards?? On 2.6.27.10 nfsrdma does not use fast registration work requests; therefore, it works well with ConnectX. From 2.6.28 onward, nfsrdma started using fast registration work requests, and that code was committed without being verified against ConnectX. I'm looking into those glitches/issues and trying to resolve them now. -vu > > Thanks, > > Céline Bourde. > > Tom Talpey wrote: >> At 06:56 AM 4/27/2009, Celine Bourde wrote: >> >>> Thanks for the explanation. >>> Let me know if you have additional information. >>> >>> We have a contact at Mellanox.
I will contact him. >>> >>> Thanks, >>> >>> Céline. >>> >>> Vu Pham wrote: >>> >>>> Celine, >>>> >>>> I'm seeing mlx4 in the log so it is connectX. >>>> >>>> nfsrdma does not work with any official connectX' fw release 2.6.0 >>>> because of fast registering work request problems between nfsrdma >>>> and the firmware. >>>> >> >> There is a very simple workaround if you don't have the latest mlx4 >> firmware. >> >> Just set the client to use the all-physical memory registration mode. >> This will avoid making unsupported reregistration requests, which the firmware >> advertised. >> >> Before mounting, enter (as root) >> >> sysctl -w sunrpc.rdma_memreg_strategy=6 >> >> The client should work properly after this. >> >> If you do have access to the fixed firmware, I recommend using the >> default setting (5) as it provides greater safety on the client. >> >> Tom. >> >> >>>> We are currently debugging/fixing those problems. >>>> >>>> Do you have direct contact with Mellanox field application >>>> engineer? Please contact him/her. >>>> If not I can send you a contact on private channel. >>>> >>>> thanks, >>>> -vu >>>> >>>> >>>>> Hi Celine, >>>>> >>>>> What HCA do you have on your system? Is it ConnectX? If yes, what >>>>> is its firmware version? >>>>> >>>>> -vu >>>>> >>>>> >>>>>> Hey Celine, >>>>>> >>>>>> Thanks for gathering all this info! So the rdma connections work >>>>>> fine with everything _but_ nfsrdma. And errno 103 indicates the >>>>>> connection was aborted, maybe by the server (since no failures >>>>>> are logged by the client). >>>>>> >>>>>> >>>>>> More below: >>>>>> >>>>>> >>>>>> Celine Bourde wrote: >>>>>> >>>>>>> Hi Steve, >>>>>>> >>>>>>> This email summarizes the situation: >>>>>>> >>>>>>> Standard mount -> OK >>>>>>> --------------------- >>>>>>> >>>>>>> [root at twind ~]# mount -o rw 192.168.0.215:/vol0 /mnt/ >>>>>>> Command works fine.
>>>>>>> >>>>>>> rdma mount -> KO >>>>>>> ----------------- >>>>>>> >>>>>>> [root at twind ~]# mount -o rdma,port=2050 192.168.0.215:/vol0 /mnt/ >>>>>>> Command blocks ! I should perform Ctr+C to kill process. >>>>>>> >>>>>>> or >>>>>>> >>>>>>> [root at twind ofa_kernel-1.4.1]# strace mount.nfs >>>>>>> 192.168.0.215:/vol0 /mnt/ -o rdma,port=2050 >>>>>>> [..] >>>>>>> fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 >>>>>>> connect(3, {sa_family=AF_INET, sin_port=htons(610), >>>>>>> sin_addr=inet_addr("127.0.0.1")}, 16) = 0 >>>>>>> fcntl(3, F_SETFL, O_RDWR) = 0 >>>>>>> sendto(3, >>>>>>> >>> "-3\245\357\0\0\0\0\0\0\0\2\0\1\206\270\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0"..., >>> >>>>>>> 40, 0, {sa_family=AF_INET, sin_port=htons(610), >>>>>>> sin_addr=inet_addr("127.0.0.1")}, 16) = 40 >>>>>>> poll([{fd=3, events=POLLIN}], 1, 3000) = 1 ([{fd=3, >>>>>>> revents=POLLIN}]) >>>>>>> recvfrom(3, >>>>>>> "-3\245\357\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 8800, >>>>>>> MSG_DONTWAIT, {sa_family=AF_INET, sin_port=htons(610), >>>>>>> sin_addr=inet_addr("127.0.0.1")}, [16]) = 24 >>>>>>> close(3) = 0 >>>>>>> mount("192.168.0.215:/vol0", "/mnt", "nfs", 0, >>>>>>> "rdma,port=2050,addr=192.168.0.215" >>>>>>> ..same problem >>>>>>> >>>>>>> [root at twind tmp]# dmesg >>>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 >>>>>>> slots 32 ird 16 >>>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >>>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 >>>>>>> slots 32 ird 16 >>>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >>>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 >>>>>>> slots 32 ird 16 >>>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >>>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 >>>>>>> slots 32 ird 16 >>>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >>>>>>> >>>>>>> >>>>>>> >>>>>> Is there anything logged on the server side? 
>>>>>> >>>>>> Also, can you try this again, but on both systems do this before >>>>>> attempting the mount: >>>>>> >>>>>> echo 32768 > /proc/sys/sunrpc/rpc_debug >>>>>> >>>>>> This will enable all the rpc trace points and add a bunch of >>>>>> logging to /var/log/messages. >>>>>> Maybe that will show us something. It think the server is >>>>>> aborting the connection for some reason. >>>>>> >>>>>> Steve. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> general mailing list >>>>>> general at lists.openfabrics.org >>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>>>> >>>>>> To unsubscribe, please visit >>>>>> http://openib.org/mailman/listinfo/openib-general >>>>>> >>>>> _______________________________________________ >>>>> general mailing list >>>>> general at lists.openfabrics.org >>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>>> >>>>> To unsubscribe, please visit >>>>> http://openib.org/mailman/listinfo/openib-general >>>>> >>>> _______________________________________________ >>>> general mailing list >>>> general at lists.openfabrics.org >>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>> >>>> To unsubscribe, please visit >>>> http://openib.org/mailman/listinfo/openib-general >>>> >>>> >>>> >>> _______________________________________________ >>> general mailing list >>> general at lists.openfabrics.org >>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>> >>> To unsubscribe, please visit >>> http://openib.org/mailman/listinfo/openib-general >>> >>> >> >> >> >> > From vst at vlnb.net Mon Apr 27 10:51:15 2009 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Mon, 27 Apr 2009 21:51:15 +0400 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules? 
In-Reply-To: References: <200904261431.18620.jackm@dev.mellanox.co.il> <20090426125827.GA6513@sk> <200904270927.40449.jackm@dev.mellanox.co.il> Message-ID: <49F5F093.5070408@vlnb.net> Chris Worley, on 04/27/2009 07:54 PM wrote: > On Mon, Apr 27, 2009 at 5:03 AM, Bart Van Assche > wrote: >> On Mon, Apr 27, 2009 at 8:27 AM, Jack Morgenstein >> wrote: >>> The OFED distributions may contain features that the mainstream kernels and libraries do not support. >>> These features frequently require changes in the Infiniband kernel modules. Such changes are in the form >>> of kernel patches which are applied to the base mainstream kernel on which the OFED release is based. >>> A lag between the mainstream kernel and the OFED kernel is unavoidable, since the new features are first >>> released in the OFED distributions -- and later, gradually (and hopefully), these features make there way >>> into the upstream kernel. >> I don't doubt that there is a good reason why new features go in the >> OFED distribution first and later in the mainstream Linux kernel. > > My opinion is: IB is still just too bleeding edge, even for the > vanilla Linux kernel. > > Maybe "Upstream First" is the measure of IB achieving stability. > > SRP (specifically the SCST target code) is my first case in using IB > where I've not been able to start with the latest OFED (or IBGD) > stable release, as OFED is unsupported by the SRP target code, and had > to start with a distro's IB version to get a working SRP target (of > which Ubuntu 8.10 provided the only stable SRP target distro for my > configuration). I think, to find out who's guilty, OFED or SRP target driver, you should simply try the latest SCST/SRP driver from the SCST SVN trunk with the known working OFED. Only make sure you don't have again mixed up older and new SCST headers. 
Vlad From vuhuong at mellanox.com Mon Apr 27 11:01:58 2009 From: vuhuong at mellanox.com (Vu Pham) Date: Mon, 27 Apr 2009 11:01:58 -0700 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules? In-Reply-To: <49F5F093.5070408@vlnb.net> References: <200904261431.18620.jackm@dev.mellanox.co.il> <20090426125827.GA6513@sk> <200904270927.40449.jackm@dev.mellanox.co.il> <49F5F093.5070408@vlnb.net> Message-ID: <49F5F316.5060305@mellanox.com> Vladislav Bolkhovitin wrote: > Chris Worley, on 04/27/2009 07:54 PM wrote: >> On Mon, Apr 27, 2009 at 5:03 AM, Bart Van Assche >> wrote: >>> On Mon, Apr 27, 2009 at 8:27 AM, Jack Morgenstein >>> wrote: >>>> The OFED distributions may contain features that the mainstream >>>> kernels and libraries do not support. >>>> These features frequently require changes in the Infiniband kernel >>>> modules. Such changes are in the form >>>> of kernel patches which are applied to the base mainstream kernel >>>> on which the OFED release is based. >>>> A lag between the mainstream kernel and the OFED kernel is >>>> unavoidable, since the new features are first >>>> released in the OFED distributions -- and later, gradually (and >>>> hopefully), these features make there way >>>> into the upstream kernel. >>> I don't doubt that there is a good reason why new features go in the >>> OFED distribution first and later in the mainstream Linux kernel. >> >> My opinion is: IB is still just too bleeding edge, even for the >> vanilla Linux kernel. >> >> Maybe "Upstream First" is the measure of IB achieving stability. >> >> SRP (specifically the SCST target code) is my first case in using IB >> where I've not been able to start with the latest OFED (or IBGD) >> stable release, as OFED is unsupported by the SRP target code, and had >> to start with a distro's IB version to get a working SRP target (of >> which Ubuntu 8.10 provided the only stable SRP target distro for my >> configuration). 
> > I think, to find out who's guilty, OFED or SRP target driver, you > should simply try the latest SCST/SRP driver from the SCST SVN trunk > with the known working OFED. Only make sure you don't have again mixed > up older and new SCST headers. > Here is the simple rule of thumb: 1. If you want to use latest/greatest top of trunk SCST SVN, kernel 2.6.28, 29... then you have to use the IB driver/modules in that kernel tree. You also have to use the ib_srpt driver in SCST SVN tree 2. If you want to run IB driver, ib_srpt driver and SCST on distribution default kernel (RHEL 5,0/1/2 and its family ie. fedora, centos..., sles 10 sp1/sp2...) then you should use OFED package (with ib_srpt inside the package), SCST-1.0.0 In the OFED-1.xxx/docs directory, there is a readme on how-to ib_srpt From hnrose at comcast.net Mon Apr 27 11:17:53 2009 From: hnrose at comcast.net (Hal Rosenstock) Date: Mon, 27 Apr 2009 14:17:53 -0400 Subject: [ofa-general] [PATCH] infiniband-diags/saquery.c: Display attribute ID in hex rather than decimal Message-ID: <20090427181753.GA20430@comcast.net> for easier correlation to IBA spec Signed-off-by: Hal Rosenstock --- diff --git a/infiniband-diags/src/saquery.c b/infiniband-diags/src/saquery.c index dddebc1..4dcd712 100644 --- a/infiniband-diags/src/saquery.c +++ b/infiniband-diags/src/saquery.c @@ -163,7 +163,7 @@ recv_mad: umad = realloc(umad, umad_size() + len); goto recv_mad; } - IBPANIC("umad_recv failed: attr %u: %s\n", attr, + IBPANIC("umad_recv failed: attr 0x%x: %s\n", attr, strerror(errno)); } From worleys at gmail.com Mon Apr 27 11:29:21 2009 From: worleys at gmail.com (Chris Worley) Date: Mon, 27 Apr 2009 12:29:21 -0600 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules? 
In-Reply-To: <49F5F093.5070408@vlnb.net> References: <200904261431.18620.jackm@dev.mellanox.co.il> <20090426125827.GA6513@sk> <200904270927.40449.jackm@dev.mellanox.co.il> <49F5F093.5070408@vlnb.net> Message-ID: On Mon, Apr 27, 2009 at 11:51 AM, Vladislav Bolkhovitin wrote: > Chris Worley, on 04/27/2009 07:54 PM wrote: >> >> On Mon, Apr 27, 2009 at 5:03 AM, Bart Van Assche >> wrote: >>> >>> On Mon, Apr 27, 2009 at 8:27 AM, Jack Morgenstein >>> wrote: >>>> >>>> The OFED distributions may contain features that the mainstream kernels >>>> and libraries do not support. >>>> These features frequently require changes in the Infiniband kernel >>>> modules.  Such changes are in the form >>>> of kernel patches which are applied to the base mainstream kernel on >>>> which the OFED release is based. >>>> A lag between the mainstream kernel and the OFED kernel is unavoidable, >>>> since the new features are first >>>> released in the OFED distributions -- and later, gradually (and >>>> hopefully), these features make there way >>>> into the upstream kernel. >>> >>> I don't doubt that there is a good reason why new features go in the >>> OFED distribution first and later in the mainstream Linux kernel. >> >> My opinion is: IB is still just too bleeding edge, even for the >> vanilla Linux kernel. >> >> Maybe "Upstream First" is the measure of IB achieving stability. >> >> SRP (specifically the SCST target code) is my first case in using IB >> where I've not been able to start with the latest OFED (or IBGD) >> stable release, as OFED is unsupported by the SRP target code, and had >> to start with a distro's IB version to get a working SRP target (of >> which Ubuntu 8.10 provided the only stable SRP target distro for my >> configuration). > > I think, to find out who's guilty, OFED or SRP target driver, you should > simply try the latest SCST/SRP driver from the SCST SVN trunk with the known > working OFED. Only make sure you don't have again mixed up older and new > SCST headers. 
SCST's latest ip_srpt hung (which was well documented in the SCST list) with OFED 1.4 (and 1.4.1rc3)... to which the reply was "SCST doesn't support OFED, only distros", so the Ubuntu 8.10 was the only recourse, as it uses more up-to-date drivers from OFED (close to OFED 1.4), which solved the reliability issue. There were never "mixed up header" issues, as also documented on the SCST list. Chris From worleys at gmail.com Mon Apr 27 11:40:07 2009 From: worleys at gmail.com (Chris Worley) Date: Mon, 27 Apr 2009 12:40:07 -0600 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules? In-Reply-To: <49F5F316.5060305@mellanox.com> References: <200904261431.18620.jackm@dev.mellanox.co.il> <20090426125827.GA6513@sk> <200904270927.40449.jackm@dev.mellanox.co.il> <49F5F093.5070408@vlnb.net> <49F5F316.5060305@mellanox.com> Message-ID: On Mon, Apr 27, 2009 at 12:01 PM, Vu Pham wrote: > Vladislav Bolkhovitin wrote: >> >> Chris Worley, on 04/27/2009 07:54 PM wrote: >>> >>> On Mon, Apr 27, 2009 at 5:03 AM, Bart Van Assche >>> wrote: >>>> >>>> On Mon, Apr 27, 2009 at 8:27 AM, Jack Morgenstein >>>> wrote: >>>>> >>>>> The OFED distributions may contain features that the mainstream kernels >>>>> and libraries do not support. >>>>> These features frequently require changes in the Infiniband kernel >>>>> modules.  Such changes are in the form >>>>> of kernel patches which are applied to the base mainstream kernel on >>>>> which the OFED release is based. >>>>> A lag between the mainstream kernel and the OFED kernel is unavoidable, >>>>> since the new features are first >>>>> released in the OFED distributions -- and later, gradually (and >>>>> hopefully), these features make there way >>>>> into the upstream kernel. >>>> >>>> I don't doubt that there is a good reason why new features go in the >>>> OFED distribution first and later in the mainstream Linux kernel. >>> >>> My opinion is: IB is still just too bleeding edge, even for the >>> vanilla Linux kernel. 
>>> >>> Maybe "Upstream First" is the measure of IB achieving stability. >>> >>> SRP (specifically the SCST target code) is my first case in using IB >>> where I've not been able to start with the latest OFED (or IBGD) >>> stable release, as OFED is unsupported by the SRP target code, and had >>> to start with a distro's IB version to get a working SRP target (of >>> which Ubuntu 8.10 provided the only stable SRP target distro for my >>> configuration). >> >> I think, to find out who's guilty, OFED or SRP target driver, you should >> simply try the latest SCST/SRP driver from the SCST SVN trunk with the known >> working OFED. Only make sure you don't have again mixed up older and new >> SCST headers. >> > Here is the simple rule of thumb: > 1. If you want to use latest/greatest top of trunk SCST SVN, kernel 2.6.28, > 29... then you have to use the IB driver/modules in that kernel tree. You > also have to use the ib_srpt driver in SCST SVN tree Which is what I did, and as RHEL/CentOS5.[23] use newer OFED drivers, I had to turn to Ubuntu to get a stable SRP target. > > 2. If you want to run IB driver, ib_srpt driver and SCST on distribution > default kernel (RHEL 5,0/1/2 and its family ie. fedora, centos..., sles 10 > sp1/sp2...) then you should use OFED package (with ib_srpt inside the > package), SCST-1.0.0 > In the OFED-1.xxx/docs directory, there is a readme on how-to ib_srpt That doesn't work w/ OFED 1.4 and 1.4.1rc3. OFED's ib_srpt w/ SCST 1.0.0 hangs the system during modprobe, similar to what SCST's ib_srpt did. The latter was well documented in the SCST list, but later it was disclosed that SCST doesn't support any incarnation of OFED. Chris From vst at vlnb.net Mon Apr 27 11:42:29 2009 From: vst at vlnb.net (Vladislav Bolkhovitin) Date: Mon, 27 Apr 2009 22:42:29 +0400 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules?
In-Reply-To: References: <200904261431.18620.jackm@dev.mellanox.co.il> <20090426125827.GA6513@sk> <200904270927.40449.jackm@dev.mellanox.co.il> <49F5F093.5070408@vlnb.net> Message-ID: <49F5FC95.5060508@vlnb.net> Chris Worley, on 04/27/2009 10:29 PM wrote: > On Mon, Apr 27, 2009 at 11:51 AM, Vladislav Bolkhovitin wrote: >> Chris Worley, on 04/27/2009 07:54 PM wrote: >>> On Mon, Apr 27, 2009 at 5:03 AM, Bart Van Assche >>> wrote: >>>> On Mon, Apr 27, 2009 at 8:27 AM, Jack Morgenstein >>>> wrote: >>>>> The OFED distributions may contain features that the mainstream kernels >>>>> and libraries do not support. >>>>> These features frequently require changes in the Infiniband kernel >>>>> modules. Such changes are in the form >>>>> of kernel patches which are applied to the base mainstream kernel on >>>>> which the OFED release is based. >>>>> A lag between the mainstream kernel and the OFED kernel is unavoidable, >>>>> since the new features are first >>>>> released in the OFED distributions -- and later, gradually (and >>>>> hopefully), these features make there way >>>>> into the upstream kernel. >>>> I don't doubt that there is a good reason why new features go in the >>>> OFED distribution first and later in the mainstream Linux kernel. >>> My opinion is: IB is still just too bleeding edge, even for the >>> vanilla Linux kernel. >>> >>> Maybe "Upstream First" is the measure of IB achieving stability. >>> >>> SRP (specifically the SCST target code) is my first case in using IB >>> where I've not been able to start with the latest OFED (or IBGD) >>> stable release, as OFED is unsupported by the SRP target code, and had >>> to start with a distro's IB version to get a working SRP target (of >>> which Ubuntu 8.10 provided the only stable SRP target distro for my >>> configuration). >> I think, to find out who's guilty, OFED or SRP target driver, you should >> simply try the latest SCST/SRP driver from the SCST SVN trunk with the known >> working OFED. 
Only make sure you don't have again mixed up older and new >> SCST headers. > > SCST's latest ip_srpt hung (which was well documented in the SCST > list) with OFED 1.4 (and 1.4.1rc3)... to which the reply was "SCST > doesn't support OFED, only distros", so the Ubuntu 8.10 was the only > recourse, as it uses more up-to-date drivers from OFED (close to OFED > 1.4), which solved the reliability issue. > > There were never "mixed up header" issues, as also documented on the SCST list. In the SCST list clearly documented how well you mixed them up :/ > Chris > From sean.hefty at intel.com Mon Apr 27 13:01:26 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 27 Apr 2009 13:01:26 -0700 Subject: [ofa-general] RE: [PATCH 4/4] ib-mgmt/ibn3 branch: libibnetdisc add windows support In-Reply-To: <20090425100710.GA28604@sk> References: <5A8239DBACE14A9281405317269D03A3@amr.corp.intel.com> <20090425100710.GA28604@sk> Message-ID: <6C01B2F7A1A246E19866D57ABAE2B70F@amr.corp.intel.com> >> +#include > >Why is this inclusion needed? mad_osd.h is included via mad.h. It's not then, but I prefer to include necessary files directly, rather than relying on other include files to pick them up. From sashak at voltaire.com Mon Apr 27 13:08:37 2009 From: sashak at voltaire.com (Sasha Khapyorsky) Date: Mon, 27 Apr 2009 23:08:37 +0300 Subject: [ofa-general] Re: [PATCH 4/4] ib-mgmt/ibn3 branch: libibnetdisc add windows support In-Reply-To: <6C01B2F7A1A246E19866D57ABAE2B70F@amr.corp.intel.com> References: <5A8239DBACE14A9281405317269D03A3@amr.corp.intel.com> <20090425100710.GA28604@sk> <6C01B2F7A1A246E19866D57ABAE2B70F@amr.corp.intel.com> Message-ID: <20090427200837.GF16078@sk.ofa> On 13:01 Mon 27 Apr , Sean Hefty wrote: > > It's not then, but I prefer to include necessary files directly, rather than > relying on other include files to pick them up. 
I would agree in general, but in this specific case it is *_osd.h - system dependent file which is not included directly, at least not in libibmad and infiniband-diags up to now (hypothetically in some implementations it may not exist at all). Sasha From rdreier at cisco.com Mon Apr 27 13:30:48 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 27 Apr 2009 13:30:48 -0700 Subject: [ofa-general] Re: [PATCH 4/4] RDMA/nes: set trace length to 1 inch for SFP_D In-Reply-To: <20090422003300.GA3884@ctung-MOBL> (Chien Tung's message of "Tue, 21 Apr 2009 19:33:00 -0500") References: <20090422003300.GA3884@ctung-MOBL> Message-ID: thanks, applied 1-4 From rdreier at cisco.com Mon Apr 27 13:34:01 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 27 Apr 2009 13:34:01 -0700 Subject: [ofa-general] [PATCH] RDMA/nes: fix fw_ver in /sys In-Reply-To: <20090422010044.GA4412@ctung-MOBL> (Chien Tung's message of "Tue, 21 Apr 2009 20:00:44 -0500") References: <20090422010044.GA4412@ctung-MOBL> Message-ID: thanks, applied From rdreier at cisco.com Mon Apr 27 13:36:41 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 27 Apr 2009 13:36:41 -0700 Subject: [ofa-general] Re: [PATCH] RDMA/nes: remove compile warning nes_cm.c:857 without INFINIBAND_NES_DEBUG In-Reply-To: <20090422011709.GA4228@ctung-MOBL> (Chien Tung's message of "Tue, 21 Apr 2009 20:17:09 -0500") References: <20090422011709.GA4228@ctung-MOBL> Message-ID: thanks, applied From rdreier at cisco.com Mon Apr 27 13:41:15 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 27 Apr 2009 13:41:15 -0700 Subject: [ofa-general] Re: [PATCH 4/4] RDMA/nes: Fix hang issues for large cluster dynamic connections In-Reply-To: <20090422190958.GA21652@flatif-MOBL> (Faisal Latif's message of "Wed, 22 Apr 2009 14:09:58 -0500") References: <20090422190958.GA21652@flatif-MOBL> Message-ID: thanks, applied 1-4 From rdreier at cisco.com Mon Apr 27 13:45:41 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 27 Apr 2009 
13:45:41 -0700 Subject: [ofa-general] Re: [PATCH] RDMA/nes: fix error handling In-Reply-To: <20090424160658.GA17724@flatif-MOBL> (Faisal Latif's message of "Fri, 24 Apr 2009 11:06:58 -0500") References: <20090424160658.GA17724@flatif-MOBL> Message-ID: thanks, applied. From rdreier at cisco.com Mon Apr 27 13:46:36 2009 From: rdreier at cisco.com (Roland Dreier) Date: Mon, 27 Apr 2009 13:46:36 -0700 Subject: [ofa-general] Re: Subject: [PATCH] RDMA/nes: Update iw_nes version In-Reply-To: <20090413152841.GA3648@ctung-MOBL> (Chien Tung's message of "Mon, 13 Apr 2009 10:28:41 -0500") References: <20090413152841.GA3648@ctung-MOBL> Message-ID: thanks, applied. From sean.hefty at intel.com Mon Apr 27 14:15:57 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 27 Apr 2009 14:15:57 -0700 Subject: [ofa-general] RE: [PATCH 4/4] ib-mgmt/ibn3 branch: libibnetdisc add windows support In-Reply-To: <20090427200837.GF16078@sk.ofa> References: <5A8239DBACE14A9281405317269D03A3@amr.corp.intel.com> <20090425100710.GA28604@sk> <6C01B2F7A1A246E19866D57ABAE2B70F@amr.corp.intel.com> <20090427200837.GF16078@sk.ofa> Message-ID: >I would agree in general, but in this specific case it is *_osd.h - >system dependent file which is not included directly, at least not in >libibmad and infiniband-diags up to now (hypothetically in some >implementations it may not exist at all). libibmad mad.h includes mad_osd.h directly. I added it to ibnetdisc.h, because libibnetdisc is a new library and requires OS dependent mechanisms (i.e. MAD_EXPORT) to export the new interfaces. I agree in trying to keep mad_osd.h out of the diags, but libibnetdisc is special within the diags... I really don't have a strong preference on this, so whatever you want is fine. 
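[Editorial sketch, not part of the original message.] The point about OS-dependent export mechanisms can be illustrated with a small C example. It assumes a hypothetical macro named DEMO_EXPORT and a hypothetical function demo_total_ports(); neither is the real MAD_EXPORT definition nor part of libibmad or libibnetdisc. The idea is only that one platform-specific header decides how a public function is marked for export, and every public interface of the library uses that macro.

```c
/*
 * Hypothetical stand-in for an *_osd.h header: the single place that
 * knows how the current platform marks a symbol as exported.
 * DEMO_EXPORT is an illustrative name, not the real MAD_EXPORT.
 */
#if defined(_WIN32)
#define DEMO_EXPORT __declspec(dllexport)
#elif defined(__GNUC__)
#define DEMO_EXPORT __attribute__((visibility("default")))
#else
#define DEMO_EXPORT
#endif

/* A public library interface tagged with the export macro (hypothetical). */
DEMO_EXPORT int demo_total_ports(const int *ports_per_node, int nodes)
{
	int total = 0;

	for (int i = 0; i < nodes; i++)
		total += ports_per_node[i];
	return total;
}
```

Because the library header itself uses the macro, including the header that defines it directly (rather than relying on another header to pull it in) keeps the dependency explicit, which is the preference stated above.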
From weiny2 at llnl.gov Mon Apr 27 14:16:03 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Mon, 27 Apr 2009 14:16:03 -0700 Subject: [ofa-general] Re: [PATCH v3 1/3] Create a new library libibnetdisc In-Reply-To: <20090424175325.GD5465@sk> References: <20090403154251.dec181f2.weiny2@llnl.gov> <20090424175325.GD5465@sk> Message-ID: <20090427141603.84e8110f.weiny2@llnl.gov> Sasha, On Fri, 24 Apr 2009 20:53:25 +0300 Sasha Khapyorsky wrote: > On 15:42 Fri 03 Apr , Ira Weiny wrote: > > diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h [snip] > > + /* quick cache of switchinfo below */ > > + int smaenhsp0; > > + /* use libibmad decoder functions for switchinfo */ > > + //WHY does this not work??? > > + //uint8_t switchinfo[sizeof (ib_switch_info_t)]; > > This is a right question - sizeof(ib_switch_info_t) < 64. Ok, I missed this. And forgot about that comment! Thanks for fixing. I did not experience any crashes though. :-/ Good thing you caught this. Ira > > > + uint8_t switchinfo[64]; > > + > > + /* quick cache of info below */ > > + uint64_t guid; > > + int type; > > + int numports; > > + /* use libibmad decoder functions for info */ > > + uint8_t info[sizeof(ib_node_info_t)]; > > Above, here and in some other places. Those buffers are used as rcvdata > with smp_query_via(), it assumes SMP MADs and 64 bytes of data is always > copied there. So when actual buffer is smaller bad things may happen. 
> I'm fixing this with such addition: > > diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h > index a882994..bc108ab 100644 > --- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h > +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h > @@ -57,18 +57,16 @@ typedef struct node { > /* quick cache of switchinfo below */ > int smaenhsp0; > /* use libibmad decoder functions for switchinfo */ > - //WHY does this not work??? > - //uint8_t switchinfo[sizeof (ib_switch_info_t)]; > - uint8_t switchinfo[64]; > + uint8_t switchinfo[IB_SMP_DATA_SIZE]; > > /* quick cache of info below */ > uint64_t guid; > int type; > int numports; > /* use libibmad decoder functions for info */ > - uint8_t info[sizeof(ib_node_info_t)]; > + uint8_t info[IB_SMP_DATA_SIZE]; > > - char nodedesc[IB_NODE_DESCRIPTION_SIZE]; > + char nodedesc[IB_SMP_DATA_SIZE]; > > struct port **ports; /* in order array of port pointers */ > /* the size of this array is info.numports + 1 */ > @@ -96,7 +94,7 @@ typedef struct port { > uint16_t base_lid; > uint8_t lmc; > /* use libibmad decoder functions for info */ > - uint8_t info[sizeof(ib_port_info_t)]; > + uint8_t info[IB_SMP_DATA_SIZE]; > } ibnd_port_t; > > > diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c > index 479bae7..3fd3b76 100644 > --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c > +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c > @@ -231,7 +231,7 @@ ibnd_find_node_guid(ibnd_fabric_t *fabric, uint64_t guid) > ibnd_node_t * > ibnd_update_node(ibnd_node_t *node) > { > - char portinfo_port0[sizeof (ib_port_info_t)]; > + char portinfo_port0[IB_SMP_DATA_SIZE]; > void *nd = node->nodedesc; > int p = 0; > struct ibnd_fabric *f = CONV_FABRIC_INTERNAL(node->fabric); > > Sasha -- Ira Weiny Math Programmer/Computer Scientist Lawrence Livermore National Lab weiny2 at llnl.gov From sean.hefty 
at intel.com Mon Apr 27 14:23:07 2009 From: sean.hefty at intel.com (Sean Hefty) Date: Mon, 27 Apr 2009 14:23:07 -0700 Subject: [ofa-general] [PATCH/Resend] Fixed capability mask problem in ibstat introduec by commit 722b6c6428c9e4921a81f4a6db2838bcee660bb7 In-Reply-To: <20090425210255.GL28604@sk> References: <49F16310.1080902@ext.bull.net> <132D7B1EACCC462387A1C7FB9EAC4F2D@amr.corp.intel.com> <20090425210255.GL28604@sk> Message-ID: <112DCCB086FF41ABA5940BB49318165C@amr.corp.intel.com> >OTOH I cannot understand why port->capmask is defined as uint64_t and >not as 32-bit. Kernel uses 32-bit value and it is shown in this file as >0x%0x. > >What about to convert type of port->capmask to uint32_t? I think that makes the most sense. From weiny2 at llnl.gov Mon Apr 27 14:25:33 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Mon, 27 Apr 2009 14:25:33 -0700 Subject: [ofa-general] Re: [PATCH v3 3/3] Convert ibnetdiscover to use new ibnetdisc library. In-Reply-To: <20090425103216.GB28604@sk> References: <20090403154301.f656e7a4.weiny2@llnl.gov> <20090423082535.GD8281@sk> <20090423100206.c2621310.weiny2@llnl.gov> <20090425103216.GB28604@sk> Message-ID: <20090427142533.85f00f4d.weiny2@llnl.gov> On Sat, 25 Apr 2009 13:32:17 +0300 Sasha Khapyorsky wrote: > On 10:02 Thu 23 Apr , Ira Weiny wrote: > > > > Somewhere along the line I broke this and then this got put into the > > ibnetdiscover patch. This should not even have been here. Anyway, LDFLAGS is > > required for the -L I believe? > > 'info automake' says (Top -> Programs -> A Program): > > `PROG_LDADD' is inappropriate for passing program-specific linker > flags (except for `-l', `-L', `-dlopen' and `-dlpreopen'). So, use > the `PROG_LDFLAGS' variable for this purpose. > > So '-L' is exception suitable for LDADD. Ah ok, I did not know about the exception. We can change if you prefer. 
Ira > > Sasha -- Ira Weiny Math Programmer/Computer Scientist Lawrence Livermore National Lab weiny2 at llnl.gov From weiny2 at llnl.gov Mon Apr 27 14:27:53 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Mon, 27 Apr 2009 14:27:53 -0700 Subject: [ofa-general] Re: [PATCH] ibnetdiscover: fix types to avoid portability castings In-Reply-To: <20090425144224.GC28604@sk> References: <5A8239DBACE14A9281405317269D03A3@amr.corp.intel.com> <20090425144224.GC28604@sk> Message-ID: <20090427142753.03e989bf.weiny2@llnl.gov> :-/ Sorry I thought I got all the ibnetdiscover patches migrated over. Ira On Sat, 25 Apr 2009 17:42:24 +0300 Sasha Khapyorsky wrote: > > We did this before, but somehow it was lost in libibnetdisc patches. > > Signed-off-by: Sasha Khapyorsky > --- > .../libibnetdisc/include/infiniband/ibnetdisc.h | 4 ++-- > infiniband-diags/src/ibnetdiscover.c | 6 +++--- > 2 files changed, 5 insertions(+), 5 deletions(-) > > diff --git a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h > index 8324ca9..4fe0f21 100644 > --- a/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h > +++ b/infiniband-diags/libibnetdisc/include/infiniband/ibnetdisc.h > @@ -104,10 +104,10 @@ typedef struct port { > typedef struct chassis { > struct chassis *next; > uint64_t chassisguid; > - int chassisnum; > + unsigned char chassisnum; > > /* generic grouping by SystemImageGUID */ > - int nodecount; > + unsigned char nodecount; > ibnd_node_t *nodes; > > /* specific to voltaire type nodes */ > diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c > index 2ca696e..e874fe4 100644 > --- a/infiniband-diags/src/ibnetdiscover.c > +++ b/infiniband-diags/src/ibnetdiscover.c > @@ -205,12 +205,12 @@ out_ids(ibnd_node_t *node, int group, char *chname) > } > > uint64_t > -out_chassis(ibnd_fabric_t *fabric, int chassisnum) > +out_chassis(ibnd_fabric_t *fabric, unsigned char chassisnum) > { > 
uint64_t guid; > > - fprintf(f, "\nChassis %d", chassisnum); > - guid = ibnd_get_chassis_guid(fabric, (unsigned char) chassisnum); > + fprintf(f, "\nChassis %u", chassisnum); > + guid = ibnd_get_chassis_guid(fabric, chassisnum); > if (guid) > fprintf(f, " (guid 0x%" PRIx64 ")", guid); > fprintf(f, "\n"); > -- > 1.6.1.2.319.gbd9e > -- Ira Weiny Math Programmer/Computer Scientist Lawrence Livermore National Lab weiny2 at llnl.gov From weiny2 at llnl.gov Mon Apr 27 14:50:26 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Mon, 27 Apr 2009 14:50:26 -0700 Subject: [ofa-general] Re: [PATCH 8/8] Convert ibqueryerrors.pl to C and use new ibnetdisc library. In-Reply-To: <20090425155441.GE28604@sk> References: <20090423133120.acf0af63.weiny2@llnl.gov> <20090425155441.GE28604@sk> Message-ID: <20090427145026.7e074ffc.weiny2@llnl.gov> On Sat, 25 Apr 2009 18:54:41 +0300 Sasha Khapyorsky wrote: > On 13:31 Thu 23 Apr , Ira Weiny wrote: > > diff --git a/infiniband-diags/src/ibqueryerrors.c b/infiniband-diags/src/ibqueryerrors.c > > new file mode 100644 > > index 0000000..9d96190 > > +#include [snip] > > + > > +#include > > AFAIR WinOF doesn't like such inclusion and uses > > #include > > instead. I'm changing. Ok, sorry. > > > +#include > > +#include > > + > > +#include "ibdiag_common.h" > > + > > +char *argv0 = "ibqueryerrors"; > > argv0 variable is not needed if you are using ibdiag_common stuff. > Removing. cool, that was left over sorry. > > > +static FILE *f; > > I don't see where 'f' is used (except 'f = stdout;' below). Removing. > > [snip...] 
> > > +static int process_opt(void *context, int ch, char *optarg) > > +{ > > + switch (ch) { > > + case 's': > > + calculate_suppressed_fields(optarg); > > + break; > > + case 'c': > > + /* Right now this is the only "common" error */ > > + add_suppressed(IB_PC_ERR_SWITCH_REL_F); > > + break; > > + case 1: > > + node_name_map_file = strdup(optarg); > > + break; > > + case 2: > > + data_counters++; > > + break; > > + case 3: > > + all_nodes++; > > + break; > > + case 'S': > > + switch_guid_str = strdup(optarg); > > Why should optarg be strdup()ed? Well it does not have to be strdup'ed however, switch_guid_str needs to be set for the call to ib_resolve_portid_str_via below. ... if (ib_resolve_portid_str_via(&portid, switch_guid_str, IB_DEST_GUID, ... The removal of this line causes the '-S' option to segfault. Patch to pq/ibn4 is below. [snip] > > + > > + if (switch_guid) { > > + /* limit the scan the fabric around the target */ > > + ib_portid_t portid = {0}; > > + > > + if (ib_resolve_portid_str_via(&portid, switch_guid_str, IB_DEST_GUID, > > + ibd_sm_id, ibmad_port) < 0) { > > + fprintf(stderr, "can't resolve destination port %s %p\n", > > + switch_guid_str, ibd_sm_id); > > + rc = 1; > > + goto close_port; > > + } > > + > > + if ((fabric = ibnd_discover_fabric(ibmad_port, ibd_timeout, &portid, 1)) == NULL) { > > + fprintf(stderr, "discover failed\n"); > > + rc = 1; > > + goto close_port; > > + } > > + } else { > > + if ((fabric = ibnd_discover_fabric(ibmad_port, ibd_timeout, NULL, -1)) == NULL) { > > + fprintf(stderr, "discover failed\n"); > > + rc = 1; > > + goto close_port; > > + } > > Above you are using IBERROR(), here is fprintf(stderr, ...). Could it be > consistent? (if yes - it is subsequent patch). I used fprintf here to allow the goto to close the port, rather than let IBERROR exit out. We already discussed this on the list but this gives a better example to users that they are to close the port. I can change it if you like. 
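[Editorial sketch, not part of the original message.] The reasoning above (report the error with fprintf and jump to a label that closes the port, instead of calling a macro that exits immediately) is the usual goto-cleanup pattern in C. Here is a minimal, self-contained sketch. All demo_* names are hypothetical stand-ins and are not the libibmad or libibnetdisc API.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for an opened MAD port; not the libibmad type. */
struct demo_port {
	int is_open;
};

static int demo_ports_open;	/* number of ports currently open */

static struct demo_port *demo_open_port(void)
{
	struct demo_port *p = calloc(1, sizeof(*p));

	if (p) {
		p->is_open = 1;
		demo_ports_open++;
	}
	return p;
}

static void demo_close_port(struct demo_port *p)
{
	if (p) {
		demo_ports_open--;
		free(p);
	}
}

/* Pretend discovery step that fails on request. */
static int demo_discover(struct demo_port *p, int should_fail)
{
	(void)p;
	return should_fail ? -1 : 0;
}

/* Returns 0 on success, 1 on failure; the port is closed on both paths. */
int demo_run(int should_fail)
{
	int rc = 0;
	struct demo_port *port = demo_open_port();

	if (!port) {
		fprintf(stderr, "open failed\n");
		return 1;
	}

	if (demo_discover(port, should_fail) < 0) {
		fprintf(stderr, "discover failed\n");
		rc = 1;
		goto close_port;	/* skip the work, but still clean up */
	}

	/* ... use the discovered fabric here ... */

close_port:
	demo_close_port(port);
	return rc;
}
```

Calling demo_run(1) takes the failure path and still releases the port, which is the behavior an exit-on-error macro would skip.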
> > > + } > > + > > + report_suppressed(); > > + > > + if (switch_guid) { > > + ibnd_node_t *node = ibnd_find_node_guid(fabric, switch_guid); > > + print_node(node, NULL); > > + } else if (dr_path) { > > + ibnd_node_t *node = ibnd_find_node_dr(fabric, dr_path); > > + print_node(node, NULL); > > When GUID or DR Path are specified we don't need to discover whole > fabric, but can try to resolve LID using SA or querying PortInfo. > > Although when in GUID is specified and SA is not responsive there is > probably no other choice than discover. > :-( good point. Discovering only part of the fabric was a huge speed improvement but if the resolve does not succeed I should do a full discover. I will work up a separate patch. Right now you are correct if the SA is unresponsive the "-S" option will fail. iblinkinfo does the full scan every time. But that slows down the query for a single switch to the same O(n) query that a full system scan requires. I would rather have that query be O(1). So I implemented ibqueryerrors in this manner with the intent of going back and "fixing" iblinkinfo. I think having a fall back on a full system scan is a good idea. Patch for both tools will follow... :-D Ira From: Ira Weiny Date: Mon, 27 Apr 2009 14:47:08 -0700 Subject: [PATCH] switch_guid_str is required for the string resolve function. 
Signed-off-by: Ira Weiny --- infiniband-diags/src/ibqueryerrors.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/infiniband-diags/src/ibqueryerrors.c b/infiniband-diags/src/ibqueryerrors.c index 52bd036..09861be 100644 --- a/infiniband-diags/src/ibqueryerrors.c +++ b/infiniband-diags/src/ibqueryerrors.c @@ -364,6 +364,7 @@ static int process_opt(void *context, int ch, char *optarg) all_nodes++; break; case 'S': + switch_guid_str = optarg; switch_guid = strtoull(optarg, 0, 0); break; case 'D': -- 1.5.4.5 From weiny2 at llnl.gov Mon Apr 27 15:04:09 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Mon, 27 Apr 2009 15:04:09 -0700 Subject: [ofa-general] Re: [PATCH 0/5] Follow on patch series to libibnetdisc including converting ibqueryerrors.pl In-Reply-To: <20090425175710.GI28604@sk> References: <20090422185441.6f8601dc.weiny2@llnl.gov> <20090425175710.GI28604@sk> Message-ID: <20090427150409.9c10e479.weiny2@llnl.gov> On Sat, 25 Apr 2009 20:57:10 +0300 Sasha Khapyorsky wrote: > Hi, > > On 18:54 Wed 22 Apr , Ira Weiny wrote: > > > > When do you plan to merge pq/ibn3? > > I merged all libibnetdiscover related patch series with noted fixes into > new branch pq/ibn4 and pushed it out. Please verify it once again > (including win compatibility) before merging upstream. > > I also figured out that ibnetdiscover changes an order of switches and > ports (now it is from high to low). It could be not bad things for > debugging scripts which use this output, but for human readability I > think reverse order (from low to high) is preferable. > The port output should be from low to high. The following is an example from my output. 
vendid=0x8f1 devid=0x5a30 sysimgguid=0x8f10400411b19 switchguid=0x8f10400411b18(8f10400411b18) Switch 24 "S-0008f10400411b18" # "ISR9024D Voltaire" base port 0 lid 15 lmc 0 [4] "H-0002c90300000388"[1](2c90300000389) # "woprjr3" lid 20 4xSDR [6] "H-0002c902002268c4"[1](2c902002268c5) # "woprjr4" lid 28 4xDDR [12] "S-000b8cffff00490c"[22] # "MT47396 Infiniscale-III Mellanox Technologies" lid 14 4xDDR [23] "S-000b8cffff00490c"[11] # "MT47396 Infiniscale-III Mellanox Technologies" lid 14 4xDDR [24] "S-000b8cffff00490c"[10] # "MT47396 Infiniscale-III Mellanox Technologies" lid 14 4xDDR vendid=0x2c9 devid=0xb924 sysimgguid=0xb8cffff00490c switchguid=0xb8cffff00490c(b8cffff00490c) Switch 24 "S-000b8cffff00490c" # "MT47396 Infiniscale-III Mellanox Technologies" base port 0 lid 14 lmc 0 [10] "S-0008f10400411b18"[24] # "ISR9024D Voltaire" lid 15 4xDDR [11] "S-0008f10400411b18"[23] # "ISR9024D Voltaire" lid 15 4xDDR [12] "S-0005ad0000092106"[8] # "Cisco Switch SFS7000D" lid 10 4xDDR [22] "S-0008f10400411b18"[12] # "ISR9024D Voltaire" lid 15 4xDDR [23] "S-0005ad0000092106"[20] # "Cisco Switch SFS7000D" lid 10 4xDDR [24] "S-0005ad0000092106"[21] # "Cisco Switch SFS7000D" lid 10 4xDDR vendid=0x5ad devid=0xb924 sysimgguid=0x5ad0301092106 switchguid=0x5ad0000092106(5ad0000092106) Switch 24 "S-0005ad0000092106" # "Cisco Switch SFS7000D" enhanced port 0 lid 10 lmc 0 [1] "H-0002c9020023c288"[1](2c9020023c289) # "woprjr0" lid 4 4xDDR [2] "H-0002c90300000378"[1](2c90300000379) # "woprjr1" lid 24 4xDDR [3] "H-0002c90300002378"[1](2c90300002379) # "woprjr2" lid 16 4xDDR [8] "S-000b8cffff00490c"[12] # "MT47396 Infiniscale-III Mellanox Technologies" lid 14 4xDDR [20] "S-000b8cffff00490c"[23] # "MT47396 Infiniscale-III Mellanox Technologies" lid 14 4xDDR [21] "S-000b8cffff00490c"[24] # "MT47396 Infiniscale-III Mellanox Technologies" lid 14 4xDDR What do you see? 
Ira From dledford at redhat.com Mon Apr 27 16:03:14 2009 From: dledford at redhat.com (Doug Ledford) Date: Mon, 27 Apr 2009 19:03:14 -0400 Subject: [ofa-general] Possible bug of inkernel OFED of RHEL5.3? In-Reply-To: <49F12CB4.6030302@sun.com> References: <49F12CB4.6030302@sun.com> Message-ID: <8C644F36-107F-4772-A684-EDF5E06CF05A@redhat.com> On Apr 23, 2009, at 11:06 PM, Liang Zhen wrote: > Hi there, > I've posted this in rhel5-list, but I'm not sure whether it's the > right > place so I post it here again... I'm not sure which rhel5-list you are referring to, but I'm certain I'm not on it, and I'm certain that it's not one of our SLA assured support mechanisms. > We got this assertion while running inkernel OFED of RHEL5.3: > > Apr 15 08:06:24 cl8-0 kernel: RTNL: assertion failed at > net/core/fib_rules.c (388) > Apr 15 08:06:24 cl8-0 kernel: > Apr 15 08:06:24 cl8-0 kernel: Call Trace: > Apr 15 08:06:24 cl8-0 kernel: [] fib_rules_event > +0x3d/0xff > Apr 15 08:06:24 cl8-0 kernel: [] > notifier_call_chain+0x20/0x32 > Apr 15 08:06:24 cl8-0 kernel: [] dev_set_mtu+0x5a/ > 0x60 > Apr 15 08:06:24 cl8-0 kernel: [] > :ib_ipoib:set_mode+0x94/0x134 > Apr 15 08:06:24 cl8-0 kernel: [] > sysfs_write_file+0xb9/0xe8 > Apr 15 08:06:24 cl8-0 kernel: [] vfs_write+0xce/ > 0x174 > Apr 15 08:06:24 cl8-0 kernel: [] sys_write+0x45/0x6e > Apr 15 08:06:24 cl8-0 kernel: [] system_call+0x7e/ > 0x83 > Apr 15 08:06:24 cl8-0 kernel: > Apr 15 08:06:24 cl8-0 kernel: RTNL: assertion failed at > net/ipv4/devinet.c (986) > Apr 15 08:06:24 cl8-0 kernel: > Apr 15 08:06:24 cl8-0 kernel: Call Trace: > Apr 15 08:06:24 cl8-0 kernel: [] inetdev_event > +0x48/0x282 > Apr 15 08:06:24 cl8-0 kernel: [] > notifier_call_chain+0x20/0x32 > Apr 15 08:06:24 cl8-0 kernel: [] dev_set_mtu+0x5a/ > 0x60 > Apr 15 08:06:24 cl8-0 kernel: [] > :ib_ipoib:set_mode+0x94/0x134 > Apr 15 08:06:24 cl8-0 kernel: [] > sysfs_write_file+0xb9/0xe8 > Apr 15 08:06:24 cl8-0 kernel: [] vfs_write+0xce/ > 0x174 > Apr 15 08:06:24 cl8-0 
kernel: [] sys_write+0x45/0x6e > Apr 15 08:06:24 cl8-0 kernel: [] system_call+0x7e/ > 0x83 > Apr 15 08:06:24 cl8-0 kernel: > > When looking into code I found: > > sysfs_write_file()->flush_write_buffer()->store()- > >ipoib_cm.c::set_mode()->dev_set_mtu()->raw_notifier_call_chain- > >notifier_call_chain()->fib_rules_event()->ASSERT_RTNL(). > So, ipoib_cm called dev_set_mtu without rtnl_lock, but dev_set_mtu > will assert caller already has rtnl_lock. > > I think we may need this patch, could somebody confirm this? > > Thanks > Liang > > --- drivers/infiniband/ulp/ipoib/ipoib_cm.c 2009-04-16 > 12:49:04.000000000 -0400 > +++ drivers/infiniband/ulp/ipoib/ipoib_cm.c 2009-04-16 > 12:48:52.000000000 -0400 > @@ -1481,7 +1481,9 @@ static ssize_t set_mode(struct class_dev > if (ipoib_cm_max_mtu(dev) > priv->mcast_mtu) > ipoib_warn(priv, "mtu > %d will cause multicast packet drops.\n", > priv->mcast_mtu); > + rtnl_lock(); > dev_set_mtu(dev, ipoib_cm_max_mtu(dev)); > + rtnl_unlock(); > > ipoib_flush_paths(dev); > return count; No, you don't want this patch. The infinband core in OFED 1.3.2 (used in rhel5.3) is not ready for this patch. There are additional changes needed to the core code to deal with handling work queue flushes and deciding whether or not to process events during those work queue flushes depending on the code path that we got to that point from. Without that additional infrastructure changes, the change to use dev_set_mtu and take the rtnl_lock resulted in lockups during attempts to ifdown interfaces (either ones in connected mode or unconnected mode, can't remember which, but one way worked and the other was lockup city). We reverted this patch due to those lockups. Instead, we only support setting connected mode and setting the device mtu as part of the bringup of the interface (aka, ifup ib0 when you've added CONNECTED_MODE=yes and MTU=65520 to /etc/sysconfig/network-scripts/ ifcfg-ib0). 
Under those conditions, the kernel works fine and does not present a risk. -- Doug Ledford GPG KeyID: CFBFF194 http://people.redhat.com/dledford InfiniBand Specific RPMS http://people.redhat.com/dledford/Infiniband -------------- next part -------------- A non-text attachment was scrubbed... Name: PGP.sig Type: application/pgp-signature Size: 203 bytes Desc: This is a digitally signed message part URL: From jackm at dev.mellanox.co.il Mon Apr 27 22:46:31 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 28 Apr 2009 08:46:31 +0300 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules? In-Reply-To: References: <200904271747.43905.jackm@dev.mellanox.co.il> Message-ID: <200904280846.31920.jackm@dev.mellanox.co.il> On Monday 27 April 2009 20:31, Bart Van Assche wrote: > . As an example, commit > 233e70f4228e78eb2f80dc6650f65d3ae3dbf17c was applied to Linus' tree on > October 19, 2008. .... I could not find any trace of this > patch in the OFED distribution -- not even in > OFED-1.4.1-20090427-0600. That is because OFED 1.4.1 is based upon kernel 2.6.27 -- and the patch you mention only entered kernel 2.6.28. OFED 1.5 (coming out in a couple of months) is in development now, and will be based on kernel 2.6.30 and does contain this patch -- and all others that were committed to the kernel up to then (we take the mainstream kernel tree as-is, and adapt/delete the various patches and backport fixes from previous OFEDs; that way the infiniband drivers always contain all fixes which were applied to the kernel version on which the OFED release is based). - Jack From bart.vanassche at gmail.com Mon Apr 27 23:53:34 2009 From: bart.vanassche at gmail.com (Bart Van Assche) Date: Tue, 28 Apr 2009 08:53:34 +0200 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules? 
In-Reply-To: <49F5F316.5060305@mellanox.com> References: <200904261431.18620.jackm@dev.mellanox.co.il> <20090426125827.GA6513@sk> <200904270927.40449.jackm@dev.mellanox.co.il> <49F5F093.5070408@vlnb.net> <49F5F316.5060305@mellanox.com> Message-ID: On Mon, Apr 27, 2009 at 8:01 PM, Vu Pham wrote: > Here is the simple rule of thumb: > 1. If you want to use latest/greatest top of trunk SCST SVN, kernel 2.6.28, > 29... then you have to use the IB driver/modules in that kernel tree. You > also have to use the ib_srpt driver in SCST SVN tree > > 2. If you want to run IB driver, ib_srpt driver and SCST on distribution > default kernel (RHEL 5,0/1/2 and its family ie. fedora, centos..., sles 10 > sp1/sp2...) then you should use OFED package (with ib_srpt inside the > package), SCST-1.0.0 > In the OFED-1.xxx/docs directory, there is a readme on how-to ib_srpt Hello Vu, Thanks for jumping in on this thread. Regarding the ib_srpt driver in the SCST Subversion repository: while this driver works great with any mainstream kernel it has been tested with, the combination RHEL 5.3 + OFED 1.4.1rc3 + ib_srpt from the SCST trunk does not work. Would it be possible for one of the OFED Q.A. people to have a look at this ? Bart. From jackm at dev.mellanox.co.il Tue Apr 28 01:01:37 2009 From: jackm at dev.mellanox.co.il (Jack Morgenstein) Date: Tue, 28 Apr 2009 11:01:37 +0300 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules? In-Reply-To: <20090427165555.GJ4431@obsidianresearch.com> References: <200904270927.40449.jackm@dev.mellanox.co.il> <20090427165555.GJ4431@obsidianresearch.com> Message-ID: <200904281101.38009.jackm@dev.mellanox.co.il> On Monday 27 April 2009 19:55, Jason Gunthorpe wrote: > OFED tests the past - back ports to old distributions and a random > non-upstream collection of patches ontop of that. That is fine for end > users, but.. > That is not quite the case. 
We do test regression on the base kernel of a given OFED distribution (That is -- on a system which runs the base kernel, but installing OFED, which does include kernel_patches/fixes), so we do verify that OFED does run properly on its base kernel. We are not testing just the past. - Jack From vlad at lists.openfabrics.org Tue Apr 28 03:22:28 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Tue, 28 Apr 2009 03:22:28 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090428-0200 daily build status Message-ID: <20090428102229.1FC4BE6134F@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 
with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From Diego.Moreno-Lazaro at bull.net Tue Apr 28 05:45:01 2009 From: Diego.Moreno-Lazaro at bull.net (Diego Moreno) Date: Tue, 28 Apr 2009 14:45:01 +0200 Subject: [Fwd: Re: [ofa-general] [NFS/RDMA] Can't mount NFS/RDMA partition]] In-Reply-To: <49F5EC83.6050608@mellanox.com> References: <49F02280.7010005@ext.bull.net> <49F07710.3070002@opengridcomputing.com> <49F19ECE.9080007@ext.bull.net> <49F1CF6C.3090703@opengridcomputing.com> <49F1FCEF.3030305@mellanox.com> <49F207A0.6090507@mellanox.com> <49F58F6E.7080400@ext.bull.net> <49f5a9c8.0e35640a.61a5.7852@mx.google.com> <49F5BBAD.9000900@ext.bull.net> <49F5EC83.6050608@mellanox.com> Message-ID: <49F6FA4D.2080007@bull.net> Hi, I'm working with Celine trying to make NFS RDMA work. We installed a new firmware (2.6.636). We still have the problem but now we have more information on client side. - With the workaround (memreg 6) we can mount without any problem. We can read a file but if we try to create a file with dd, application hangs and then we have to do 'umount -f'. There is no message on server. Message on client: rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 ird 16 rpcrdma: connection to 192.168.0.215:2050 closed (-103) - With fast registration: There is no message on server. 
dmesg client output with fast registration: rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16 rpcrdma: connection to 192.168.0.215:2050 closed (-103) rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16 ------------[ cut here ]------------ WARNING: at kernel/softirq.c:136 local_bh_enable_ip+0x3c/0x92() Modules linked in: xprtrdma autofs4 hidp nfs lockd nfs_acl rfcomm l2cap bluetooth sunrpc iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables cpufreq_ondemand acpi_cpufreq freq_table rdma_ucm ib_sdp rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_sa ipv6 ib_uverbs ib_umad iw_nes ib_ipath ib_mthca dm_multipath scsi_dh raid0 sbs sbshc battery acpi_memhotplug ac parport_pc lp parport mlx4_ib ib_mad ib_core e1000e sr_mod joydev cdrom mlx4_core i5000_edac edac_core shpchp rtc_cmos sg pcspkr rtc_core rtc_lib i2c_i801 i2c_core serio_raw button dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode] Pid: 0, comm: swapper Not tainted 2.6.27_ofa_compil #2 Call Trace: [] warn_on_slowpath+0x51/0x77 [] __wake_up+0x38/0x4f [] __wake_up_bit+0x28/0x2d [] rpc_wake_up_task_queue_locked+0x223/0x24b [sunrpc] [] rpc_wake_up_status+0x47/0x82 [sunrpc] [] local_bh_enable_ip+0x3c/0x92 [] rpcrdma_conn_func+0x6d/0x7c [xprtrdma] [] rpcrdma_qp_async_error_upcall+0x45/0x5a [xprtrdma] [] mlx4_ib_qp_event+0xf9/0x100 [mlx4_ib] [] __queue_work+0x22/0x32 [] mlx4_qp_event+0x8a/0xad [mlx4_core] [] mlx4_eq_int+0x55/0x291 [mlx4_core] [] mlx4_msi_x_interrupt+0xf/0x16 [mlx4_core] [] handle_IRQ_event+0x25/0x53 [] handle_edge_irq+0xe3/0x123 [] do_IRQ+0xf1/0x15e [] ret_from_intr+0x0/0xa [] nul_marshal+0x0/0x20 [sunrpc] [] mwait_idle+0x41/0x45 [] cpu_idle+0x7e/0x9c ---[ end trace 5cc994fbe7e141af ]--- rpcrdma: connection to 192.168.0.215:2050 closed (-103) rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16 rpcrdma: 
connection to 192.168.0.215:2050 closed (-103) Thanks, Diego Vu Pham wrote: > Celine Bourde wrote: >> We have still the same problem, even changing the registration method. >> >> mount doesn't reply and this is the output of dmesg on client: >> >> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 >> ird 16 >> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 >> ird 16 >> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >> ib0: multicast join failed for >> ff12:401b:ffff:0000:0000:0000:0000:0001, status -22 >> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 >> ird 16 >> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >> >> I have still another doubt: if the firmware is the problem, why is NFS >> RDMA working with a kernel 2.6.27.10 and without OFED 1.4 with these >> same cards?? > > On 2.6.27.10 nfsrdma does not use fast registration work request; > therefore, it works well with connectX > > From 2.6.28 and so on, nfsrdma start implementing/using fast > registration work request and commit without verifying it with connectX > > I'm looking and trying to resolve those glitches/issues now > > -vu > >> >> Thanks, >> >> Céline Bourde. >> >> Tom Talpey wrote: >>> At 06:56 AM 4/27/2009, Celine Bourde wrote: >>> >>>> Thanks for the explanation. >>>> Let me know if you have additional information. >>>> >>>> We have a contact at Mellanox. I will contact him. >>>> >>>> Thanks, >>>> >>>> Céline. >>>> >>>> Vu Pham wrote: >>>> >>>>> Celine, >>>>> >>>>> I'm seeing mlx4 in the log so it is connectX. >>>>> >>>>> nfsrdma does not work with any official connectX' fw release 2.6.0 >>>>> because of fast registering work request problems between nfsrdma >>>>> and the firmware. >>>>> >>> >>> There is a very simple workaround if you don't have the latest mlx4 >>> firmware. >>> >>> Just set the client to use the all-physical memory registration mode. 
>>> This will >>> avoid making unsupported reregistration requests, which the firmware >>> advertised. >>> >>> Before mounting, enter (as root) >>> >>> sysctl -w sunrpc.rdma_memreg_strategy=6 >>> >>> The client should work properly after this. >>> >>> If you do have access to the fixed firmware, I recommend using the >>> default >>> setting (5) as it provides greater safety on the client. >>> >>> Tom. >>> >>> >>>>> We are currently debugging/fixing those problems. >>>>> >>>>> Do you have direct contact with Mellanox field application >>>>> engineer? Please contact him/her. >>>>> If not I can send you a contact on private channel. >>>>> >>>>> thanks, >>>>> -vu >>>>> >>>>> >>>>>> Hi Celine, >>>>>> >>>>>> What HCA do you have on your system? Is it ConnectX? If yes, what >>>>>> is its firmware version? >>>>>> >>>>>> -vu >>>>>> >>>>>> >>>>>>> Hey Celine, >>>>>>> >>>>>>> Thanks for gathering all this info! So the rdma connections work >>>>>>> fine with everything _but_ nfsrdma. And errno 103 indicates the >>>>>>> connection was aborted, maybe by the server (since no failures >>>>>>> are logged by the client). >>>>>>> >>>>>>> >>>>>>> More below: >>>>>>> >>>>>>> >>>>>>> Celine Bourde wrote: >>>>>>> >>>>>>>> Hi Steve, >>>>>>>> >>>>>>>> This email summarizes the situation: >>>>>>>> >>>>>>>> Standard mount -> OK >>>>>>>> --------------------- >>>>>>>> >>>>>>>> [root at twind ~]# mount -o rw 192.168.0.215:/vol0 /mnt/ >>>>>>>> Command works fine. >>>>>>>> >>>>>>>> rdma mount -> KO >>>>>>>> ----------------- >>>>>>>> >>>>>>>> [root at twind ~]# mount -o rdma,port=2050 192.168.0.215:/vol0 /mnt/ >>>>>>>> Command blocks! I have to press Ctrl+C to kill the process. >>>>>>>> >>>>>>>> or >>>>>>>> >>>>>>>> [root at twind ofa_kernel-1.4.1]# strace mount.nfs >>>>>>>> 192.168.0.215:/vol0 /mnt/ -o rdma,port=2050 >>>>>>>> [..]
>>>>>>>> fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 >>>>>>>> connect(3, {sa_family=AF_INET, sin_port=htons(610), >>>>>>>> sin_addr=inet_addr("127.0.0.1")}, 16) = 0 >>>>>>>> fcntl(3, F_SETFL, O_RDWR) = 0 >>>>>>>> sendto(3, >>>>>>>> >>>> "-3\245\357\0\0\0\0\0\0\0\2\0\1\206\270\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0"..., >>>> >>>>>>>> 40, 0, {sa_family=AF_INET, sin_port=htons(610), >>>>>>>> sin_addr=inet_addr("127.0.0.1")}, 16) = 40 >>>>>>>> poll([{fd=3, events=POLLIN}], 1, 3000) = 1 ([{fd=3, >>>>>>>> revents=POLLIN}]) >>>>>>>> recvfrom(3, >>>>>>>> "-3\245\357\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 8800, >>>>>>>> MSG_DONTWAIT, {sa_family=AF_INET, sin_port=htons(610), >>>>>>>> sin_addr=inet_addr("127.0.0.1")}, [16]) = 24 >>>>>>>> close(3) = 0 >>>>>>>> mount("192.168.0.215:/vol0", "/mnt", "nfs", 0, >>>>>>>> "rdma,port=2050,addr=192.168.0.215" >>>>>>>> ..same problem >>>>>>>> >>>>>>>> [root at twind tmp]# dmesg >>>>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 >>>>>>>> slots 32 ird 16 >>>>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >>>>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 >>>>>>>> slots 32 ird 16 >>>>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >>>>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 >>>>>>>> slots 32 ird 16 >>>>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >>>>>>>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 >>>>>>>> slots 32 ird 16 >>>>>>>> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> Is there anything logged on the server side? >>>>>>> >>>>>>> Also, can you try this again, but on both systems do this before >>>>>>> attempting the mount: >>>>>>> >>>>>>> echo 32768 > /proc/sys/sunrpc/rpc_debug >>>>>>> >>>>>>> This will enable all the rpc trace points and add a bunch of >>>>>>> logging to /var/log/messages. >>>>>>> Maybe that will show us something. 
I think the server is >>>>>>> aborting the connection for some reason. >>>>>>> >>>>>>> Steve. >>>>>>> >>>>>>> _______________________________________________ >>>>>>> general mailing list >>>>>>> general at lists.openfabrics.org >>>>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >>>>>>> >>>>>>> To unsubscribe, please visit >>>>>>> http://openib.org/mailman/listinfo/openib-general >>>>>>> From PHF at zurich.ibm.com Tue Apr 28 05:54:37 2009 From: PHF at zurich.ibm.com (Philip Frey1) Date: Tue, 28 Apr 2009 14:54:37 +0200 Subject: [ofa-general] RPM / build environment Message-ID: Dear all, as announced earlier on this channel as well as at the Sonoma Workshop, we are adding a purely software based RDMA driver to OFED.
What is the correct/appropriate way of adding our driver to the build system consisting of the 'install.pl' perl script and the SRPMS? Many thanks for your advice, Philip -- Philip Frey IBM Zurich Research Laboratory Saumerstrasse 4 | Phone: +41 44 724 8613 CH-8803 Rueschlikon/Switzerland | Email: phf at zurich.ibm.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From tmtalpey at gmail.com Tue Apr 28 05:56:35 2009 From: tmtalpey at gmail.com (tmtalpey at gmail.com) Date: Tue, 28 Apr 2009 05:56:35 -0700 (PDT) Subject: [Fwd: Re: [ofa-general][NFS/RDMA]Can'tmountNFS/RDMApartition]] Message-ID: <49f6fd03.85c2f10a.4d46.644a@mx.google.com> In both cases the connection is being lost under load. This usually indicates a credit (slot count) mismatch, or an IRD/ORD one. What kernel version are you running on each end? Any special sysctl settings on the server? The oops on the client is troubling, but it's happening in the error upcall and resembles a problem I fixed a while back. I'll check it when I get back to a source repo. It's not the cause of the issue though. Tom. -----Original Message----- From: Diego Moreno Subj: Re: [Fwd: Re: [ofa-general][NFS/RDMA]Can'tmountNFS/RDMApartition]] Date: Tue Apr 28, 2009 8:44 am Size: 3K To: Vu Pham cc: OpenIB Hi, I'm working with Celine trying to make NFS RDMA work. We installed a new firmware (2.6.636). We still have the problem but now we have more information on client side. - With the workaround (memreg 6) we can mount without any problem. We can read a file but if we try to create a file with dd, application hangs and then we have to do 'umount -f'. There is no message on server. Message on client: rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 ird 16 rpcrdma: connection to 192.168.0.215:2050 closed (-103) - With fast registration: There is no message on server.
dmesg client output with fast registration: rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16 rpcrdma: connection to 192.168.0.215:2050 closed (-103) rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16 ------------[ cut here ]------------ WARNING: at kernel/softirq.c:136 local_bh_enable_ip+0x3c/0x92() Modules linked in: xprtrdma autofs4 hidp nfs lockd nfs_acl rfcomm l2cap bluetooth sunrpc iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables cpufreq_ondemand acpi_cpufreq freq_table rdma_ucm ib_sdp rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_sa ipv6 ib_uverbs ib_umad iw_nes ib_ipath ib_mthca dm_multipath scsi_dh raid0 sbs sbshc battery acpi_memhotplug ac parport_pc lp parport mlx4_ib ib_mad ib_core e1000e sr_mod joydev cdrom mlx4_core i5000_edac edac_core shpchp rtc_cmos sg pcspkr rtc_core rtc_lib i2c_i801 i2c_core serio_raw button dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode] Pid: 0, comm: swapper Not tainted 2.6.27_ofa_compil #2 Call Trace: [] warn_on_slowpath+0x51/0x77 [] __wake_up+0x38/0x4f [] __wake_up_bit+0x28/0x2d [] rpc_wake_up_task_queue_locked+0x223/0x24b [sunrpc] [] rpc_wake_up_status+0x47/0x82 [sunrpc] [] local_bh_enable_ip+0x3c/0x92 [] rpcrdma_conn_func+0x6d/0x7c [xprtrdma] [] rpcrdma_qp_async_error_upcall+0x45/0x5a [xprtrdma] [] mlx4_ib_qp_event+0xf9/0x100 [mlx4_ib] [] __queue_work+0x22/0x32 [] mlx4_qp_event+0x8a/0xad [mlx4_core] [] mlx4_eq_int+0x55/0x291 [mlx4_core] [] mlx4_msi_x_interrupt+0xf/0x16 [mlx4_core] [] handle_IRQ_event+0x25/0x53 [] handle_edge_irq+0xe3/0x123 [] do_IRQ+0xf1/0x15e [] ret_from_intr+0x0/0xa [] nul_marshal+0x0/0x20 [sunrpc] [] mwait_idle+0x41/0x45 [] cpu_idle+0x7e/0x9c ---[ end trace 5cc994fbe7e141af ]--- rpcrdma: connection to 192.168.0.215:2050 closed (-103) rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16 rpcrdma: 
connection to 192.168.0.215:2050 closed (-103) Thanks, Diego Vu Pham wrote: > Celine Bourde wrote: >> We have still the same problem, even changing the registration method. >> >> mount doesn't reply and this is the output of dmesg on client: >> >> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 >> ird 16 >> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 >> ird 16 --- message truncated --- From Diego.Moreno-Lazaro at bull.net Tue Apr 28 06:07:50 2009 From: Diego.Moreno-Lazaro at bull.net (Diego Moreno) Date: Tue, 28 Apr 2009 15:07:50 +0200 Subject: [Fwd: Re: [ofa-general][NFS/RDMA]Can'tmountNFS/RDMApartition]] In-Reply-To: <49f6fd03.85c2f10a.4d46.644a@mx.google.com> References: <49f6fd03.85c2f10a.4d46.644a@mx.google.com> Message-ID: <49F6FFA6.3040608@bull.net> Hi Tom, I'm running 2.6.27.10 vanilla kernel but I'll try with 2.6.29. Thanks, Diego Sysctl config on server: [root at twing ~]# cat /etc/sysctl.conf # Kernel sysctl configuration file for Red Hat Linux # # For binary values, 0 is disabled, 1 is enabled. See sysctl(8) and # sysctl.conf(5) for more details. 
# Controls IP packet forwarding net.ipv4.ip_forward = 0 # Controls source route verification net.ipv4.conf.default.rp_filter = 1 # Do not accept source routing net.ipv4.conf.default.accept_source_route = 0 # Controls the System Request debugging functionality of the kernel kernel.sysrq = 0 # Controls whether core dumps will append the PID to the core filename # Useful for debugging multi-threaded applications kernel.core_uses_pid = 1 # Controls the use of TCP syncookies net.ipv4.tcp_syncookies = 1 # Controls the maximum size of a message, in bytes kernel.msgmnb = 65536 # Controls the default maxmimum size of a mesage queue kernel.msgmax = 65536 # Controls the maximum shared segment size, in bytes kernel.shmmax = 68719476736 # Controls the maximum number of shared memory segments, in pages kernel.shmall = 4294967296 ## MLX4_EN tuning parameters ## net.ipv4.tcp_timestamps = 0 net.ipv4.tcp_sack = 0 net.core.netdev_max_backlog = 250000 net.core.rmem_max = 16777216 net.core.wmem_max = 16777216 net.core.rmem_default = 16777216 net.core.wmem_default = 16777216 net.core.optmem_max = 16777216 net.ipv4.tcp_mem = 16777216 16777216 16777216 net.ipv4.tcp_rmem = 4096 87380 16777216 net.ipv4.tcp_wmem = 4096 65536 16777216 ## END MLX4_EN ## tmtalpey at gmail.com wrote: > In both cases the connection is being lost under load. This usually indicates a credit (slot count) mismatch, or an IRD/ORD one. What kernel version are you running on each end? Any special sysctl settings on the server? > > The oops on the client is troubling, but it,s happening in the error upcall and resembles a problem I fixed a while back. I'll check it when I get back to a source repo. It's not the cause of the issue though. > > Tom. > > > -----Original Message----- > > From: Diego Moreno > Subj: Re: [Fwd: Re: [ofa-general][NFS/RDMA]Can'tmountNFS/RDMApartition]] > Date: Tue Apr 28, 2009 8:44 am > Size: 3K > To: Vu Pham > cc: OpenIB > > Hi, > > I'm working with Celine trying to make NFS RDMA work. 
We installed a new > firmware (2.6.636). We still have the problem but now we have more > information on client side. > > - With the workaround (memreg 6) we can mount without any problem. We > can read a file but if we try to create a file with dd, application > hangs and then we have to do 'umount -f'. There is no message on server. > Message on client: > > rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 > ird 16 > rpcrdma: connection to 192.168.0.215:2050 closed (-103) > > > - With fast registration: > > There is no message on server. dmesg client output with fast registration: > > > rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 > ird 16 > rpcrdma: connection to 192.168.0.215:2050 closed (-103) > rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 > ird 16 > ------------[ cut here ]------------ > WARNING: at kernel/softirq.c:136 local_bh_enable_ip+0x3c/0x92() > Modules linked in: xprtrdma autofs4 hidp nfs lockd nfs_acl rfcomm l2cap > bluetooth sunrpc iptable_filter ip_tables ip6t_REJECT xt_tcpudp > ip6table_filter ip6_tables x_tables cpufreq_ondemand acpi_cpufreq > freq_table rdma_ucm ib_sdp rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_sa > ipv6 ib_uverbs ib_umad iw_nes ib_ipath ib_mthca dm_multipath scsi_dh > raid0 sbs sbshc battery acpi_memhotplug ac parport_pc lp parport mlx4_ib > ib_mad ib_core e1000e sr_mod joydev cdrom mlx4_core i5000_edac edac_core > shpchp rtc_cmos sg pcspkr rtc_core rtc_lib i2c_i801 i2c_core serio_raw > button dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage ata_piix > libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last > unloaded: microcode] > Pid: 0, comm: swapper Not tainted 2.6.27_ofa_compil #2 > > Call Trace: > [] warn_on_slowpath+0x51/0x77 > [] __wake_up+0x38/0x4f > [] __wake_up_bit+0x28/0x2d > [] rpc_wake_up_task_queue_locked+0x223/0x24b [sunrpc] > [] rpc_wake_up_status+0x47/0x82 [sunrpc] > [] local_bh_enable_ip+0x3c/0x92 > [] 
rpcrdma_conn_func+0x6d/0x7c [xprtrdma] > [] rpcrdma_qp_async_error_upcall+0x45/0x5a [xprtrdma] > [] mlx4_ib_qp_event+0xf9/0x100 [mlx4_ib] > [] __queue_work+0x22/0x32 > [] mlx4_qp_event+0x8a/0xad [mlx4_core] > [] mlx4_eq_int+0x55/0x291 [mlx4_core] > [] mlx4_msi_x_interrupt+0xf/0x16 [mlx4_core] > [] handle_IRQ_event+0x25/0x53 > [] handle_edge_irq+0xe3/0x123 > [] do_IRQ+0xf1/0x15e > [] ret_from_intr+0x0/0xa > [] nul_marshal+0x0/0x20 [sunrpc] > [] mwait_idle+0x41/0x45 > [] cpu_idle+0x7e/0x9c > > ---[ end trace 5cc994fbe7e141af ]--- > rpcrdma: connection to 192.168.0.215:2050 closed (-103) > rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 > ird 16 > rpcrdma: connection to 192.168.0.215:2050 closed (-103) > > > Thanks, > > Diego > > Vu Pham wrote: >> Celine Bourde wrote: >>> We have still the same problem, even changing the registration method. >>> >>> mount doesn't reply and this is the output of dmesg on client: >>> >>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 >>> ird 16 >>> rpcrdma: connection to 192.168.0.215:2050 closed (-103) >>> rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 >>> ird 16 > > --- message truncated --- > > > > From rdreier at cisco.com Tue Apr 28 07:00:41 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 28 Apr 2009 07:00:41 -0700 Subject: [ofa-general] RPM / build environment In-Reply-To: (Philip Frey1's message of "Tue, 28 Apr 2009 14:54:37 +0200") References: Message-ID: > as announced earlier on this channel as well as at the Sonoma Workshop, > we are adding a purely software based RDMA driver to OFED. > What is the correct/appropriate way of adding our driver to the build > system > consisting of the 'install.pl' perl script and the SRPMS? Wouldn't posting patches for review and eventual merge to the upstream kernel be a better first step, rather than worrying about the OFED build scripts? - R. 
From jgunthorpe at obsidianresearch.com Tue Apr 28 11:15:36 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Tue, 28 Apr 2009 12:15:36 -0600 Subject: [ofa-general] Re: How to tell what OFED rev a distro derived IB modules? In-Reply-To: <200904281101.38009.jackm@dev.mellanox.co.il> References: <200904270927.40449.jackm@dev.mellanox.co.il> <20090427165555.GJ4431@obsidianresearch.com> <200904281101.38009.jackm@dev.mellanox.co.il> Message-ID: <20090428181536.GN4431@obsidianresearch.com> On Tue, Apr 28, 2009 at 11:01:37AM +0300, Jack Morgenstein wrote: > On Monday 27 April 2009 19:55, Jason Gunthorpe wrote: > > OFED tests the past - back ports to old distributions and a random > > non-upstream collection of patches ontop of that. That is fine for end > > users, but.. > That is not quite the case. We do test regression on the base kernel of > a given OFED distribution (That is -- on a system which runs the base kernel, > but installing OFED, which does include kernel_patches/fixes), so we do > verify that OFED does run properly on its base kernel. We are not testing > just the past. But by the time OFED starts that testing the base kernel version is in the past, and if any bugs are found they can't really be fixed upstream. Plus patching OFED into the base kernel kinda defeats the entire point.. The point is to test the current Linus releases so that when the distributions fork them off they are good, working and compatible with userspace. 
Jason From dennis.portello at gmail.com Tue Apr 28 14:04:32 2009 From: dennis.portello at gmail.com (Dennis Portello) Date: Tue, 28 Apr 2009 17:04:32 -0400 Subject: [ofa-general] ***SPAM*** Re: IB Bonding errors with recent kernel In-Reply-To: <49EDB644.8040604@Voltaire.COM> References: <52436c7f0904200421s53d65a31vdcbb26babd9be196@mail.gmail.com> <49EC6505.40406@Voltaire.COM> <52436c7f0904200629q447d969cwbe12ce8bb606584b@mail.gmail.com> <49EDB644.8040604@Voltaire.COM> Message-ID: <52436c7f0904281404ycc353f2k95fb6d8168e28276@mail.gmail.com> Hi Moni, Thank you for looking into this. The discovery of multicast not working with bonding caused a major course correction in my project, I haven't checked emails from the list in a few days. I expect to verify if bonding works as you described later this week. Unfortunately, bonding as you described will not work in my situation since we use Ethernet bonding as well. I hope to revisit IPoIB at a later time. Thanks again, Dennis P. On Tue, Apr 21, 2009 at 8:04 AM, Moni Shoua wrote: > Dennis Portello wrote: > > I can confirm that this issue exists beyond Redhat 4, I'm using Ubuntu > > 8.10 (2.6.27). > > > > I'm using ib-bond and I've also tried adding he bonds directly with > > > > echo +bond0 > /sys/class/net/bonding_masters > > echo 1 > /sys/class/net/bond0/bonding/mode > > echo 100 > /sys/class/net/bond0/bonding/miimon > > echo +ib0 > /sys/class/net/bond0/bonding/slaves > > echo +ib1 > /sys/class/net/bond0/bonding/slaves > > ifconfig bond0 192.168.47.102/24 > > route add -net 224.0.0.0/3 gw 192.168.47.100 > > > I guess that what you see is a result of 2 issues. > First, a garbage multicast addresses that is passed to ib0 by bond0 > The second, a garbage mulicast address in the list of mcast addresses of > interface ib0 prevents other legal addresses from joining the mcast group. 
> > To avoid this (at least as a workaround) you should make sure that > interface bond0 won't be up before it has ib slaves > or in other words, bond0 was never up between 'modprobe bonding' and 'echo > +ib0 > /sys/class/net/bond0/bonding/slaves' > > Let me know if this helps > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jsquyres at cisco.com Tue Apr 28 14:31:41 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Tue, 28 Apr 2009 17:31:41 -0400 Subject: [ofa-general] Re: New proposal for memory management In-Reply-To: References: Message-ID: <3E350D1F-BADB-4B7B-8307-BE0DA085DBF3@cisco.com> Is anyone going to comment on this? I'm surprised / disappointed that it's been over 2 weeks with *no* comments. Roland can't lead *every* discussion... On Apr 13, 2009, at 12:07 PM, Jeff Squyres wrote: > The following is a proposal from several MPI implementations to the > OpenFabrics community (various MPI implementation representatives > CC'ed). The basic concept was introduced in the MPI Panel at Sonoma > (see http://www.openfabrics.org/archives/spring2009sonoma/tuesday/panel3/panel3.zip) > ; it was further refined in discussions after Sonoma. > > Introduction: > ============= > > MPI has long had a problem maintaining its own verbs memory > registration cache in userspace. The main issue is that user > applications are responsible for allocating/freeing their own data > buffers -- the MPI layer does not (usually) have visibility when > application buffers are allocated or freed. Hence, MPI has had to > intercept deallocation calls in order to know when its registration > cache entries have potentially become invalid. Horrible and dangerous > tricks are used to intercept the various flavors of free, sbrk, > munmap, etc. > > Here's the classic scenario we're trying to handle better: > > 1. MPI application allocs buffer A and MPI_SENDs it > 2. MPI library registers buffer A and caches it (in user space) > 3. 
MPI application frees buffer A > 4. page containing buffer A is returned to the OS > 5. MPI application allocs buffer B > 5a. B is at the same virtual address as A, but different physical > address > 6. MPI application MPI_SENDs buffer B > 7. MPI library thinks B is already registered and sends it > --> the physical address may well still be registered, so the send > does not fail -- but it's the wrong data > > Note that the above scenario occurs because before Linux kernel > v2.6.27, the OF kernel drivers are not notified when pages are > returned to the OS -- we're leaking registered memory, and therefore > the OF driver/hardware have the wrong virtual/physical mapping. It > *may* not segv at step 7 because the OF driver/hardware can still > access the memory and it is still registered. But it will definitely > be accessing the wrong physical memory. > > In discussions before the Sonoma OpenFabrics event this year, several > MPI implementations got together and concluded that userspace > "notifier" functions might solve this issue for MPI (as proposed by > Pete Wyckoff quite a while ago). Specifically, when memory is > unregistered down in the kernel, a flag is set in userspace that > allows the userspace to know that it needs to make a [potentially > expensive] downcall to find out exactly what happened. In this way, > MPI can know when to update its registration cache safely. > > After further post-Sonoma discussion, it became evident that the > so-called userspace "notifier" functions may not solve the problem -- > there seem to be unavoidable race conditions, particularly in > multi-threaded applications (more on this below). We concluded that > what could be useful is to move the registration cache from the > userspace/MPI down into the kernel and maintain it on a per-protection > domain (PD) basis. > > Short version: > ============== > > Here's a short version of our proposal: > > 1. A new enum value is added to ibv_access_flags: IBV_ACCESS_CACHE.
> If this flag is set in the call to ibv_reg_mr(), the following > occurs down in the kernel: > - look for the memory to be registered in the PD-specific cache > - if found > - increment its refcount > - else > - try to register the memory > - if the registration fails because no more memory is available > - traverse all PD registration caches in this process, > evicting/unregistering each entry with a refcount <= 0 > - try to register the memory again > - if the registration succeeds (either the 1st or the 2nd time), > put it in the PD cache with a refcount of 1 > > If this flag is *not* set in the call to ibv_reg_mr(), then the > following occurs: > > - try to register the memory > - if the registration fails because no more registered memory is > available > - traverse all PD registration caches in this process, > evicting/unregistering each entry with a refcount <= 0 > - try to register the memory again > > If an application never uses IBV_ACCESS_CACHE, registration > performance should be no different. Registration costs may > increase slightly in some cases if there is a non-empty > registration cache. > > 2. The kernel side of the ibv_dereg_mr() deregistration call now does > the following: > - look for the memory to be deregistered in the PD's cache > - if it's in the cache > - decrement the refcount (leaving the memory registered) > - else > - unregister the memory > > 3. A new verb, ibv_is_reg(), is created to query if the entire buffer > X is already registered. If it is, increase its refcount in the > reg cache. If it is not, just return an error (and do not register > any of the buffer). > > --> An alternate proposal for this idea is to add another > ibv_access_flags value (e.g., IBV_ACCESS_IS_CACHED) instead of > a new verb. But that might be a little odd in that we don't > want the memory registered if it's not already registered. 
> > This verb is useful for pipelined protocols to offset the cost of > registration of long buffers (e.g., if the buffer is already > registered, just send it -- otherwise let the ULP potentially do > something else). See below for a more detailed explanation / use > case. > > 4. A new verb, ibv_reg_mr_limits(), is created to specify some > configuration information about the registration cache. > Configuration specifics TBD here, but one obvious possibility here > would be to specify the maximum number of pages that can be > registered by this process (which must be <= the value specified in > limits.conf, or it will fail). > > 5. A new verb, ibv_reg_mr_clean(), is created to traverse the internal > registration cache and actually de-register any item with a > refcount <= 0. The intent is to give applications the ability to > forcibly deregister any still-existing memory that has been > ibv_reg_mr(..., IBV_ACCESS_CACHE)'ed and later ibv_dereg_mr()'ed. > > These proposals assume that the new IOMMU notify system in >=2.6.27 > kernels will be used to catch when memory is returned from a process > to the kernel, and will both unregister the memory and remove it from > the kernel PD reg caches, if relevant. > > More details: > ============= > > Starting with Linux kernel v2.6.27, the OF kernel drivers can be > notified when pages are returned to the OS (I don't know if they yet > take advantage of this feature). However, we can still run into > pretty much the same scenario -- the MPI userspace registration cache > can become invalid even though the kernel is no longer leaking > registered memory. The situation is *slightly* better because the > ibv_post_send() may fail because the memory will (in a single threaded > application) likely be unregistered. > > Pete Wyckoff's solution several years ago was to add two steps into > the scenario listed above; my understanding is this is now possible > with the IOMMU notifiers in 2.6.27 (new steps 4a and 4b): > > 1.
MPI application allocs buffer A and MPI_SENDs it > 2. MPI library registers buffer A and caches it (in user space) > 3. MPI application frees buffer A > 4. page containing buffer A is returned to the OS > 4a. OF kernel driver is notified and can unregister the page > 4b. OF kernel driver can twiddle a bit in userspace indicating that > something has changed > ...etc. > > The thought here is that the MPI can register a global variable during > MPI_INIT that can be modified during step 4b. Hence, you can add a > cheap "if" statement in MPI's send path like this: > > if (variable_has_changed_indicating_step_4b_executed) { > ibv_expensive_downcall_to_find_out_what_happened(..., &output); > if (need_to_register(buffer, mpi_reg_cache, output)) { > ibv_reg_mr(buffer, ...); > } > } > ibv_post_send(...); > > You get the idea -- check the global variable before invoking > ibv_post_send() or ibv_post_recv(), and if necessary, register the > memory that MPI thought was already registered. > > But whacky situations might occur in a multithreaded application where > one thread calls free() while another thread calls malloc(), gets the > same virtual address that was just free()d but has not yet been > unregistered in the kernel, so a subsequent ibv_post_send() may > succeed but be sending the wrong data. > > Put simply: in a multi-threaded application, there's always the chance > that the notify won't get to the user-level process until after the > global notifier variable has been checked, right? Or, putting it the > other way: is there any kind of notify system that could be used that > *can't* create a potential race condition in a multi-threaded user > application? > > NOTE: There's actually some debate about whether this "bad" scenario > could actually happen -- I admit that I'm not entirely sure. > But if this race condition *can* happen, then I cannot think > of a kernel notifier system that would not have this race > condition. 
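The cheap flag check described above could be sketched roughly as follows in C11. This is only an illustration: the names mapping_changed, notifier_fires(), consume_notification() and send_path() are made up for this sketch, no such interface exists in libibverbs, and the real ibv_post_send() call is only indicated by a comment. The comment in send_path() marks the window in which the multi-threaded race discussed above could still occur.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag that the kernel notifier would flip in step 4b.
 * The name and mechanism are assumptions, not an existing verbs API. */
static atomic_bool mapping_changed = false;

/* Stand-in for the kernel side of step 4b. */
static void notifier_fires(void)
{
    atomic_store(&mapping_changed, true);
}

/* Test-and-clear: returns true if a notification arrived since the
 * last call, and resets the flag in the same atomic operation. */
static bool consume_notification(void)
{
    return atomic_exchange(&mapping_changed, false);
}

/* Send path: returns true if the registration cache had to be
 * revalidated (the expensive downcall) before posting. */
static bool send_path(const void *buf)
{
    assert(buf != NULL);
    bool revalidate = consume_notification();
    /* if (revalidate) { expensive downcall, re-register if needed } */

    /* RACE WINDOW: another thread may free() this buffer and get the
     * same virtual address back from malloc() right here, after the
     * check above but before the post below. */

    /* ... ibv_post_send(...) would go here ... */
    return revalidate;
}
```

The atomic exchange makes the check itself safe, but it does nothing about a notification that arrives after the check, which is the point of the NOTE above.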
> > So a few of us hashed this around and came up with an alternate > proposal: > > 1. Move the entire registration cache down into the kernel. > Supporting rationale: > 1a. If all ULPs (MPIs, in this case) have to implement registration > caches, why not implement it *once*, not N times? > 1b. Putting the reg cache in the kernel means that with the IOMMU > notifier system introduced in 2.6.27, the kernel can call back > to the device driver when the mapping changes so that a) the > memory can be deregistered, and b) the corresponding item can > be removed from the registration cache. Specifically: the race > condition described above can be fixed because it's all located > in one place in the kernel. > > 2. This means that the userspace process must *always* call > ibv_reg_mr() and ibv_dereg_mr() to increment / decrement the > reference counts on the kernel reg cache. But in practice, > on-demand registration/de-registration is only done for long > messages (short messages typically use > copy-to-pre-registered-buffers schemes). So the additional > ibv_reg_mr() before calling ibv_post_send() / ibv_post_recv() for > long messages shouldn't matter. > > 3. The registration cache in the kernel can lazily deregister cached > memory, as described in the "short version" discussion, above > (quite similar to what MPI's do today). > > To offset the cost of large memory registrations (because registration > is linearly proportional to the size of the buffer being registered), > pipelined protocols are sometimes used. As such, it seems useful to > have a "is this memory already registered?" verb -- a ULP can check to > see if an entire long message is already registered, and if so, do a > single large RDMA action. If not, the ULP can use a pipelined > protocol to loop over registering a portion of the buffer and then > RDMA'ing it. 
> > Possible pipelined pseudocode can look like this: > > if (ibv_is_reg(pd, buffer, len)) { > ibv_post_send(); > // will still need to ibv_dereg_mr() after completion > } else { > // pipeline loop > for (i = 0; ...) { > ibv_reg_mr(pd, buffer + i*pipeline_size, > pipeline_size, IBV_ACCESS_CACHE); > ibv_post_send(...); > } > } > > The rationale here is that these verbs allow the flexibility of doing > something like the above scenario or just registering the whole long > buffer and sending it immediately: > > ibv_reg_mr(pd, buffer, len, IBV_ACCESS_CACHE); > ibv_post_send(...); > > It may also be useful to programmatically enforce some limits on a given > PD's registration cache. A per-process limit is already enforced via > /etc/security/limits.conf, but it may be useful to specify per-PD > limits in the ULP (MPI) itself. Note that most MPI's have controls > like this already; it's consistent with moving the registration cache > down to the kernel. A proposal for the verb could be: > > ibv_reg_mr_cache_limits(pd, max_num_pages) > > Another userspace-accessible verb that may be useful is one that > traverses a PD's reg cache and actually deregisters any item with a > refcount <= 0. This allows a ULP to "clean out" any lingering > registrations, thereby freeing up registered memory for other uses > (e.g., being registered by another PD). This verb can have a > simplistic interface: > > ibv_reg_mr_clean(pd) > > It's not 100% clear that we need this "clean" verb -- if ibv_reg_mr() > will evict entries with <= 0 refcounts from any PD's registration > cache in this process, that might be enough. However, using verbs > registered memory with other (non-verbs) pinned memory in the same > process may make this verb necessary. > > ----- > > Finally, it should be noted that with 2.6.27's IOMMU notify system, > full on-demand paging / registering seems possible.
On-demand paging > would be a full, complete solution -- the ULP wouldn't have to worry > about registering / de-registering memory at all (the existing > de/registration verbs could become no-ops for backwards > compatibility). I assume that a proposal along these lines this would > be a [much] larger debate in the OpenFabrics community, and further > assume that the proposal above would be a smaller debate and actually > have a chance of being implemented in the not-distant future. > > (/me puts on fire suit) > > Thoughts? > > -- > Jeff Squyres > Cisco Systems > -- Jeff Squyres Cisco Systems From ralph.campbell at qlogic.com Tue Apr 28 15:11:08 2009 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Tue, 28 Apr 2009 15:11:08 -0700 Subject: [ofa-general] Re: New proposal for memory management In-Reply-To: <3E350D1F-BADB-4B7B-8307-BE0DA085DBF3@cisco.com> References: <3E350D1F-BADB-4B7B-8307-BE0DA085DBF3@cisco.com> Message-ID: <1240956668.3403.324.camel@chromite.mv.qlogic.com> On Tue, 2009-04-28 at 14:31 -0700, Jeff Squyres wrote: > Is anyone going to comment on this? I'm surprised / disappointed that > it's been over 2 weeks with *no* comments. > > Roland can't lead *every* discussion... > > > On Apr 13, 2009, at 12:07 PM, Jeff Squyres wrote: > > > The following is a proposal from several MPI implementations to the > > OpenFabrics community (various MPI implementation representatives > > CC'ed). The basic concept was introduced in the MPI Panel at Sonoma > > (see http://www.openfabrics.org/archives/spring2009sonoma/tuesday/panel3/panel3.zip) > > ; it was further refined in discussions after Sonoma. > > > > Introduction: > > ============= > > > > MPI has long had a problem maintaining its own verbs memory > > registration cache in userspace. The main issue is that user > > applications are responsible for allocating/freeing their own data > > buffers -- the MPI layer does not (usually) have visibility when > > application buffers are allocated or freed. 
Hence, MPI has had to > > intercept deallocation calls in order to know when its registration > > cache entries have potentially become invalid. Horrible and dangerous > > tricks are used to intercept the various flavors of free, sbrk, > > munmap, etc. > > > > Here's the classic scenario we're trying to handle better: > > > > 1. MPI application allocs buffer A and MPI_SENDs it > > 2. MPI library registers buffer A and caches it (in user space) > > 3. MPI application frees buffer A The memory is pinned so the OS isn't going to actually free the memory. By "alloc" and "free" I assume you mean malloc()/free() or any other call which might increase the memory footprint of an application. The MPI library needs to ibv_dereg_mr() the buffer before it can be actually freed and the pages returned to the free pool. > > 4. page containing buffer A is returned to the OS > > 5. MPI application allocs buffer B > > 5a. B is at the same virtual address as A, but different physical > > address > > 6. MPI application MPI_SENDs buffer B > > 7. MPI library thinks B is already registered and sends it > > --> the physical address may well still be registered, so the send > > does not fail -- but it's the wrong data Ah, free() just puts the buffer on a free list and a subsequent malloc() can return it. The application isn't aware of the MPI library calling ibv_reg_mr() and the MPI library isn't aware of the application reusing the buffer differently. The virtual to physical mapping can't change while it is pinned so buffer B should have been written with new data overwriting the same physical pages that buffer A used. I would assume the application would wait for the MPI_isend() to complete before freeing the buffer so it shouldn't be the case that the same buffer is in the process of being sent when the application overwrites the address and tries to send it again. 
> > Note that the above scenario occurs because before Linux kernel > > v2.6.27, the OF kernel drivers are not notified when pages are > > returned to the OS -- we're leaking registered memory, and therefore > > the OF driver/hardware have the wrong virtual/physical mapping. It > > *may* not segv at step 7 because the OF driver/hardware can still > > access the memory and it is still registered. But it will definitely > > be accessing the wrong physical memory. Well, the driver can register for callbacks when the mapping changes but most HCA drivers aren't going to be able to use it. The problem is that once a memory region is created, there is no way the driver knows when an incoming or outgoing DMA might try to reference that address. There would need to be a way to suspend DMAs, change the mapping, and then allow DMAs to continue. The CPU equivalent is a TLB flush after changing the page table memory. The whole area of page pinning, mapping, unmapping, etc. between the application, MPI library, OS, and driver is very complex and I don't think can be designed easily via email. I wasn't at the Sonoma conference so I don't know what was discussed. The "ideal" from MPI library perspective is to not have to worry about memory registrations and have the HCA somehow share the user application's page table, faulting in IB to physical address mappings as needed. That involves quite a bit of hardware support as well as the changes in 2.6.27. 
From rdreier at cisco.com Tue Apr 28 16:03:49 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 28 Apr 2009 16:03:49 -0700 Subject: [ofa-general] [GIT PULL] please pull infiniband.git Message-ID: Linus, please pull from master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This tree is also available from kernel.org mirrors at: git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus This will get a batch of changes for -rc2, mostly low-level hardware driver fixes with a few other miscellaneous fixes and an IPoIB documentation update. Chien Tung (8): RDMA/nes: Fix compiler warning at nes_verbs.c:1955 RDMA/nes: Modify thermo mitigation to flip SerDes1 ref clk to internal RDMA/nes: Correct CDR loop filter setting for port 1 RDMA/nes: Enable repause timer for port 1 RDMA/nes: Set trace length to 1 inch for SFP_D RDMA/nes: Fix fw_ver in /sys RDMA/nes: Fix unused variable compile warning when INFINIBAND_NES_DEBUG=n RDMA/nes: Update iw_nes version Don Wood (1): RDMA/nes: Fix bugs in nes_reg_phys_mr() Faisal Latif (5): RDMA/nes: Do not set apbvt entry for loopback RDMA/nes: Check for sequence number wrap-around RDMA/nes: Increase rexmit timeout interval RDMA/nes: Fix hang issues for large cluster dynamic connections RDMA/nes: Fix error path in nes_accept() Jack Morgenstein (1): IB/mthca: Fix timeout for INIT_HCA and a few other commands Matt Kraai (1): RDMA/nes: Remove root_256()'s unused pbl_count_256 parameter Miroslaw Walukiewicz (1): RDMA/nes: Fix resource issues in nes_create_cq() and nes_destroy_cq() Nicolas Morey-Chaisemartin (1): mlx4_core: Fix memory leak in mlx4_enable_msi_x() Roland Dreier (1): Merge branches 'cxgb3', 'ipoib', 'mthca', 'mlx4' and 'nes' into for-linus Steve Wise (2): RDMA/cxgb3: Adjust ORD/IRD (if needed) for peer2peer connections RDMA/cxgb3: Don't zero QP attrs when moving to IDLE Yossi Etigin (1): IPoIB: Disable NAPI while CQ is being drained drivers/infiniband/hw/cxgb3/iwch_cm.c | 8 +++ 
drivers/infiniband/hw/cxgb3/iwch_qp.c | 1 - drivers/infiniband/hw/mthca/mthca_cmd.c | 16 +++--- drivers/infiniband/hw/nes/nes.h | 4 +- drivers/infiniband/hw/nes/nes_cm.c | 84 ++++++++++++++-------------- drivers/infiniband/hw/nes/nes_cm.h | 1 + drivers/infiniband/hw/nes/nes_hw.c | 30 ++++++---- drivers/infiniband/hw/nes/nes_verbs.c | 67 +++++++++++++++-------- drivers/infiniband/hw/nes/nes_verbs.h | 1 + drivers/infiniband/ulp/ipoib/ipoib_ib.c | 6 ++- drivers/infiniband/ulp/ipoib/ipoib_main.c | 5 +-- drivers/net/mlx4/main.c | 2 +- 12 files changed, 130 insertions(+), 95 deletions(-) From Ted.Kim at Sun.COM Tue Apr 28 16:07:05 2009 From: Ted.Kim at Sun.COM (Ted H. Kim) Date: Tue, 28 Apr 2009 16:07:05 -0700 Subject: [ofa-general] Does IPonIB-CM consolidate connections? Message-ID: <49F78C19.1060603@sun.com> Folks, Suppose you have two nodes A & B. Node A opens a connection to B using IPonIB Connected Mode. Later, when B wants to send to A, does the IPonIB-CM code notice that a connection already exists between the two nodes and use it? or does it open a new connection? Also RFC 4755 section 3.3 talks about how you can deal with crossing REQ messages so that you only end up with one connection. Does OFED IPonIB-CM do this? Thanks, -ted -- Ted H. Kim Sun Microsystems, Inc. ted.kim at sun.com 222 North Sepulveda Blvd., 10th Floor (310) 341-1116 El Segundo, CA 90245 (310) 341-1120 FAX From rdreier at cisco.com Tue Apr 28 16:15:56 2009 From: rdreier at cisco.com (Roland Dreier) Date: Tue, 28 Apr 2009 16:15:56 -0700 Subject: [ofa-general] Does IPonIB-CM consolidate connections? In-Reply-To: <49F78C19.1060603@sun.com> (Ted H. Kim's message of "Tue, 28 Apr 2009 16:07:05 -0700") References: <49F78C19.1060603@sun.com> Message-ID: > Suppose you have two nodes A & B. > Node A opens a connection to B using > IPonIB Connected Mode. Later, when B > wants to send to A, does the IPonIB-CM > code notice that a connection already > exists between the two nodes and > use it?
or does it open a new connection? If you're talking about the Linux kernel implementation, no it doesn't reuse existing connections. It creates a new one. I don't know about Solaris or any other OS. > Also RFC 4755 section 3.3 talks about > how you can deal with crossing REQ > messages so you that only end up with > one connection. Does OFED IPonIB-CM > do this? Not that I know of. - R. From jsquyres at cisco.com Tue Apr 28 18:10:35 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Tue, 28 Apr 2009 21:10:35 -0400 Subject: [ofa-general] Re: New proposal for memory management In-Reply-To: <1240956668.3403.324.camel@chromite.mv.qlogic.com> References: <3E350D1F-BADB-4B7B-8307-BE0DA085DBF3@cisco.com> <1240956668.3403.324.camel@chromite.mv.qlogic.com> Message-ID: On Apr 28, 2009, at 6:11 PM, Ralph Campbell wrote: > Ah, free() just puts the buffer on a free list and a subsequent > malloc() > can return it. The application isn't aware of the MPI library calling > ibv_reg_mr() > Right. > and the MPI library isn't aware of the application > reusing the buffer differently. > The virtual to physical mapping can't change while it is pinned > so buffer B should have been written with new data overwriting > the same physical pages that buffer A used. > I would assume the application would wait for the MPI_isend() to > complete before freeing the buffer so it shouldn't be the case that > the same buffer is in the process of being sent when the application > overwrites the address and tries to send it again. > This is not the problem. An MPI program that re-uses a buffer that is in use in an ongoing non- blocking send operation is clearly erroneous. Perhaps my explanations were incorrect and you kernel gurus can educate me. 
What I know can happen is: - MPI application alloc's buffer A and gets virtual address B back, corresponding to physical address C - MPI application calls MPI_SEND with A - MPI implementation registers buffer A, and caches that address B is registered, and then does the send - MPI application frees buffer A - MPI implementation does *NOT* unregister buffer A - MPI application alloc's buffer X and gets virtual address *B* back, corresponding to physical address Z (Z!=C) - MPI application calls MPI_SEND with X - MPI implementation sees virtual address B in its cache and thinks that it is already registered... badness ensues Note that the virtual addresses are the same, but the physical addresses are different. This can, and does, happen. It makes it impossible to tell the buffer apart in userspace -- MPI cannot tell that the buffer is not already pinned (because according to MPI's internal cache, it *is* registered already). The only way to hack around this is for the MPI implementation to intercept free/sbrk/ whatever (horrors!) so that it can a) know to unregister the buffer and b) remove the address from its "already registered" cache. It's quite possible that I don't know why this happens, or stated the wrong reasons why. But it definitely does happen. > > > Note that the above scenario occurs because before Linux kernel > > > v2.6.27, the OF kernel drivers are not notified when pages are > > > returned to the OS -- we're leaking registered memory, and > therefore > > > the OF driver/hardware have the wrong virtual/physical mapping. > It > > > *may* not segv at step 7 because the OF driver/hardware can still > > > access the memory and it is still registered. But it will > definitely > > > be accessing the wrong physical memory. > > Well, the driver can register for callbacks when the mapping changes > but most HCA drivers aren't going to be able to use it. 
> The problem is that once a memory region is created, there is no way > the driver knows when an incoming or outgoing DMA might try to > reference that address. > Wouldn't it be an erroneous program that tried to use a region after free()'ing it? > There would need to be a way to suspend DMAs, > change the mapping, and then allow DMAs to continue. > The CPU equivalent is a TLB flush after changing the page table > memory. > > The whole area of page pinning, mapping, unmapping, etc. between the > application, MPI library, OS, and driver is very complex and I don't > think can be designed easily via email. > The conversation needs to start somewhere. MPI is verbs' biggest customer; this is a major pain point for all of us. Can't we fix it? Do you need something more than a specific use case and API proposal to start the conversation? No one has money to travel; the bi-weekly EWG call is for discussing bugs. What other vehicle do you suggest for this discussion? I'd consider this issue to be in the top 3 major roadblocks of verbs adoption to developers other than those of us who write MPI implementations. > I wasn't at the Sonoma > conference so I don't know what was discussed. > Only the problem was discussed. It was hypothesized that Pete Wyckoff's "tweak a bit in userspace when something changes" notifier interface would fix the problem, but per my mail, after more post-Sonoma discussion, we think that it's not sufficient. My Sonoma slides are here: http://www.openfabrics.org/archives/spring2009sonoma/tuesday/panel3/panel3.zip http://www.openfabrics.org/archives/spring2009sonoma/wednesday/panel1/panel1.zip > The "ideal" from > MPI library perspective is to not have to worry about memory > registrations and have the HCA somehow share the user application's > page table, faulting in IB to physical address mappings as needed. > That involves quite a bit of hardware support as well as the changes > in 2.6.27.
> Understood -- but as I stated in my mail, I assume that such a change is a long way off (particularly since it needs some kind of hardware support). Moving the registration cache down into the kernel seems do-able. Why not try to tackle this [enormous] problem? -- Jeff Squyres Cisco Systems From swise at opengridcomputing.com Tue Apr 28 18:32:00 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Tue, 28 Apr 2009 20:32:00 -0500 Subject: [ofa-general] Re: dapl attribute bug In-Reply-To: <4989F89D.8020905@opengridcomputing.com> References: <49871E6A.9000901@opengridcomputing.com> <4989F89D.8020905@opengridcomputing.com> Message-ID: <49F7AE10.70904@opengridcomputing.com> Hey Arlin, Did this ever get fixed? I think UNH is seeing this issue still. Steve Wise wrote: > Davis, Arlin R wrote: >> >> >> >>> The DAPL dat_ia_attr->max_lmr_block_size is a u32, yet the dapl code >>> maps this to the linux ib_device_attr->max_mr_size which is u64. >>> >>> This causes dapltest to fail in some cases when running over chelsio >>> which sets max_mr_size to 0x100000000 (4GB). The dapl code >>> truncates the value to 0. See dapl/openib_cma/dapl_ib_util.c. >>> >>> I'm not sure what the fix should be, but maybe the dapl code should >>> set anything over 32 bits to 0xffffffff? >>> >>> >> >> This attribute changed with DAT 2.0 to match the 32-bit ibv_sge >> length field. Since there are no direct max lmr segments mappings >> I will need to add some checks when setting max_lmr_block_size from >> max_mr_size. Thanks. >> >> -arlin > > I'll test your fix when it's ready. Lemme know. > > > Steve.
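For reference, the saturating conversion suggested in the quoted text ("set anything over 32 bits to 0xffffffff") could be sketched in C as below. This is only an illustration of that suggestion, not the fix that went into dapl; the helper name clamp_mr_size_to_u32() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The destination field is narrower than the source, so a plain
 * assignment silently drops the upper 32 bits. */
static_assert(sizeof(uint32_t) < sizeof(uint64_t), "u32 must be narrower than u64");

/* Hypothetical helper: saturate a 64-bit max_mr_size into a 32-bit
 * field instead of truncating it. */
static uint32_t clamp_mr_size_to_u32(uint64_t max_mr_size)
{
    return max_mr_size > UINT32_MAX ? UINT32_MAX : (uint32_t)max_mr_size;
}
```

With the value from the report, a plain cast of 0x100000000 to a 32-bit integer yields 0, while the clamped conversion yields 0xffffffff.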
From weiny2 at llnl.gov Tue Apr 28 20:27:36 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 28 Apr 2009 20:27:36 -0700 Subject: [ofa-general] Issues with combined routing in smpquery Message-ID: <20090428202736.0ff049e5.weiny2@llnl.gov> Sasha, Hal, I have some hardware on which the following query does not work. 18:40:54 > ./smpquery -c nodeinfo 243 0,1 ibwarn: [22072] mad_rpc: _do_madrpc failed; dport (Lid 243 DR path slid 148; dlid 65535; 0,1) ./smpquery: iberror: failed: operation nodeinfo: node info query failed from the node I am running on. 20:08:46 > ibstat CA 'mlx4_0' CA type: MT25418 Number of ports: 2 Firmware version: 2.6.0 Hardware version: a0 Node GUID: 0x0002c9020025feb4 System image GUID: 0x0002c9020025feb7 Port 1: State: Active Physical state: LinkUp Rate: 10 Base lid: 148 LMC: 2 SM lid: 148 Capability mask: 0x0251086a Port GUID: 0x0002c9020025feb5 [snip] 19:12:10 > hostname hype137 A query on the LID alone returns this. 18:41:20 > ./smpquery nodeinfo 243 # Node info: Lid 243 [snip] NodeType:........................Switch NumPorts:........................24 SystemGuid:......................0x0008f10400400e69 Guid:............................0x0008f10400400e69 PortGuid:........................0x0008f10400400e69 [snip] And iblinkinfo is. 18:41:26 > iblinkinfo.pl -S 0x0008f10400400e69 Switch 0x0008f10400400e69 ISR9288 Voltaire sFB-12D: 243 1[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 646 10[ ] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) [snip] It looks like combined routing is not working at all except for this one query. (LID 37 is the switch which is connected to the HCA I am running on.)
18:53:18 > ./smpquery -c portinfo 37 0,1 # Port info: Lid 37 DR path slid 148; dlid 65535; 0,1 port 0 Mkey:............................0x0000000000000000 GidPrefix:.......................0xfe80000000000000 Lid:.............................148 SMLid:...........................148 [snip] All other combined routing queries I try fail. And even this one above is wrong. It is returning the data on port 6 not 1. Look at the output from the local switch. 19:12:00 > iblinkinfo.pl -R -S 0x000b8cffff004663 Switch 0x000b8cffff004663 MT47396 Infiniscale-III Mellanox Technologies: 37 1[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 108 1[ ] "hype132" ( ) 37 2[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 528 1[ ] "hype133" ( ) 37 3[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 296 1[ ] "hype134" ( ) 37 4[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 92 1[ ] "hype135" ( ) 37 5[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 144 1[ ] "hype136" ( ) This is what is connected to LID 148... 37 6[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 148 1[ ] "hype137" ( ) 37 7[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 540 1[ ] "hype138" ( ) 37 8[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 212 1[ ] "hype139" ( ) 37 9[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 532 1[ ] "hype140" ( ) 37 10[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 60 1[ ] "hype141" ( ) 37 11[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 192 1[ ] "hype142" ( ) 37 12[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 312 1[ ] "hype143" ( ) 37 13[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 647 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) 37 14[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 641 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) 37 15[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 643 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) 37 16[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 653 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) 37 17[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 637 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) 37 18[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 610 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) 37 19[ ] 
==( 4X 2.5 Gbps Active / LinkUp)==> 655 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) 37 20[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 645 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) 37 21[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 635 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) 37 22[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 651 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) 37 23[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 639 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) 37 24[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 649 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) Any idea what is going on? These were all run with a smpquery built from the current master tree. On my little test system this seems to work just fine... But not on this system. Did some older hardware not support combined DR routing? Ira -- Ira Weiny Math Programmer/Computer Scientist Lawrence Livermore National Lab weiny2 at llnl.gov From weiny2 at llnl.gov Tue Apr 28 20:55:25 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Tue, 28 Apr 2009 20:55:25 -0700 Subject: [ofa-general] Re: Issues with combined routing in smpquery In-Reply-To: <20090428202736.0ff049e5.weiny2@llnl.gov> References: <20090428202736.0ff049e5.weiny2@llnl.gov> Message-ID: <20090428205525.4ffdd778.weiny2@llnl.gov> On Tue, 28 Apr 2009 20:27:36 -0700 Ira Weiny wrote: > Sasha, Hal, > > I have some hardware on which the following query does not work. > > 18:40:54 > ./smpquery -c nodeinfo 243 0,1 > ibwarn: [22072] mad_rpc: _do_madrpc failed; dport (Lid 243 DR path slid 148; dlid 65535; 0,1) > ./smpquery: iberror: failed: operation nodeinfo: node info query failed > > from the node I am running on. 
> > 20:08:46 > ibstat > CA 'mlx4_0' > CA type: MT25418 > Number of ports: 2 > Firmware version: 2.6.0 > Hardware version: a0 > Node GUID: 0x0002c9020025feb4 > System image GUID: 0x0002c9020025feb7 > Port 1: > State: Active > Physical state: LinkUp > Rate: 10 > Base lid: 148 > LMC: 2 > SM lid: 148 > Capability mask: 0x0251086a > Port GUID: 0x0002c9020025feb5 > [snip] > > 19:12:10 > hostname > hype137 > > > A query on the LID alone returns this. > > 18:41:20 > ./smpquery nodeinfo 243 > # Node info: Lid 243 > [snip] > NodeType:........................Switch > NumPorts:........................24 > SystemGuid:......................0x0008f10400400e69 > Guid:............................0x0008f10400400e69 > PortGuid:........................0x0008f10400400e69 > [snip] > > And iblinkinfo is. > > 18:41:26 > iblinkinfo.pl -S 0x0008f10400400e69 > Switch 0x0008f10400400e69 ISR9288 Voltaire sFB-12D: > 243 1[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 646 10[ ] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > [snip] > > > It looks like combined routing is not working at all except for this one > query. (LID 37 is the switch which is connected to the HCA I am running > on.) > > 18:53:18 > ./smpquery -c portinfo 37 0,1 > # Port info: Lid 37 DR path slid 148; dlid 65535; 0,1 port 0 > Mkey:............................0x0000000000000000 > GidPrefix:.......................0xfe80000000000000 > Lid:.............................148 > SMLid:...........................148 > [snip] > > All other combined routing queries I try fail. And even this one above is > wrong. It is returning the data on port 6 not 1. Look at the output from the > local switch. 
> > 19:12:00 > iblinkinfo.pl -R -S 0x000b8cffff004663 > Switch 0x000b8cffff004663 MT47396 Infiniscale-III Mellanox Technologies: > 37 1[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 108 1[ ] "hype132" ( ) > 37 2[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 528 1[ ] "hype133" ( ) > 37 3[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 296 1[ ] "hype134" ( ) > 37 4[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 92 1[ ] "hype135" ( ) > 37 5[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 144 1[ ] "hype136" ( ) > > This is what is connected to LID 148... > 37 6[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 148 1[ ] "hype137" ( ) > > 37 7[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 540 1[ ] "hype138" ( ) > 37 8[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 212 1[ ] "hype139" ( ) > 37 9[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 532 1[ ] "hype140" ( ) > 37 10[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 60 1[ ] "hype141" ( ) > 37 11[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 192 1[ ] "hype142" ( ) > 37 12[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 312 1[ ] "hype143" ( ) > 37 13[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 647 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > 37 14[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 641 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > 37 15[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 643 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > 37 16[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 653 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > 37 17[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 637 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > 37 18[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 610 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > 37 19[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 655 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > 37 20[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 645 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > 37 21[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 635 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > 37 22[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 651 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > 37 23[ ] ==( 4X 2.5 
Gbps Active / LinkUp)==> 639 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > 37 24[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 649 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > > Any idea what is going on? These were all run with a smpquery built from the > current master tree. > > On my little test system this seems to work just fine... But not on this > system. Did some older hardware not support combined DR routing? Actually I take this back. It seems an older version of smpquery works but not this newer one. So I don't think this is a hardware issue. :-( 20:54:47 > ./smpquery -c nodeinfo 14 0,10 ibwarn: [21947] _do_madrpc: send failed; Invalid argument ibwarn: [21947] mad_rpc: _do_madrpc failed; dport (Lid 14 DR path slid 4; dlid 65535; 0,10) ./smpquery: iberror: failed: operation nodeinfo: node info query failed 20:54:52 > ./smpquery -V ./smpquery BUILD VERSION: 1.5.1_76524e3_dirty Build date: Apr 28 2009 20:47:10 20:54:55 > smpquery -c nodeinfo 14 0,10 # Node info: Lid 14 DR path 0,10 BaseVers:........................1 ClassVers:.......................1 NodeType:........................Switch NumPorts:........................24 SystemGuid:......................0x0008f10400411b19 Guid:............................0x0008f10400411b18 PortGuid:........................0x0008f10400411b18 PartCap:.........................8 DevId:...........................0x5a30 Revision:........................0x000001a1 LocalPort:.......................24 VendorId:........................0x0008f1 20:54:59 > smpquery -V smpquery BUILD VERSION: 1.3.6 Build date: Oct 13 2008 12:20:42 Ira From sfr at canb.auug.org.au Tue Apr 28 21:01:01 2009 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Wed, 29 Apr 2009 14:01:01 +1000 Subject: [ofa-general] linux-next: infiniband tree build failure Message-ID: <20090429140101.d9c7467c.sfr@canb.auug.org.au> Hi Roland, Today's linux-next build (powerpc ppc64_defconfig) produced this new warning: drivers/infiniband/hw/ehca/hcp_phyp.c: In function 
'hcp_galpas_ctor':
drivers/infiniband/hw/ehca/hcp_phyp.c:65: warning: assignment makes integer from pointer without a cast

Caused by commit 2bd93ed8b59d9bf8b918a0fa04be50482906c16b ("IB/ehca:
Remove unnecessary memory operations for userspace queue pairs").

--
Cheers,
Stephen Rothwell sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

From jgunthorpe at obsidianresearch.com Tue Apr 28 21:03:29 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Tue, 28 Apr 2009 22:03:29 -0600
Subject: [ofa-general] Re: New proposal for memory management
In-Reply-To:
References: <3E350D1F-BADB-4B7B-8307-BE0DA085DBF3@cisco.com> <1240956668.3403.324.camel@chromite.mv.qlogic.com>
Message-ID: <20090429040329.GD29727@obsidianresearch.com>

On Tue, Apr 28, 2009 at 09:10:35PM -0400, Jeff Squyres wrote:

> The conversation needs to start somewhere. MPI is verbs' biggest
> customer; this is a major pain point for all of us. Can't we fix it?
> Do you need something more than a specific use case and API proposal
> to start the conversation? No one has money to travel; the
> bi-weekly EWG call is for discussing bugs. What other vehicle do you
> suggest for this discussion?

I've often wondered, wouldn't it just be fine for MPI if the entire
process address space is kept pinned, registered and consistent with
the HCA? The process would opt in to this behavior during MPI
startup. Similar in spirit to the all physical memory registration the
kernel can do.

That seems like a much more straightforward problem than trying to fit
the fairly incompatible verbs registration API and the MPI API
together.
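As a rough illustration of the "pinned" half of this idea only, a process could opt in at startup with POSIX mlockall(). This is a sketch under stated assumptions: the names pin_address_space() and unpin_address_space() are made up here, and nothing below registers memory with the HCA or keeps it consistent, which is the harder part of the suggestion.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: pin every page the process has mapped now
 * (MCL_CURRENT) and every page it maps later (MCL_FUTURE).
 * This covers pinning only; no HCA registration is attempted.
 * Returns 0 on success, -1 on failure with errno left as set by mlockall().
 */
int pin_address_space(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        int saved = errno;  /* fprintf may clobber errno */
        /* Commonly EPERM or ENOMEM when RLIMIT_MEMLOCK is too low. */
        fprintf(stderr, "mlockall: %s\n", strerror(saved));
        errno = saved;
        return -1;
    }
    return 0;
}

/* Hypothetical helper: undo pin_address_space(). Returns 0 or -1. */
int unpin_address_space(void)
{
    return munlockall();
}
```

An MPI library could call something like pin_address_space() from its startup path and fall back to ordinary registration if it returns -1. Whether the locked-memory limit on a given system allows this is an open question.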
Jason

From sfr at canb.auug.org.au Tue Apr 28 21:04:29 2009
From: sfr at canb.auug.org.au (Stephen Rothwell)
Date: Wed, 29 Apr 2009 14:04:29 +1000
Subject: [ofa-general] Re: linux-next: infiniband tree build failure
In-Reply-To: <20090429140101.d9c7467c.sfr@canb.auug.org.au>
References: <20090429140101.d9c7467c.sfr@canb.auug.org.au>
Message-ID: <20090429140429.5b446297.sfr@canb.auug.org.au>

Hi Roland,

That should have said "warning" not "failure", sorry.

--
Cheers,
Stephen Rothwell sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

From parakie at gmail.com Tue Apr 28 22:37:25 2009
From: parakie at gmail.com (Gennadiy Nerubayev)
Date: Wed, 29 Apr 2009 01:37:25 -0400
Subject: [ofa-general] Build failures on current 1.4.1 dailies
Message-ID:

Hi all,

Running on 2.6.27.21 x64. ofa_kernel build error as follows:

-I/usr/src/redhat/BUILD/kernel-2.6.27.21/arch/x86_64/include \
-Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Os -m64 -mtune=generic -mno-red-zone -mcmodel=kernel -funit-at-a-time -maccumulate-outgoing-args -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Iinclude/asm-x86/mach-default -fno-stack-protector -fomit-frame-pointer -g -Wdeclaration-after-statement -Wno-pointer-sign -fwrapv -DMODULE -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(file)" -D"KBUILD_MODNAME=KBUILD_STR(nfs)" -c -o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/.tmp_file.o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c: In function 'nfs_write_begin':
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c:354: error: implicit declaration of function '__grab_cache_page'
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.c:354: warning: assignment makes pointer from integer without a cast make[3]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs/file.o] Error 1 make[2]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1/fs/nfs] Error 2 make[1]: *** [_module_/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4.1] Error 2 make[1]: Leaving directory `/usr/src/redhat/BUILD/kernel-2.6.27.21' make: *** [kernel] Error 2 error: Bad exit status from /var/tmp/rpm-tmp.2461 (%build) Assuming we turn off nfs stuff to go further, error number two is from infiniband-diags: checking whether to build shared libraries... yes checking whether to build static libraries... yes checking for sys_read_string in -libcommon... yes checking for umad_init in -libumad... yes checking for mad_dump_int in -libmad... no configure: error: mad_dump_int() not found. diags require libibmad. error: Bad exit status from /var/tmp/rpm-tmp.42050 (%build) I confirmed that pulling management git and compiling libs and diags from there does not have this issue, and that the libibmad.so.1 that gets compiled in the daily OFED does not have mad_dump_int(). Thanks, -Gennadiy -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From nicolas.morey-chaisemartin at ext.bull.net Wed Apr 29 02:35:37 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey-Chaisemartin) Date: Wed, 29 Apr 2009 11:35:37 +0200 Subject: [ofa-general] [PATCH] sdp: Fixed SDP to work on 2.6.29+ as sk_prot->orphan_count is a percpu_counter and not a atomic_t anymore Message-ID: <49F81F69.2050805@ext.bull.net> Signed-off-by: Nicolas Morey-Chaisemartin --- this patch goes on ofed_1_5/linux-2.6 drivers/infiniband/ulp/sdp/sdp_main.c | 16 ++++++++-------- 1 files changed, 8 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/ulp/sdp/sdp_main.c b/drivers/infiniband/ulp/sdp/sdp_main.c index 51801e0..c457b37 100644 --- a/drivers/infiniband/ulp/sdp/sdp_main.c +++ b/drivers/infiniband/ulp/sdp/sdp_main.c @@ -580,7 +580,7 @@ adjudge_to_death: /* TODO: tcp_fin_time to get timeout */ sdp_dbg(sk, "%s: entering time wait refcnt %d\n", __func__, atomic_read(&sk->sk_refcnt)); - atomic_inc(sk->sk_prot->orphan_count); + percpu_counter_inc(sk->sk_prot->orphan_count); } /* TODO: limit number of orphaned sockets. 
@@ -861,7 +861,7 @@ void sdp_cancel_dreq_wait_timeout(struct sdp_sock *ssk) sock_put(&ssk->isk.sk, SOCK_REF_DREQ_TO); } - atomic_dec(ssk->isk.sk.sk_prot->orphan_count); + percpu_counter_dec(ssk->isk.sk.sk_prot->orphan_count); } void sdp_destroy_work(struct work_struct *work) @@ -902,7 +902,7 @@ void sdp_dreq_wait_timeout_work(struct work_struct *work) sdp_sk(sk)->dreq_wait_timeout = 0; if (sk->sk_state == TCP_FIN_WAIT1) - atomic_dec(ssk->isk.sk.sk_prot->orphan_count); + percpu_counter_dec(ssk->isk.sk.sk_prot->orphan_count); sdp_exch_state(sk, TCPF_LAST_ACK | TCPF_FIN_WAIT1, TCP_TIME_WAIT); @@ -2162,9 +2162,9 @@ void sdp_urg(struct sdp_sock *ssk, struct sk_buff *skb) sk->sk_data_ready(sk, 0); } -static atomic_t sockets_allocated; +static struct percpu_counter sockets_allocated; static atomic_t memory_allocated; -static atomic_t orphan_count; +static struct percpu_counter orphan_count; static int memory_pressure; struct proto sdp_proto = { .close = sdp_close, @@ -2574,9 +2574,9 @@ static void __exit sdp_exit(void) sock_unregister(PF_INET_SDP); proto_unregister(&sdp_proto); - if (atomic_read(&orphan_count)) - printk(KERN_WARNING "%s: orphan_count %d\n", __func__, - atomic_read(&orphan_count)); + if (percpu_counter_read_positive(&orphan_count)) + printk(KERN_WARNING "%s: orphan_count %lld\n", __func__, + percpu_counter_read_positive(&orphan_count)); destroy_workqueue(sdp_workqueue); flush_scheduled_work(); -- 1.6.2.GIT From vlad at lists.openfabrics.org Wed Apr 29 03:21:32 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Wed, 29 Apr 2009 03:21:32 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090429-0200 daily build status Message-ID: <20090429102132.9FE2CE61502@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with 
linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.22 Passed on ia64 with linux-2.6.23 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From jsquyres at cisco.com Wed Apr 29 05:15:57 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 29 Apr 2009 08:15:57 -0400 Subject: [ofa-general] Re: New proposal for memory management In-Reply-To: <20090429040329.GD29727@obsidianresearch.com> 
References: <3E350D1F-BADB-4B7B-8307-BE0DA085DBF3@cisco.com> <1240956668.3403.324.camel@chromite.mv.qlogic.com> <20090429040329.GD29727@obsidianresearch.com> Message-ID: On Apr 29, 2009, at 12:03 AM, Jason Gunthorpe wrote: > I've often wondered, wouldn't it just be fine for MPI if the entire > process address space is kept pinned, registered and consistent with > the HCA? The process would opt in to this behavior during MPI > startup. Similar in spirit to the all physical memory registration the > kernel can do. > An interesting idea. As I understand your idea, you essentially have to pre-allocate memory to all MPI processes, registering all available RAM. After thinking about this a little bit, I think there are still a few problems, though: - How much memory do you give to each MPI process? (phys_ram - OS_overhead) / num_mpi_processes? What if each MPI process is not created equal -- some need more RAM than others? Does each MPI process need to know at the beginning of time the max memory that it might need in the future? That could be quite difficult to know -- it seems like an large new restriction to impose on users. - As we head towards "manycore", the above problem will get [much] worse, because I think we'll be heading back to the days of running multiple different MPI jobs on a single machine. These jobs will have no a priori knowledge of each other; if the 2nd MPI job launched on a machine needs more than (phys_ram - OS_overhead) / num_processors, how is that coordinated with the 1st MPI job that is already running on the same machine? - What about any other (non-MPI) process that needs to run? If all memory after the OS is registered / unswappable / allocated to MPI processes, then how do random processes get any memory to run? (e.g., shell scripts, daemons, ... etc.) If you simply leave X space un- register specifically for such non-MPI processes, how do you decide the value of X? 
- The preallocation/registration of memory must happen pre-main() because the first MPI function that is invoked (MPI_Init()) may not occur until well after main(), and potentially after some calls to malloc (etc.). For example, the following is a valid MPI program: int main(...) { int *a = malloc(...); MPI_Init(...); MPI_Send(a, ...); ... } Re-reading your brief text; I'm wondering if I missed the zen of what you're trying to suggest...? If I'm off the mark, can you explain more? Thanks. -- Jeff Squyres Cisco Systems From devel-ofed at morey-chaisemartin.com Wed Apr 29 07:22:40 2009 From: devel-ofed at morey-chaisemartin.com (Nicolas Morey-Chaisemartin) Date: Wed, 29 Apr 2009 16:22:40 +0200 Subject: [ofa-general] [PATCH] sdp: Fixed SDP to work on 2.6.29+ as sk_prot->orphan_count is a percpu_counter and not a atomic_t anymore In-Reply-To: <49F81F69.2050805@ext.bull.net> References: <49F81F69.2050805@ext.bull.net> Message-ID: <49F862B0.4030107@morey-chaisemartin.com> Patch makes build OK but still fails due to percpu_counters being too large to be allocated on the stack Posting a fixed version. Le 29/04/2009 11:35, Nicolas Morey-Chaisemartin a écrit : > Signed-off-by: Nicolas Morey-Chaisemartin > --- > this patch goes on ofed_1_5/linux-2.6 > > drivers/infiniband/ulp/sdp/sdp_main.c | 16 ++++++++-------- > 1 files changed, 8 insertions(+), 8 deletions(-) > > diff --git a/drivers/infiniband/ulp/sdp/sdp_main.c b/drivers/infiniband/ulp/sdp/sdp_main.c > index 51801e0..c457b37 100644 > --- a/drivers/infiniband/ulp/sdp/sdp_main.c > +++ b/drivers/infiniband/ulp/sdp/sdp_main.c > @@ -580,7 +580,7 @@ adjudge_to_death: > /* TODO: tcp_fin_time to get timeout */ > sdp_dbg(sk, "%s: entering time wait refcnt %d\n", __func__, > atomic_read(&sk->sk_refcnt)); > - atomic_inc(sk->sk_prot->orphan_count); > + percpu_counter_inc(sk->sk_prot->orphan_count); > } > > /* TODO: limit number of orphaned sockets. 
> @@ -861,7 +861,7 @@ void sdp_cancel_dreq_wait_timeout(struct sdp_sock *ssk) > sock_put(&ssk->isk.sk, SOCK_REF_DREQ_TO); > } > > - atomic_dec(ssk->isk.sk.sk_prot->orphan_count); > + percpu_counter_dec(ssk->isk.sk.sk_prot->orphan_count); > } > > void sdp_destroy_work(struct work_struct *work) > @@ -902,7 +902,7 @@ void sdp_dreq_wait_timeout_work(struct work_struct *work) > sdp_sk(sk)->dreq_wait_timeout = 0; > > if (sk->sk_state == TCP_FIN_WAIT1) > - atomic_dec(ssk->isk.sk.sk_prot->orphan_count); > + percpu_counter_dec(ssk->isk.sk.sk_prot->orphan_count); > > sdp_exch_state(sk, TCPF_LAST_ACK | TCPF_FIN_WAIT1, TCP_TIME_WAIT); > > @@ -2162,9 +2162,9 @@ void sdp_urg(struct sdp_sock *ssk, struct sk_buff *skb) > sk->sk_data_ready(sk, 0); > } > > -static atomic_t sockets_allocated; > +static struct percpu_counter sockets_allocated; > static atomic_t memory_allocated; > -static atomic_t orphan_count; > +static struct percpu_counter orphan_count; > static int memory_pressure; > struct proto sdp_proto = { > .close = sdp_close, > @@ -2574,9 +2574,9 @@ static void __exit sdp_exit(void) > sock_unregister(PF_INET_SDP); > proto_unregister(&sdp_proto); > > - if (atomic_read(&orphan_count)) > - printk(KERN_WARNING "%s: orphan_count %d\n", __func__, > - atomic_read(&orphan_count)); > + if (percpu_counter_read_positive(&orphan_count)) > + printk(KERN_WARNING "%s: orphan_count %lld\n", __func__, > + percpu_counter_read_positive(&orphan_count)); > destroy_workqueue(sdp_workqueue); > flush_scheduled_work(); > From nicolas.morey-chaisemartin at ext.bull.net Wed Apr 29 07:23:04 2009 From: nicolas.morey-chaisemartin at ext.bull.net (Nicolas Morey-Chaisemartin) Date: Wed, 29 Apr 2009 16:23:04 +0200 Subject: [ofa-general] [PATCHv2] sdp: Fixed SDP to work on 2.6.29+ Message-ID: <49F862C8.6030102@ext.bull.net> orphan_count and sockets_allocated have been changed from atomic_t to percpu_counter. As percpu_counters are huge they cannot be allocated on the stack without causing the sdp module to crash. 
Both variable are now dynamically allocated at module init. Signed-off-by: Nicolas Morey-Chaisemartin --- drivers/infiniband/ulp/sdp/sdp_main.c | 29 +++++++++++++++++++---------- 1 files changed, 19 insertions(+), 10 deletions(-) diff --git a/drivers/infiniband/ulp/sdp/sdp_main.c b/drivers/infiniband/ulp/sdp/sdp_main.c index 51801e0..7a38c47 100644 --- a/drivers/infiniband/ulp/sdp/sdp_main.c +++ b/drivers/infiniband/ulp/sdp/sdp_main.c @@ -580,7 +580,7 @@ adjudge_to_death: /* TODO: tcp_fin_time to get timeout */ sdp_dbg(sk, "%s: entering time wait refcnt %d\n", __func__, atomic_read(&sk->sk_refcnt)); - atomic_inc(sk->sk_prot->orphan_count); + percpu_counter_inc(sk->sk_prot->orphan_count); } /* TODO: limit number of orphaned sockets. @@ -861,7 +861,7 @@ void sdp_cancel_dreq_wait_timeout(struct sdp_sock *ssk) sock_put(&ssk->isk.sk, SOCK_REF_DREQ_TO); } - atomic_dec(ssk->isk.sk.sk_prot->orphan_count); + percpu_counter_dec(ssk->isk.sk.sk_prot->orphan_count); } void sdp_destroy_work(struct work_struct *work) @@ -902,7 +902,7 @@ void sdp_dreq_wait_timeout_work(struct work_struct *work) sdp_sk(sk)->dreq_wait_timeout = 0; if (sk->sk_state == TCP_FIN_WAIT1) - atomic_dec(ssk->isk.sk.sk_prot->orphan_count); + percpu_counter_dec(ssk->isk.sk.sk_prot->orphan_count); sdp_exch_state(sk, TCPF_LAST_ACK | TCPF_FIN_WAIT1, TCP_TIME_WAIT); @@ -2162,9 +2162,9 @@ void sdp_urg(struct sdp_sock *ssk, struct sk_buff *skb) sk->sk_data_ready(sk, 0); } -static atomic_t sockets_allocated; +static struct percpu_counter *sockets_allocated; static atomic_t memory_allocated; -static atomic_t orphan_count; +static struct percpu_counter *orphan_count; static int memory_pressure; struct proto sdp_proto = { .close = sdp_close, @@ -2182,10 +2182,8 @@ struct proto sdp_proto = { .get_port = sdp_get_port, /* Wish we had this: .listen = sdp_listen */ .enter_memory_pressure = sdp_enter_memory_pressure, - .sockets_allocated = &sockets_allocated, .memory_allocated = &memory_allocated, .memory_pressure = 
&memory_pressure, - .orphan_count = &orphan_count, .sysctl_mem = sysctl_tcp_mem, .sysctl_wmem = sysctl_tcp_wmem, .sysctl_rmem = sysctl_tcp_rmem, @@ -2540,6 +2538,15 @@ static int __init sdp_init(void) spin_lock_init(&sock_list_lock); spin_lock_init(&sdp_large_sockets_lock); + sockets_allocated = kmalloc(sizeof(*sockets_allocated), GFP_KERNEL); + orphan_count = kmalloc(sizeof(*orphan_count), GFP_KERNEL); + percpu_counter_init(sockets_allocated, 0); + percpu_counter_init(orphan_count, 0); + + sdp_proto.sockets_allocated = sockets_allocated; + sdp_proto.orphan_count = orphan_count; + + sdp_workqueue = create_singlethread_workqueue("sdp"); if (!sdp_workqueue) { return -ENOMEM; @@ -2574,9 +2581,9 @@ static void __exit sdp_exit(void) sock_unregister(PF_INET_SDP); proto_unregister(&sdp_proto); - if (atomic_read(&orphan_count)) - printk(KERN_WARNING "%s: orphan_count %d\n", __func__, - atomic_read(&orphan_count)); + if (percpu_counter_read_positive(orphan_count)) + printk(KERN_WARNING "%s: orphan_count %lld\n", __func__, + percpu_counter_read_positive(orphan_count)); destroy_workqueue(sdp_workqueue); flush_scheduled_work(); @@ -2589,6 +2596,8 @@ static void __exit sdp_exit(void) sdp_proc_unregister(); ib_unregister_client(&sdp_client); + kfree(orphan_count); + kfree(sockets_allocated); } module_init(sdp_init); -- 1.6.2.GIT From swise at opengridcomputing.com Wed Apr 29 09:26:05 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Wed, 29 Apr 2009 11:26:05 -0500 Subject: [ofa-general] Re: dapl attribute bug In-Reply-To: <49F7AE10.70904@opengridcomputing.com> References: <49871E6A.9000901@opengridcomputing.com> <4989F89D.8020905@opengridcomputing.com> <49F7AE10.70904@opengridcomputing.com> Message-ID: <49F87F9D.4000901@opengridcomputing.com> Bug 1613 opened to track this. I think we need this for ofed-1.4.1. Steve. Steve Wise wrote: > Hey Arlin, > > Did this ever get fixed? > > I think UNH is seeing this issue still. 
> > > > Steve Wise wrote: >> Davis, Arlin R wrote: >>> >>> >>> >>>> The DAPL dat_ia_attr->max_lmr_block_size is a u32, yet the dapl >>>> code maps this to the linux ib_device_attr->max_mr_size which is u64. >>>> >>>> This causes dapltest to fail in some cases when running over >>>> chelsio which sets max_mr_size to 0x100000000 (4GB). The dapl code >>>> truncates the value to 0. See dapl/openib_cma/dapl_ib_util.c. >>>> >>>> I'm not sure what the fix should be, but maybe the dapl code should >>>> set anything over 32 bits to 0xffffffff? >>>> >>>> >>> >>> This attribute changed with DAT 2.0 to match the 32-bit ibv_sge >>> length field. Since there are no direct max lmr segments mappings >>> I will need add some checks when setting max_lmr_block_size from >>> max_mr_size. Thanks. >>> >>> -arlin >> >> I'll test your fix when its ready. Lemme know. >> >> >> Steve. >> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Wed Apr 29 09:51:47 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 29 Apr 2009 09:51:47 -0700 Subject: [ofa-general] Re: linux-next: infiniband tree build failure In-Reply-To: <20090429140101.d9c7467c.sfr@canb.auug.org.au> (Stephen Rothwell's message of "Wed, 29 Apr 2009 14:01:01 +1000") References: <20090429140101.d9c7467c.sfr@canb.auug.org.au> Message-ID: > Today's linux-next build (powerpc ppc64_defconfig) produced this new > warning: > > drivers/infiniband/hw/ehca/hcp_phyp.c: In function 'hcp_galpas_ctor': > drivers/infiniband/hw/ehca/hcp_phyp.c:65: 
warning: assignment makes integer from pointer without a cast > > Caused by commit 2bd93ed8b59d9bf8b918a0fa04be50482906c16b ("IB/ehca: > Remove unnecessary memory operations for userspace queue pairs"). Thanks for pointing this out. I rolled the below into the patch in question, which should fix this. diff --git a/drivers/infiniband/hw/ehca/hcp_phyp.c b/drivers/infiniband/hw/ehca/hcp_phyp.c index fc3a245..b3e0e72 100644 --- a/drivers/infiniband/hw/ehca/hcp_phyp.c +++ b/drivers/infiniband/hw/ehca/hcp_phyp.c @@ -62,7 +62,7 @@ int hcp_galpas_ctor(struct h_galpas *galpas, int is_user, if (ret) return ret; } else - galpas->kernel.fw_handle = NULL; + galpas->kernel.fw_handle = 0; galpas->user.fw_handle = paddr_user; From rdreier at cisco.com Wed Apr 29 10:03:10 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 29 Apr 2009 10:03:10 -0700 Subject: [ofa-general] New proposal for memory management In-Reply-To: (Jeff Squyres's message of "Mon, 13 Apr 2009 12:07:17 -0400") References: Message-ID: > But whacky situations might occur in a multithreaded application where > one thread calls free() while another thread calls malloc(), gets the > same virtual address that was just free()d but has not yet been > unregistered in the kernel, so a subsequent ibv_post_send() may > succeed but be sending the wrong data. > > Put simply: in a multi-threaded application, there's always the chance > that the notify won't get to the user-level process until after the > global notifier variable has been checked, right? Or, putting it the > other way: is there any kind of notify system that could be used that > *can't* create a potential race condition in a multi-threaded user > application? Without thinking too much about the proposal (except that it adds a lot of new verb interfaces and a lot of kernel code, and therefore feels like a hassle to me), I don't see how this race is solved by moving a cache to the kernel. 
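To make the address-reuse hazard concrete, here is a minimal sketch in C of a user-space registration cache. Everything in it is hypothetical and only illustrates the idea: struct reg_cache, the reg_cache_* functions and the integer "key" are invented for this example, are not libibverbs or any MPI's cache, and do no real registration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REG_CACHE_SLOTS 8

/* One cached "registration": an address range plus an opaque key. */
struct reg_entry {
    uintptr_t addr;
    size_t    len;
    uint32_t  key;    /* stand-in for a memory key; hypothetical */
    int       valid;
};

struct reg_cache {
    struct reg_entry slot[REG_CACHE_SLOTS];
    uint32_t next_key;
};

void reg_cache_init(struct reg_cache *c)
{
    for (int i = 0; i < REG_CACHE_SLOTS; i++)
        c->slot[i].valid = 0;
    c->next_key = 1;
}

/* Return the cached key covering [addr, addr+len), or 0 on a miss. */
uint32_t reg_cache_lookup(const struct reg_cache *c, uintptr_t addr, size_t len)
{
    for (int i = 0; i < REG_CACHE_SLOTS; i++) {
        const struct reg_entry *e = &c->slot[i];
        if (e->valid && addr >= e->addr && addr + len <= e->addr + e->len)
            return e->key;
    }
    return 0;
}

/* "Register" a range: store it and return a fresh key (0 if the cache is full). */
uint32_t reg_cache_insert(struct reg_cache *c, uintptr_t addr, size_t len)
{
    for (int i = 0; i < REG_CACHE_SLOTS; i++) {
        struct reg_entry *e = &c->slot[i];
        if (!e->valid) {
            e->addr = addr;
            e->len = len;
            e->key = c->next_key++;
            e->valid = 1;
            return e->key;
        }
    }
    return 0;
}

/*
 * What a free()/munmap() hook or a kernel notifier would trigger:
 * drop every entry overlapping the released range.
 * Returns the number of entries dropped.
 */
int reg_cache_invalidate(struct reg_cache *c, uintptr_t addr, size_t len)
{
    int dropped = 0;
    for (int i = 0; i < REG_CACHE_SLOTS; i++) {
        struct reg_entry *e = &c->slot[i];
        if (e->valid && addr < e->addr + e->len && e->addr < addr + len) {
            e->valid = 0;
            dropped++;
        }
    }
    return dropped;
}
```

If a buffer is freed and the same virtual address is handed out again before reg_cache_invalidate() runs, reg_cache_lookup() still returns the old key, which is the stale-registration case under discussion. The sketch says nothing about where the cache should live; it only shows that correctness depends on the invalidation being ordered before the address is reused.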
If you have free()/malloc() of a buffer running in parallel with send operations targeting the same buffer, then that seems like a buggy MPI application. Since free()/malloc() might not involve the kernel at all (the userspace library might keep its own free list, etc) I don't see how a registration cache in the kernel would help anyway. Now, since free()/malloc() operations must be serialized with respect to send/receive operations in userspace anyway, I don't see why a simpler (and possibly more flexible/powerful) kernel notifier design can't work -- if free() releases virtual memory back to the kernel, then the kernel notifier will run before the free() call returns, so things should work as planned. - R. From andy.grover at oracle.com Wed Apr 29 10:52:40 2009 From: andy.grover at oracle.com (Andy Grover) Date: Wed, 29 Apr 2009 10:52:40 -0700 Subject: [ofa-general] RDMA needs a tutorial. Message-ID: <49F893E8.9060704@oracle.com> Has OFED or anyone considered the need for a software-oriented tutorial on how to use RDMA? Currently it seems like there is a) the IB spec, b) the MindShare IB book, c) ibv_* manpages, and d) the source, which has zero comments. All of these are HW-oriented and/or assume knowledge of IB already. I'm thinking something like The Java Tutorial, http://java.sun.com/docs/books/tutorial/ , which also was published on paper. RDMA needs something that provides an entry point to the thicket of IB jargon, that explains how all the pieces fit together, and simple example code that people can try out and play with. Googling "infiniband tutorial" returns nothing useful. Compare that to "java tutorial", "sockets tutorial", "python tutorial". Especially if rdma-capable HW is really going to be on every server motherboard soon, there will be a wave of casual/hobbyist programmers who may give RDMA a go, but only *if* they can understand it in an evening or two, instead of six months. 
Regards -- Andy From robert.j.woodruff at intel.com Wed Apr 29 10:56:27 2009 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Wed, 29 Apr 2009 10:56:27 -0700 Subject: [ofa-general] RE: New proposal for memory management In-Reply-To: <3E350D1F-BADB-4B7B-8307-BE0DA085DBF3@cisco.com> References: <3E350D1F-BADB-4B7B-8307-BE0DA085DBF3@cisco.com> Message-ID: <382A478CAD40FA4FB46605CF81FE39F42B56D2BC@orsmsx507.amr.corp.intel.com> Jeff wrote, >Is anyone going to comment on this? I'm surprised / disappointed that >it's been over 2 weeks with *no* comments. >Roland can't lead *every* discussion... Having a memory registration cache in the kernel seems like a bad idea to me. It will likely be a lot of code that is very complicated and prone to bugs that are not easy to find or fix. In the past, caching of things like SA records, i.e., the local sa cache have been rejected and to me this seems like a similar type of request. In general if something can be done in user-space rather than the kernel, I think it should be done in user-space. MPIs today are clearly able to implement this type of caching in user-space. Rather than dump a whole bunch of new code into the kernel, why not make it a user-space library instead. If libc needs changes to allow additional hooking of things like malloc/free, then work with the libc maintainer to get those hooks into libc. I think there is already a standard way to do this for libc malloc/free. As for the automatic registration/deregistration, I do not think you really want this either. If it requires dynamic paging in and locking of pages that are not in memory or locked, this will lead to severe variability in job performance run to run, depending on system load and such, and I do not think you really want that. 
For example, if one node has to delay to have a page paged in so that it can be locked and registered, it can delay all of the nodes in the cluster that are waiting for that node to, say, respond to a collective operation, thus slowing down the whole job.

my 2 cents,

woody

From rdreier at cisco.com Wed Apr 29 10:57:07 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 29 Apr 2009 10:57:07 -0700
Subject: [ofa-general] RDMA needs a tutorial.
In-Reply-To: <49F893E8.9060704@oracle.com> (Andy Grover's message of "Wed, 29 Apr 2009 10:52:40 -0700")
References: <49F893E8.9060704@oracle.com>
Message-ID:

> Has OFED or anyone considered the need for a software-oriented tutorial
> on how to use RDMA?

Are you volunteering?

- R.

From andy.grover at oracle.com Wed Apr 29 11:17:02 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Wed, 29 Apr 2009 11:17:02 -0700
Subject: [ofa-general] RDMA needs a tutorial.
In-Reply-To:
References: <49F893E8.9060704@oracle.com>
Message-ID: <49F8999E.6030001@oracle.com>

Roland Dreier wrote:
>> Has OFED or anyone considered the need for a software-oriented
>> tutorial on how to use RDMA?
>
> Are you volunteering?

Sure, I'd help in a group effort.

Do you think enough people besides me see this as a critical missing piece to make it happen? Because if it's just me, I'll just do a series of blog posts, but that won't be nearly as good as an official, comprehensive, reviewed, high-quality[1] one that comes up #1 on Google.

Regards -- Andy

[1] yes, get a professional technical writer involved

From rdreier at cisco.com Wed Apr 29 11:22:36 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 29 Apr 2009 11:22:36 -0700
Subject: [ofa-general] RDMA needs a tutorial.
In-Reply-To: <49F8999E.6030001@oracle.com> (Andy Grover's message of "Wed, 29 Apr 2009 11:17:02 -0700")
References: <49F893E8.9060704@oracle.com> <49F8999E.6030001@oracle.com>
Message-ID:

> [1] yes, get a professional technical writer involved

The issue is: who will fund a tech writer?

- R.

From rdreier at cisco.com Wed Apr 29 11:23:56 2009
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 29 Apr 2009 11:23:56 -0700
Subject: [ofa-general] RDMA needs a tutorial.
In-Reply-To: <49F8999E.6030001@oracle.com> (Andy Grover's message of "Wed, 29 Apr 2009 11:17:02 -0700")
References: <49F893E8.9060704@oracle.com> <49F8999E.6030001@oracle.com>
Message-ID:

> Do you think enough people besides me see this as a critical missing
> piece to make it happen? Because if it's just me, I'll just do a series
> of blog posts, but that won't be nearly as good as an official,
> comprehensive, reviewed, high-quality[1] one that comes up #1 on Google.

I think a lot of people would like to have this. The issue is that the amount of time available to work on it is pretty low. But starting something like a CC-licensed online book might attract a sufficient community to finally get something done.

- R.

From andy.grover at oracle.com Wed Apr 29 11:55:12 2009
From: andy.grover at oracle.com (Andy Grover)
Date: Wed, 29 Apr 2009 11:55:12 -0700
Subject: [ofa-general] RDMA needs a tutorial.
In-Reply-To:
References: <49F893E8.9060704@oracle.com> <49F8999E.6030001@oracle.com>
Message-ID: <49F8A290.4000101@oracle.com>

Roland Dreier wrote:
>> [1] yes, get a professional technical writer involved
>
> The issue is: who will fund a tech writer?

From http://openfabrics.org/about.htm :

"To that end, the Alliance provides tools and development resources to code, refine and publish standards-based, open-source software for RDMA."

I'd think a tutorial qualifies as a "development resource", so I'd probably ask them first.
If people agree this is a priority, then I feel confident funding can be arranged somehow, although it may be in 2011 :)

Regards -- Andy

From swise at opengridcomputing.com Wed Apr 29 12:14:38 2009
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 29 Apr 2009 14:14:38 -0500
Subject: [ofa-general] [PATCH 2.6.30] RDMA/cxgb3: flushed sq wr completions get inserted twice in to the cqe
Message-ID: <20090429191438.29393.59197.stgit@build.ogc.int>

When the sq is flushed, mark the flushed entries as not signaled so the poll logic doesn't re-insert the cqe thinking it's an out of order completion. The bug can cause the nfsrdma server to crash due to processing the same completed WR twice.

Signed-off-by: Steve Wise
---

 drivers/infiniband/hw/cxgb3/cxio_hal.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c
index 8d71086..62f9cf2 100644
--- a/drivers/infiniband/hw/cxgb3/cxio_hal.c
+++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c
@@ -410,6 +410,7 @@ int cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count)
         ptr = wq->sq_rptr + count;
         sqp = wq->sq + Q_PTR2IDX(ptr, wq->sq_size_log2);
         while (ptr != wq->sq_wptr) {
+                sqp->signaled = 0;
                 insert_sq_cqe(wq, cq, sqp);
                 ptr++;
                 sqp = wq->sq + Q_PTR2IDX(ptr, wq->sq_size_log2);

From ralph.campbell at qlogic.com Wed Apr 29 12:25:42 2009
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Wed, 29 Apr 2009 12:25:42 -0700
Subject: [ofa-general] Re: New proposal for memory management
In-Reply-To: <20090429040329.GD29727@obsidianresearch.com>
References: <3E350D1F-BADB-4B7B-8307-BE0DA085DBF3@cisco.com> <1240956668.3403.324.camel@chromite.mv.qlogic.com> <20090429040329.GD29727@obsidianresearch.com>
Message-ID: <1241033142.3403.341.camel@chromite.mv.qlogic.com>

On Tue, 2009-04-28 at 21:03 -0700, Jason Gunthorpe wrote:
> On Tue, Apr 28, 2009 at 09:10:35PM -0400, Jeff Squyres wrote:
>
> > The conversation needs to start somewhere.
MPI is verbs' biggest
> > customer; this is a major pain point for all of us. Can't we fix it?
> > Do you need something more than a specific use case and API proposal
> > to start the conversation? No one has money to travel; the
> > bi-weekly EWG call is for discussing bugs. What other vehicle do you
> > suggest for this discussion?
>
> I've often wondered, wouldn't it just be fine for MPI if the entire
> process address space is kept pinned, registered and consistent with
> the HCA? The process would opt in to this behavior during MPI
> startup. Similar in spirit to the all physical memory registration the
> kernel can do.
>
> That seems like a much more straightforward problem than trying to fit
> the fairly incompatible verbs registration API and the MPI API together.
>
> Jason

This would be nice except that it is impossible. The whole idea behind virtual memory is that the application uses virtual addresses which are only temporarily mapped to physical memory by the OS. If the OS allowed a user to lock the virtual to physical mapping for all of physical memory, the OS wouldn't be able to run any other processes, do I/O, etc. The process can have a larger virtual address space than the available physical memory so no fixed mapping is possible. To prevent this, the application has limits imposed by the OS (type "ulimit -l").

The kernel doesn't really have a "map all physical memory" registration. This mapping basically tells the HCA that the kernel will handle the mapping instead of the HCA driver.

From Ted.Kim at Sun.COM Wed Apr 29 13:16:32 2009
From: Ted.Kim at Sun.COM (Ted H.
Kim) Date: Wed, 29 Apr 2009 13:16:32 -0700 Subject: [ofa-general] Re: New proposal for memory management In-Reply-To: <1241033142.3403.341.camel@chromite.mv.qlogic.com> References: <3E350D1F-BADB-4B7B-8307-BE0DA085DBF3@cisco.com> <1240956668.3403.324.camel@chromite.mv.qlogic.com> <20090429040329.GD29727@obsidianresearch.com> <1241033142.3403.341.camel@chromite.mv.qlogic.com> Message-ID: <49F8B5A0.9040600@sun.com> Unless you went the whole way of not really pinning and allowing paging of registered memory. Of course before anything got paged out, you would have to call the HCA driver to mark said pages "not present". And if you hit such pages when processing a work request in the HCA, you would have to take a page fault, get the page back and resume the work request (and probably hold off the remote side with an RNR_NAK in the meantime). This was a lively topic of conversation a long time ago in IBTA ... and probably a bigger patch to the world than people were contemplating :-). -ted > This would be nice except that it is impossible. > The whole idea behind virtual memory is that the application uses > virtual addresses which are only temporarily mapped to physical > memory by the OS. If the OS allowed a user to lock the > virtual to physical mapping for all of physical memory, the OS > wouldn't be able to run any other processes, do I/O, etc. > The process can have a larger virtual address space than the > available physical memory so no fixed mapping is possible. > To prevent this, the application has limits imposed by the OS > (type "ulimit -l"). -- Ted H. Kim Sun Microsystems, Inc. 
ted.kim at sun.com           222 North Sepulveda Blvd., 10th Floor
(310) 341-1116               El Segundo, CA 90245
(310) 341-1120 FAX

From jgunthorpe at obsidianresearch.com Wed Apr 29 13:25:30 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Wed, 29 Apr 2009 14:25:30 -0600
Subject: [ofa-general] Re: New proposal for memory management
In-Reply-To:
References: <3E350D1F-BADB-4B7B-8307-BE0DA085DBF3@cisco.com> <1240956668.3403.324.camel@chromite.mv.qlogic.com> <20090429040329.GD29727@obsidianresearch.com>
Message-ID: <20090429202530.GS4431@obsidianresearch.com>

On Wed, Apr 29, 2009 at 08:15:57AM -0400, Jeff Squyres wrote:
> On Apr 29, 2009, at 12:03 AM, Jason Gunthorpe wrote:
>
>> I've often wondered, wouldn't it just be fine for MPI if the entire
>> process address space is kept pinned, registered and consistent with
>> the HCA? The process would opt in to this behavior during MPI
>> startup. Similar in spirit to the all physical memory registration the
>> kernel can do.

> Re-reading your brief text; I'm wondering if I missed the zen of what
> you're trying to suggest...? If I'm off the mark, can you explain more?
> Thanks.

Ah yes, you went down the wrong path. I don't suggest doing anything with physical memory, but basically the equivalent of adding the result of every mmap() and sbrk/brk() call to the HCA mapping, and removing from the mapping at every call to munmap(), synchronously with those syscalls.

The net result would be that the verbs registration would follow the virtual memory allocation of the kernel.

Basically, the API would work like this:

  ibv_mr *mr = // some MR..
  ibv_register_mr_all(mr);

  // At this point mr has all of /proc/self/maps included

  void *foo = mmap(...);

  // Before mmap returns, the equivalent of ibv_reg_mr(mr,foo..) is
  // done

  munmap(foo...);
  // ibv_unreg_mr(mr,foo) is done..

Essentially when this mode is enabled, mr always contains every virtual address in /proc/self/maps.
It is similar to the effect you get by calling mlockall().

The downside is that every byte of virtual memory in an MPI process must be pinned to physical RAM before mmap() returns. You don't get to swap MPI jobs. (Well, perhaps there could be a new mmap flag to create un-registered memory that can be swapped for special needs)

Since this is done at mmap/brk time and not at page fault time it should not alter the performance of the MPI job unless it is doing a lot of mmap calls for some reason (which is slow anyhow).

Jason

From ralph.campbell at qlogic.com Wed Apr 29 13:33:02 2009
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Wed, 29 Apr 2009 13:33:02 -0700
Subject: [ofa-general] Re: New proposal for memory management
In-Reply-To: <49F8B5A0.9040600@sun.com>
References: <3E350D1F-BADB-4B7B-8307-BE0DA085DBF3@cisco.com> <1240956668.3403.324.camel@chromite.mv.qlogic.com> <20090429040329.GD29727@obsidianresearch.com> <1241033142.3403.341.camel@chromite.mv.qlogic.com> <49F8B5A0.9040600@sun.com>
Message-ID: <1241037182.3403.345.camel@chromite.mv.qlogic.com>

Correct. This is what I was referring to when I said hardware support would be needed to make use of the 2.6.27 page mapping callback notifications. I don't know of any current HCAs which support this sort of dynamic mapping.

On Wed, 2009-04-29 at 13:16 -0700, Ted H. Kim wrote:
> Unless you went the whole way of not really pinning
> and allowing paging of registered memory.
> Of course before anything got paged out, you would have
> to call the HCA driver to mark said pages "not present".
> And if you hit such pages when processing a work
> request in the HCA, you would have to take a page fault,
> get the page back and resume the work request
> (and probably hold off the remote side with an RNR_NAK
> in the meantime).
>
> This was a lively topic of conversation a long time ago
> in IBTA ... and probably a bigger patch to the world
> than people were contemplating :-).
> > -ted > > > > This would be nice except that it is impossible. > > The whole idea behind virtual memory is that the application uses > > virtual addresses which are only temporarily mapped to physical > > memory by the OS. If the OS allowed a user to lock the > > virtual to physical mapping for all of physical memory, the OS > > wouldn't be able to run any other processes, do I/O, etc. > > The process can have a larger virtual address space than the > > available physical memory so no fixed mapping is possible. > > To prevent this, the application has limits imposed by the OS > > (type "ulimit -l"). > From ralph.campbell at qlogic.com Wed Apr 29 13:39:24 2009 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 29 Apr 2009 13:39:24 -0700 Subject: [ofa-general] Re: New proposal for memory management In-Reply-To: <20090429202530.GS4431@obsidianresearch.com> References: <3E350D1F-BADB-4B7B-8307-BE0DA085DBF3@cisco.com> <1240956668.3403.324.camel@chromite.mv.qlogic.com> <20090429040329.GD29727@obsidianresearch.com> <20090429202530.GS4431@obsidianresearch.com> Message-ID: <1241037564.3403.350.camel@chromite.mv.qlogic.com> On Wed, 2009-04-29 at 13:25 -0700, Jason Gunthorpe wrote: > On Wed, Apr 29, 2009 at 08:15:57AM -0400, Jeff Squyres wrote: > > On Apr 29, 2009, at 12:03 AM, Jason Gunthorpe wrote: > > > >> I've often wondered, wouldn't it just be fine for MPI if the entire > >> process address space is kept pinned, registered and consistent with > >> the HCA? The process would opt in to this behavior during MPI > >> startup. Similar in spirit to the all physical memory registration the > >> kernel can do. > > > Re-reading your brief text; I'm wondering if I missed the zen of what > > you're trying to suggest...? If I'm off the mark, can you explain more? > > Thanks. > > Ah yes, you went down the wrong path. 
I don't suggest doing anything
> with physical memory, but basically the equivalent of adding the
> result of every mmap() and sbrk/brk() call to the HCA mapping, and
> removing from the mapping at every call to munmap(), synchronously
> with those syscalls.
>
> The net result would be that the verbs registration would follow the
> virtual memory allocation of the kernel.
>
> Basically, the API would work like this:
>   ibv_mr *mr = // some MR..
>   ibv_register_mr_all(mr);
>
>   // At this point mr has all of /proc/self/maps included
>
>   void *foo = mmap(...);
>
>   // Before mmap returns, the equivalent of ibv_reg_mr(mr,foo..) is
>   // done
>
>   munmap(foo...);
>   // ibv_unreg_mr(mr,foo) is done..
>
>
> Essentially when this mode is enabled, mr always contains every
> virtual address in /proc/self/maps.
>
> It is similar to the effect you get by calling mlockall();
>
> The downside is that every byte of virtual memory in an MPI process
> must be pinned to physical ram before mmap() returns. You don't get
> to swap MPI jobs. (Well, perhaps there could be a new mmap flag to
> create un-registered memory that can be swapped for special needs)
>
> Since this is done at mmap/brk time and not at page fault time it
> should not alter the performance of the MPI job unless it is doing
> a lot of mmap calls for some reason (which is slow anyhow).
>
> Jason

OK. This is a bit more reasonable. Putting this into my own words, the HCA's mapping would mirror the application's VM to physical mapping. Since the HCAs currently require this mapping to be fixed between register/unregister, it would not be practical to pin this amount of memory. It would require the dynamic mapping I mentioned in reply to Ted Kim.
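
The locked-memory limit mentioned earlier in the thread ("ulimit -l") is one concrete way to see why pinning a whole address space is hard in practice. What follows is a minimal C sketch, not part of any proposal in this thread: fits_memlock() and memlock_soft_limit() are hypothetical helper names, the only real interface used is getrlimit() with the RLIMIT_MEMLOCK resource, and it assumes that pinned pages are charged against that limit.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/* Hypothetical helper: would locking `bytes` more memory still fit under
 * a given RLIMIT_MEMLOCK value, given what is already locked? */
static int fits_memlock(rlim_t limit, size_t already_locked, size_t bytes)
{
    if (limit == RLIM_INFINITY)
        return 1;                      /* no limit configured */
    if (already_locked > limit)
        return 0;                      /* already over the limit */
    return bytes <= limit - already_locked;
}

/* Hypothetical helper: fetch the soft limit that "ulimit -l" reports.
 * getrlimit() returns bytes; the shell usually prints the value in kB.
 * Returns 0 on success, -1 on error. */
static int memlock_soft_limit(rlim_t *out)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;
    *out = rl.rlim_cur;
    return 0;
}
```

A process could call memlock_soft_limit() at startup and pass its total mapped size to fits_memlock() to decide whether an "opt in and pin everything" mode is even possible on that node.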
From jgunthorpe at obsidianresearch.com Wed Apr 29 13:41:26 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Wed, 29 Apr 2009 14:41:26 -0600 Subject: [ofa-general] Re: New proposal for memory management In-Reply-To: <1241037182.3403.345.camel@chromite.mv.qlogic.com> References: <3E350D1F-BADB-4B7B-8307-BE0DA085DBF3@cisco.com> <1240956668.3403.324.camel@chromite.mv.qlogic.com> <20090429040329.GD29727@obsidianresearch.com> <1241033142.3403.341.camel@chromite.mv.qlogic.com> <49F8B5A0.9040600@sun.com> <1241037182.3403.345.camel@chromite.mv.qlogic.com> Message-ID: <20090429204126.GU4431@obsidianresearch.com> On Wed, Apr 29, 2009 at 01:33:02PM -0700, Ralph Campbell wrote: > Correct. This is what I was referring to when I said > hardware support would be needed to make use of the 2.6.27 > page mapping callback notifications. I don't know of any > current HCAs which support this sort of dynamic mapping. You can use the callback notifications to pin the memory and register it with the HCA when a new VMA is allocated, and do the reverse when a VMA is removed. Paging at the HCA level is only necessary to avoid pinning the ram. MPI may be a special enough case where this is actually doable without a huge overcommit. It would be very interesting to see /proc/PID/smaps information for a running MPI job to compute how many unallocated pages are present in a job. 
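
One rough way to get at the number asked for above is to sum the Size and Rss fields of /proc/PID/smaps: the difference is address space that is mapped but not currently backed by resident pages. The C sketch below is only an illustration under that assumption (that "unallocated" means mapped but not resident); smaps_sum() and struct smaps_totals are hypothetical names, and the parser works on text already read into memory.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Totals over all mappings in smaps-format text, in kB. */
struct smaps_totals {
    long size_kb; /* sum of "Size:" lines: virtual address space mapped */
    long rss_kb;  /* sum of "Rss:" lines: pages currently resident */
};

/* Sum the Size and Rss fields of smaps-format text. A field is counted
 * only when its name starts the line, so "KernelPageSize:" is ignored. */
static struct smaps_totals smaps_sum(const char *text)
{
    struct smaps_totals t = { 0, 0 };
    const char *line = text;

    while (line && *line) {
        long kb;

        if (sscanf(line, "Size: %ld kB", &kb) == 1)
            t.size_kb += kb;
        else if (sscanf(line, "Rss: %ld kB", &kb) == 1)
            t.rss_kb += kb;

        line = strchr(line, '\n');
        if (line)
            line++; /* step past the newline to the next line */
    }
    return t;
}
```

For a running job, the contents of /proc/PID/smaps would be read into a buffer and passed to smaps_sum(); size_kb minus rss_kb is then the amount that pinning everything at mmap() time would additionally force into RAM.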
-- Jason Gunthorpe (780)4406067x832 Chief Technology Officer, Obsidian Research Corp Edmonton, Canada From bwbarre at sandia.gov Wed Apr 29 13:45:23 2009 From: bwbarre at sandia.gov (Barrett, Brian W) Date: Wed, 29 Apr 2009 14:45:23 -0600 Subject: [ofa-general] New proposal for memory management In-Reply-To: Message-ID: On 4/29/09 11:03 , "Roland Dreier" wrote: >> But whacky situations might occur in a multithreaded application where >> one thread calls free() while another thread calls malloc(), gets the >> same virtual address that was just free()d but has not yet been >> unregistered in the kernel, so a subsequent ibv_post_send() may >> succeed but be sending the wrong data. >> >> Put simply: in a multi-threaded application, there's always the chance >> that the notify won't get to the user-level process until after the >> global notifier variable has been checked, right? Or, putting it the >> other way: is there any kind of notify system that could be used that >> *can't* create a potential race condition in a multi-threaded user >> application? > > Without thinking too much about the proposal (except that it adds a lot > of new verb interfaces and a lot of kernel code, and therefore feels > like a hassle to me), I don't see how this race is solved by moving a > cache to the kernel. If you think this sounds like a hassle, think about what it looks like from the point of view of the MPI implementer (or any other developer writing libraries which sit between user data and OFED, like GASNet). We don't write kernel modules, can't do much to change libc, and have to compete on performance (particularly benchmarks that send large messages from the same buffer). We're forced into a library-level pin cache to get competitive performance, but don't have the hooks to do it properly. Instead, we try a whole list of hacks to intercept free() and munmap() and hope for the best, often missing. 
And Open Fabrics is the only "commodity" interface that makes implementers go through these pains. Myrinet's MX, Cray's Portals, and Quadrics' Tports all handle the issues either at the driver library or kernel module level.

One statistic I like to point out (as a supporter of proper offload interconnects and interfaces) is that there are 13,363 lines of code to support InfiniBand within Open MPI, and that doesn't include logic for pin caching, message matching, request management, or multi-nic striping. There are 4560 lines of code to support Cray Portals, and that includes all logic for pin caching, message matching, request management, and multi-nic. Guess which one I think is more complex and feels like a hassle to me?

> If you have free()/malloc() of a buffer running in parallel with send
> operations targeting the same buffer, then that seems like a buggy MPI
> application. Since free()/malloc() might not involve the kernel at all
> (the userspace library might keep its own free list, etc) I don't see
> how a registration cache in the kernel would help anyway.
>
> Now, since free()/malloc() operations must be serialized with respect to
> send/receive operations in userspace anyway, I don't see why a simpler
> (and possibly more flexible/powerful) kernel notifier design can't
> work -- if free() releases virtual memory back to the kernel, then the
> kernel notifier will run before the free() call returns, so things
> should work as planned.

Jeff and I talked for a while today, and we're pretty sure that as long as the byte set by the kernel notifier is written before the pages are returned into the unallocated list, there isn't actually a race condition. It does mean that every time the page cache is searched, we also have to check the byte (and likely take a cache miss), but that's not too evil. However, the notifier concept still has the problem of how the kernel tells the user which pages were given back to the kernel.
It has to pass a (potentially very large) amount of data back to the user, so the memory ownership issues with kernel/user space are interesting. It also has to somewhat atomically prepare the list and unset the notifier byte, which is also problematic. But probably workable.

So perhaps the notifier method would be sufficient after all.

Brian

--
Brian W. Barrett
Dept. 1423: Scalable System Software
Sandia National Laboratories

From ralph.campbell at qlogic.com Wed Apr 29 14:04:18 2009
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Wed, 29 Apr 2009 14:04:18 -0700
Subject: [ofa-general] Re: New proposal for memory management
In-Reply-To:
References: <3E350D1F-BADB-4B7B-8307-BE0DA085DBF3@cisco.com> <1240956668.3403.324.camel@chromite.mv.qlogic.com>
Message-ID: <1241039058.3403.369.camel@chromite.mv.qlogic.com>

On Tue, 2009-04-28 at 18:10 -0700, Jeff Squyres wrote:
> On Apr 28, 2009, at 6:11 PM, Ralph Campbell wrote:
>
> > Ah, free() just puts the buffer on a free list and a subsequent
> > malloc()
> > can return it. The application isn't aware of the MPI library calling
> > ibv_reg_mr()
> >
>
> Right.
>
> > and the MPI library isn't aware of the application
> > reusing the buffer differently.
> > The virtual to physical mapping can't change while it is pinned
> > so buffer B should have been written with new data overwriting
> > the same physical pages that buffer A used.
> > I would assume the application would wait for the MPI_isend() to
> > complete before freeing the buffer so it shouldn't be the case that
> > the same buffer is in the process of being sent when the application
> > overwrites the address and tries to send it again.
> >
>
> This is not the problem.
>
> An MPI program that re-uses a buffer that is in use in an ongoing
> non-blocking send operation is clearly erroneous.
>
> Perhaps my explanations were incorrect and you kernel gurus can
> educate me.
What I know can happen is: > > - MPI application alloc's buffer A and gets virtual address B back, > corresponding to physical address C > - MPI application calls MPI_SEND with A > > - MPI implementation registers buffer A, and caches that address B is > registered, and then does the send > > - MPI application frees buffer A > > - MPI implementation does *NOT* unregister buffer A > > - MPI application alloc's buffer X and gets virtual address *B* back, > corresponding to physical address Z (Z!=C) > - MPI application calls MPI_SEND with X > > - MPI implementation sees virtual address B in its cache and thinks > that it is already registered... badness ensues > > Note that the virtual addresses are the same, but the physical > addresses are different. This can, and does, happen. It makes it > impossible to tell the buffer apart in userspace -- MPI cannot tell > that the buffer is not already pinned (because according to MPI's > internal cache, it *is* registered already). The only way to hack > around this is for the MPI implementation to intercept free/sbrk/ > whatever (horrors!) so that it can a) know to unregister the buffer > and b) remove the address from its "already registered" cache. > > It's quite possible that I don't know why this happens, or stated the > wrong reasons why. But it definitely does happen. The problem is that MPI needs to be aware of the application doing the free() and unregister or flush its MR cache for that virtual address range. Of course it would be difficult for OpenMPI to have callbacks or hooks into every way memory could be allocated/freed that an application might use. It seems to me that this is mostly an issue for rendezvous sends. Eager sends can use a pool of preregistered memory which are reused as data is copied from the buffer and ibv_post_recv()'ed. At least now, I think I understand your issue. 
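
The stale-cache sequence discussed above can be reproduced with a small simulation. This is a sketch in plain C, not Open MPI's code and not the verbs API: struct reg_cache and the cache_* functions are hypothetical, and phys_id is only a stand-in for "the physical pages the HCA was told about" when the range was registered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 16

/* Hypothetical user-space registration cache, keyed only by virtual
 * address, which is all a library can see without extra hooks. */
struct reg_entry {
    uintptr_t va;      /* start of the registered range */
    size_t    len;     /* length in bytes */
    int       phys_id; /* stand-in for the pinned physical pages */
    int       valid;
};

struct reg_cache {
    struct reg_entry slot[CACHE_SLOTS];
};

/* Return the cached entry covering [va, va + len), or NULL on a miss. */
static struct reg_entry *cache_lookup(struct reg_cache *c, uintptr_t va,
                                      size_t len)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        struct reg_entry *e = &c->slot[i];
        if (e->valid && va >= e->va && va + len <= e->va + e->len)
            return e;
    }
    return NULL;
}

/* Record a registration; returns 0 on success, -1 if the cache is full. */
static int cache_insert(struct reg_cache *c, uintptr_t va, size_t len,
                        int phys_id)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (!c->slot[i].valid) {
            c->slot[i] = (struct reg_entry){ va, len, phys_id, 1 };
            return 0;
        }
    }
    return -1;
}

/* What a free()/munmap() hook or kernel notifier would have to trigger:
 * drop every entry overlapping the released range. Returns the number
 * of entries dropped. */
static int cache_invalidate(struct reg_cache *c, uintptr_t va, size_t len)
{
    int dropped = 0;

    for (int i = 0; i < CACHE_SLOTS; i++) {
        struct reg_entry *e = &c->slot[i];
        if (e->valid && va < e->va + e->len && e->va < va + len) {
            e->valid = 0;
            dropped++;
        }
    }
    return dropped;
}
```

Without a call to cache_invalidate() between the free and the next allocation, cache_lookup() keeps returning the entry with the old phys_id for the reused virtual address, which is the "badness ensues" step in the sequence above.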
From jsquyres at cisco.com Wed Apr 29 14:08:22 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Wed, 29 Apr 2009 17:08:22 -0400 Subject: [ofa-general] Re: New proposal for memory management In-Reply-To: <1241039058.3403.369.camel@chromite.mv.qlogic.com> References: <3E350D1F-BADB-4B7B-8307-BE0DA085DBF3@cisco.com> <1240956668.3403.324.camel@chromite.mv.qlogic.com> <1241039058.3403.369.camel@chromite.mv.qlogic.com> Message-ID: <6B60B4BD-FB97-45DD-94DB-51883E05F14F@cisco.com> On Apr 29, 2009, at 5:04 PM, Ralph Campbell wrote: > The problem is that MPI needs to be aware of the application doing > the free() and unregister or flush its MR cache for that virtual > address range. Of course it would be difficult for OpenMPI to have > callbacks or hooks into every way memory could be allocated/freed > that an application might use. > > It seems to me that this is mostly an issue for rendezvous sends. > Eager sends can use a pool of preregistered memory which are > reused as data is copied from the buffer and ibv_post_recv()'ed. > Yes, exactly! > At least now, I think I understand your issue. Sorry for not explaining better; it's complicated to explain and I assumed that most people were somewhat familiar with the MPI issues already. Bad assumption on my part... -- Jeff Squyres Cisco Systems From weiny2 at llnl.gov Wed Apr 29 14:53:55 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Wed, 29 Apr 2009 14:53:55 -0700 Subject: [ofa-general] Re: Issues with combined routing in smpquery In-Reply-To: <20090428205525.4ffdd778.weiny2@llnl.gov> References: <20090428202736.0ff049e5.weiny2@llnl.gov> <20090428205525.4ffdd778.weiny2@llnl.gov> Message-ID: <20090429145355.704fb2f5.weiny2@llnl.gov> I have traced this down a bit more. The drslid and drdlid have been encoded in the MAD reversed! This has happened somewhere between version 1.5.0 and 1.5.1. Applying the following patch allows combined routing to work but I don't know where the real bug is. 

diff --git a/libibmad/src/mad.c b/libibmad/src/mad.c
index 3f04da0..6f34e02 100644
--- a/libibmad/src/mad.c
+++ b/libibmad/src/mad.c
@@ -101,9 +101,9 @@ void *mad_encode(void *buf, ib_rpc_t * rpc, ib_dr_path_t * drpath, void *data)
         if (rpc->mgtclass == IB_SMI_DIRECT_CLASS) {
                 /* word 9 */
                 mad_set_field(buf, 0, IB_DRSMP_DRDLID_F,
-                              drpath->drdlid ? drpath->drdlid : 0xffff);
-                mad_set_field(buf, 0, IB_DRSMP_DRSLID_F,
                               drpath->drslid ? drpath->drslid : 0xffff);
+                mad_set_field(buf, 0, IB_DRSMP_DRSLID_F,
+                              drpath->drdlid ? drpath->drdlid : 0xffff);
 
                 /* bytes 128 - 256 - by default should be zero due to memset */
                 if (is_resp)

I don't see any differences between 1.5.0 and 1.5.1 which would cause this. Any ideas????

Ira

On Tue, 28 Apr 2009 20:55:25 -0700
Ira Weiny wrote:

> On Tue, 28 Apr 2009 20:27:36 -0700
> Ira Weiny wrote:
>
> > Sasha, Hal,
> >
> > I have some hardware on which the following query does not work.
> >
> > 18:40:54 > ./smpquery -c nodeinfo 243 0,1
> > ibwarn: [22072] mad_rpc: _do_madrpc failed; dport (Lid 243 DR path slid 148; dlid 65535; 0,1)
> > ./smpquery: iberror: failed: operation nodeinfo: node info query failed
> >
> > from the node I am running on.
> >
> > 20:08:46 > ibstat
> > CA 'mlx4_0'
> >         CA type: MT25418
> >         Number of ports: 2
> >         Firmware version: 2.6.0
> >         Hardware version: a0
> >         Node GUID: 0x0002c9020025feb4
> >         System image GUID: 0x0002c9020025feb7
> >         Port 1:
> >                 State: Active
> >                 Physical state: LinkUp
> >                 Rate: 10
> >                 Base lid: 148
> >                 LMC: 2
> >                 SM lid: 148
> >                 Capability mask: 0x0251086a
> >                 Port GUID: 0x0002c9020025feb5
> > [snip]
> >
> > 19:12:10 > hostname
> > hype137
> >
> >
> > A query on the LID alone returns this.
> > > > 18:41:20 > ./smpquery nodeinfo 243 > > # Node info: Lid 243 > > [snip] > > NodeType:........................Switch > > NumPorts:........................24 > > SystemGuid:......................0x0008f10400400e69 > > Guid:............................0x0008f10400400e69 > > PortGuid:........................0x0008f10400400e69 > > [snip] > > > > And iblinkinfo is. > > > > 18:41:26 > iblinkinfo.pl -S 0x0008f10400400e69 > > Switch 0x0008f10400400e69 ISR9288 Voltaire sFB-12D: > > 243 1[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 646 10[ ] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > > [snip] > > > > > > It looks like combined routing is not working at all except for this one > > query. (LID 37 is the switch which is connected to the HCA I am running > > on.) > > > > 18:53:18 > ./smpquery -c portinfo 37 0,1 > > # Port info: Lid 37 DR path slid 148; dlid 65535; 0,1 port 0 > > Mkey:............................0x0000000000000000 > > GidPrefix:.......................0xfe80000000000000 > > Lid:.............................148 > > SMLid:...........................148 > > [snip] > > > > All other combined routing queries I try fail. And even this one above is > > wrong. It is returning the data on port 6 not 1. Look at the output from the > > local switch. > > > > 19:12:00 > iblinkinfo.pl -R -S 0x000b8cffff004663 > > Switch 0x000b8cffff004663 MT47396 Infiniscale-III Mellanox Technologies: > > 37 1[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 108 1[ ] "hype132" ( ) > > 37 2[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 528 1[ ] "hype133" ( ) > > 37 3[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 296 1[ ] "hype134" ( ) > > 37 4[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 92 1[ ] "hype135" ( ) > > 37 5[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 144 1[ ] "hype136" ( ) > > > > This is what is connected to LID 148... 
> > 37 6[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 148 1[ ] "hype137" ( ) > > > > 37 7[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 540 1[ ] "hype138" ( ) > > 37 8[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 212 1[ ] "hype139" ( ) > > 37 9[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 532 1[ ] "hype140" ( ) > > 37 10[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 60 1[ ] "hype141" ( ) > > 37 11[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 192 1[ ] "hype142" ( ) > > 37 12[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 312 1[ ] "hype143" ( ) > > 37 13[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 647 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > > 37 14[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 641 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > > 37 15[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 643 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > > 37 16[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 653 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > > 37 17[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 637 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > > 37 18[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 610 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > > 37 19[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 655 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > > 37 20[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 645 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > > 37 21[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 635 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > > 37 22[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 651 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > > 37 23[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 639 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > > 37 24[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 649 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( ) > > > > Any idea what is going on? These were all run with a smpquery built from the > > current master tree. > > > > On my little test system this seems to work just fine... But not on this > > system. Did some older hardware not support combined DR routing? > > Actually I take this back. 
It seems an older version of smpquery works but > not this newer one. So I don't think this is a hardware issue. :-( > > 20:54:47 > ./smpquery -c nodeinfo 14 0,10 > ibwarn: [21947] _do_madrpc: send failed; Invalid argument > ibwarn: [21947] mad_rpc: _do_madrpc failed; dport (Lid 14 DR path slid 4; dlid 65535; 0,10) > ./smpquery: iberror: failed: operation nodeinfo: node info query failed > > 20:54:52 > ./smpquery -V > ./smpquery BUILD VERSION: 1.5.1_76524e3_dirty Build date: Apr 28 2009 20:47:10 > > 20:54:55 > smpquery -c nodeinfo 14 0,10 > # Node info: Lid 14 DR path 0,10 > BaseVers:........................1 > ClassVers:.......................1 > NodeType:........................Switch > NumPorts:........................24 > SystemGuid:......................0x0008f10400411b19 > Guid:............................0x0008f10400411b18 > PortGuid:........................0x0008f10400411b18 > PartCap:.........................8 > DevId:...........................0x5a30 > Revision:........................0x000001a1 > LocalPort:.......................24 > VendorId:........................0x0008f1 > > 20:54:59 > smpquery -V > smpquery BUILD VERSION: 1.3.6 Build date: Oct 13 2008 12:20:42 > > Ira > -- Ira Weiny Math Programmer/Computer Scientist Lawrence Livermore National Lab weiny2 at llnl.gov From jgunthorpe at obsidianresearch.com Wed Apr 29 14:55:08 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Wed, 29 Apr 2009 15:55:08 -0600 Subject: [ofa-general] Re: New proposal for memory management In-Reply-To: <1241039058.3403.369.camel@chromite.mv.qlogic.com> References: <3E350D1F-BADB-4B7B-8307-BE0DA085DBF3@cisco.com> <1240956668.3403.324.camel@chromite.mv.qlogic.com> <1241039058.3403.369.camel@chromite.mv.qlogic.com> Message-ID: <20090429215508.GW4431@obsidianresearch.com> > The problem is that MPI needs to be aware of the application doing > the free() and unregister or flush its MR cache for that virtual > address range. 
Of course it would be difficult for OpenMPI to have
> callbacks or hooks into every way memory could be allocated/freed
> that an application might use.

There are only three calls that affect the way VM memory maps to
physical and thus would invalidate the mr cache: mmap, munmap and brk.

Specifically what must be happening is the app registers memory, calls
munmap on it, then gets the same VA back from mmap and the kernel
level mr is still pointing to the original mmap:

  foo = mmap(...);
  ibv_reg_mr(mr,foo)
  munmap(foo..)
  mmap(...) == foo; // By chance due to VA randomization
  // Ooops, mr no longer matches /proc/self/maps

Actually, maybe that is the simple answer here - have the kernel fix up
the mr before returning from the 2nd mmap. Then the cache in user
space is still correct to assume that VA XX is registered and working.

Removing entries from the registration cache would have to be done in
some other way (age?).

Jason

From robert.j.woodruff at intel.com Wed Apr 29 15:07:48 2009
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Wed, 29 Apr 2009 15:07:48 -0700
Subject: [ofa-general] New proposal for memory management
In-Reply-To:
References:
Message-ID: <382A478CAD40FA4FB46605CF81FE39F42B5C7C5C@orsmsx507.amr.corp.intel.com>

Brian wrote,

>And Open Fabrics is the only "commodity" interfaces that makes implementers
>go through these pains. Myrinet's MX, Cray's Portals, and Quadric's Tports
>all handle the issues either at the driver library or kernel module level.

One important note is that in general, Myrinet, Quadrics, and even Portals
were designed primarily to run MPI, so it is not a surprise that their
interfaces map almost 1:1 to the MPI interfaces. Also, note that all of these
use a tag-matching capability, which also seems to map well to MPI.
RDMA/OFA verbs were designed to be a more general interface to
support lots of ULPs, networking (tcp/ip), storage, etc, not just MPI.
That said, for hardware that does support these tag-matching capabilities, like myrinet, Qlogic's HCA (i.e. PSM), OpenMX, and even quadrix, maybe OFA should have a generic tag-matching set of verbs that the MPIs could use instead of the RDMA verbs. The IHVs, like Qlogic, MX, and others that support tag-matching could plug into this generic tag-matching infrastructure. The MPIs would then only have to write one driver in MPI to support all these different IHVs that support tag-matching, and that MPI driver would be a very simple one, since the tag-matching verbs would map almost 1:1 to the MPI interfaces, like MX or PSM do. Heck, maybe we should even encourage the IBTA and iWARP associations to add tag-matching as a feature to the next version of the IBTA and iWARP specs. If they did that, it would make the MPI implementers life a lot easier. I would rather see that done, then hack thousands of lines of memory registration caching code and stuff it into the kernel. From bwbarre at sandia.gov Wed Apr 29 15:11:56 2009 From: bwbarre at sandia.gov (Barrett, Brian W) Date: Wed, 29 Apr 2009 16:11:56 -0600 Subject: [ofa-general] Re: New proposal for memory management In-Reply-To: <20090429215508.GW4431@obsidianresearch.com> Message-ID: On 4/29/09 15:55 , "Jason Gunthorpe" wrote: >> The problem is that MPI needs to be aware of the application doing >> the free() and unregister or flush its MR cache for that virtual >> address range. Of course it would be difficult for OpenMPI to have >> callbacks or hooks into every way memory could be allocated/freed >> that an application might use. > > There are only three calls that affect the way VM memory maps to > physical and thus would invalidate the mr cache: mmap, munmap and brk. There's also System V shared memory, which at least one scientific code out there uses. 
> Specifically what must be happening is the app registers memory, calls > munmap on it, then gets the same VA back from mmap and the kernel > level mr is still pointing to the original mmap: > > foo = mmap(...); > ibv_reg_mr(mr,foo) > munmap(foo..) > mmap(...) == foo; // By chance due to VA randomization > // Ooops, mr no longer matches proc/self/maps > > Actually, maybe that is the simple answer here - have the kernel fixup > the mr before returning from the 2nd mmap. Then the cache in user > space is still correct to assume that VA XX is registered and working. Yeah, although that could get really nasty as there's generally not one call to ibv_reg_mr per call to mmap. It's usually a couple of calls to ibv_reg_mr for different segments of the same mmap buffer (think sending faces of a 3-d block of space to the nearest neighbors in a physics simulation). > Removing entries from the registration cache would have to be done in > some other way (age?). Brian -- Brian W. Barrett Dept. 1423: Scalable System Software Sandia National Laboratories From rdreier at cisco.com Wed Apr 29 15:16:07 2009 From: rdreier at cisco.com (Roland Dreier) Date: Wed, 29 Apr 2009 15:16:07 -0700 Subject: [ofa-general] Re: [PATCH 2.6.30] RDMA/cxgb3: flushed sq wr completions get inserted twice in to the cqe In-Reply-To: <20090429191438.29393.59197.stgit@build.ogc.int> (Steve Wise's message of "Wed, 29 Apr 2009 14:14:38 -0500") References: <20090429191438.29393.59197.stgit@build.ogc.int> Message-ID: thanks, applied From jgunthorpe at obsidianresearch.com Wed Apr 29 15:21:25 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Wed, 29 Apr 2009 16:21:25 -0600 Subject: [ofa-general] Re: New proposal for memory management In-Reply-To: References: <20090429215508.GW4431@obsidianresearch.com> Message-ID: <20090429222125.GX4431@obsidianresearch.com> On Wed, Apr 29, 2009 at 04:11:56PM -0600, Barrett, Brian W wrote: > On 4/29/09 15:55 , "Jason Gunthorpe" > wrote: > > >> The problem is 
that MPI needs to be aware of the application doing
> >> the free() and unregister or flush its MR cache for that virtual
> >> address range. Of course it would be difficult for OpenMPI to have
> >> callbacks or hooks into every way memory could be allocated/freed
> >> that an application might use.
> >
> > There are only three calls that affect the way VM memory maps to
> > physical and thus would invalidate the mr cache: mmap, munmap and brk.
>
> There's also System V shared memory, which at least one scientific code out
> there uses.

People use that stuff? Yuk, toxic. :)

> Yeah, although that could get really nasty as there's generally not one call
> to ibv_reg_mr per call to mmap. It's usually a couple of calls to
> ibv_reg_mr for different segments of the same mmap buffer (think sending
> faces of a 3-d block of space to the nearest neighbors in a physics
> simulation).

Plus you have to be careful if VA randomization creates holes, i.e. you
might have an MR registration covering 1GB that got munmapped but after
a while you have a dozen fragmented mmaps in that same space.

A 3rd alternative would be to make mmap not return VAs that are still
registered with IB. Then on munmap you are assured to never get that
address back until you call ibv_dereg_mr. From time to time MPI can
inspect /proc/self/maps and remove cached registrations that have no VM
address.
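
As a rough illustration of the last idea (periodically checking cached registrations against /proc/self/maps), here is a minimal C sketch. It assumes a Linux-style /proc/self/maps whose lines begin with "start-end" in hex. The function name range_is_mapped is made up for this example and is not part of libibverbs or any MPI implementation. A real cache would walk its entries, call something like this for each one, and deregister (with ibv_dereg_mr) those whose range is gone.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return true if any byte of [start, end) is
 * currently mapped in this process, according to /proc/self/maps.
 * If the file cannot be read, return true so that callers keep the
 * cached registration rather than drop it on a guess.
 */
bool range_is_mapped(uintptr_t start, uintptr_t end)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512];
    bool at_line_start = true;
    bool found = false;

    if (f == NULL)
        return true;

    while (fgets(line, sizeof(line), f) != NULL) {
        size_t n = strlen(line);
        bool parse = at_line_start;
        unsigned long lo, hi;

        /* A chunk without '\n' means the rest of this line follows. */
        at_line_start = (n > 0 && line[n - 1] == '\n');

        if (!parse)
            continue;               /* tail of an over-long line */
        if (sscanf(line, "%lx-%lx", &lo, &hi) != 2)
            continue;               /* not a "start-end" record */

        /* Half-open intervals intersect iff each starts before the other ends. */
        if ((uintptr_t)lo < end && start < (uintptr_t)hi) {
            found = true;
            break;
        }
    }
    fclose(f);
    return found;
}
```

Note that this only says whether the VA range is mapped at all. It cannot tell whether the pages behind it are still the ones that were registered, which is exactly the case the thread is worried about, so it fits only the variant where mmap never hands back a still-registered VA.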
Jason From ralph.campbell at qlogic.com Wed Apr 29 15:28:00 2009 From: ralph.campbell at qlogic.com (Ralph Campbell) Date: Wed, 29 Apr 2009 15:28:00 -0700 Subject: [ofa-general] Re: New proposal for memory management In-Reply-To: <20090429222125.GX4431@obsidianresearch.com> References: <20090429215508.GW4431@obsidianresearch.com> <20090429222125.GX4431@obsidianresearch.com> Message-ID: <1241044080.3403.374.camel@chromite.mv.qlogic.com> On Wed, 2009-04-29 at 15:21 -0700, Jason Gunthorpe wrote: > On Wed, Apr 29, 2009 at 04:11:56PM -0600, Barrett, Brian W wrote: > > On 4/29/09 15:55 , "Jason Gunthorpe" > > wrote: > > > > >> The problem is that MPI needs to be aware of the application doing > > >> the free() and unregister or flush its MR cache for that virtual > > >> address range. Of course it would be difficult for OpenMPI to have > > >> callbacks or hooks into every way memory could be allocated/freed > > >> that an application might use. > > > > > > There are only three calls that affect the way VM memory maps to > > > physical and thus would invalidate the mr cache: mmap, munmap and brk. > > > > There's also System V shared memory, which at least one scientific code out > > there uses. > > People use that stuff? Yuk, toxic. :) > > > Yeah, although that could get really nasty as there's generally not one call > > to ibv_reg_mr per call to mmap. It's usually a couple of calls to > > ibv_reg_mr for different segments of the same mmap buffer (think sending > > faces of a 3-d block of space to the nearest neighbors in a physics > > simulation). > > Plus you have to be careful if VA randomization creates holes, ie you > might have a MR registration covering 1GB that got munmapped but after > a while you have a dozen fragmented mmaps in that same space. > > A 3rd alternative would be to make mmap not return VA's that are still > registered with IB. Then on munmap you are assured to never get that > address back until you call ibv_mem_unreg. 
From time to time MPI can > inspect proc/self/maps and remove cached registrations that have no VM > address. > > Jason Besides, mmap() only allocates a virtual address range in the user's address space. It doesn't fault in all the pages into physical memory. That happens when the application tries to read or write memory in the VA range of the mmap. The IB memory registrations need physical addresses and it would be impractical to do this for every mmap or brk. From bwbarre at sandia.gov Wed Apr 29 15:28:06 2009 From: bwbarre at sandia.gov (Barrett, Brian W) Date: Wed, 29 Apr 2009 16:28:06 -0600 Subject: [ofa-general] New proposal for memory management In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F42B5C7C5C@orsmsx507.amr.corp.intel.com> Message-ID: On 4/29/09 16:07 , "Woodruff, Robert J" wrote: > Brian wrote, > >> And Open Fabrics is the only "commodity" interfaces that makes implementers >> go through these pains. Myrinet's MX, Cray's Portals, and Quadric's Tports >> all handle the issues either at the driver library or kernel module level. > > One important note is that in general, Myrinet, Quadrics, and even Portals > were designed to primarily to run MPI, so it is not a surprise that their > interfaces > map almost 1:1 to the MPI interfaces. Also, note that all of these use a > tag-matching > capability, which also seems to map well to MPI. > RDMA/OFA verbs were designed to be a more general interface to > support lots of ULPs, networking (tcp/ip), storage, etc, not just MPI. True, although any of those could be extended to support the features necessary for storage and such (and many already support IP). The code complexity claim is also true of sockets (TCP, in particular). It's a lot less code and doesn't make us jump through nearly as many hoops. Obviously it doesn't perform as well, but 5-6x the code complexity for OFED isn't a good thing. > That said, for hardware that does support these tag-matching capabilities, > like > myrinet, Qlogic's HCA (i.e. 
PSM), OpenMX, and even quadrix, maybe OFA should > have a > generic tag-matching set of verbs that the MPIs could use instead of the > RDMA verbs. The IHVs, like Qlogic, MX, and others that support tag-matching > could > plug into this generic tag-matching infrastructure. The MPIs would then only > have to > write one driver in MPI to support all these different IHVs that support > tag-matching, > and that MPI driver would be a very simple one, since the tag-matching verbs > would map almost 1:1 to the MPI interfaces, like MX or PSM do. I think there are other problems with the verbs interface that would still make MPI implementers twitch (some of which are in the slides Jeff sent out to begin this discussion). But I certainly wouldn't say no to a real set of tag matching primitives. Of course, that opens a whole can of worms that I'm not sure OFED is ready to deal with. It also may or may not solve the memory registration problem. If the memory in the matching verb still had to be registered, we haven't solved the problem that started this discussion. So the verb would have to also handle memory registration, which seems to go against the general "OFA way". > Heck, maybe we should even encourage the IBTA and iWARP associations to add > tag-matching > as a feature to the next version of the IBTA and iWARP specs. If they did > that, > it would make the MPI implementers life a lot easier. I would rather see that > done, > then hack thousands of lines of memory registration caching code and stuff it > into the > kernel. I would love matching in the spec. But I'm not sure it directly solves any of the problems Jeff brought up in his talk at Sonoma. I can cope with having to do matching in the MPI (I'm going to have that code anyway for TCP networks). But it's the connection management, the memory pinning, and the receive buffer space requirements that really drive us nuts and require the bulk of our effort. Brian -- Brian W. Barrett Dept. 
1423: Scalable System Software
Sandia National Laboratories

From jgunthorpe at obsidianresearch.com Wed Apr 29 15:44:11 2009
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Wed, 29 Apr 2009 16:44:11 -0600
Subject: [ofa-general] Re: New proposal for memory management
In-Reply-To: <1241044080.3403.374.camel@chromite.mv.qlogic.com>
References: <20090429215508.GW4431@obsidianresearch.com> <20090429222125.GX4431@obsidianresearch.com> <1241044080.3403.374.camel@chromite.mv.qlogic.com>
Message-ID: <20090429224411.GC32114@obsidianresearch.com>

On Wed, Apr 29, 2009 at 03:28:00PM -0700, Ralph Campbell wrote:

> Besides, mmap() only allocates a virtual address range in the user's
> address space. It doesn't fault in all the pages into physical memory.
> That happens when the application tries to read or write memory in
> the VA range of the mmap. The IB memory registrations need physical
> addresses and it would be impractical to do this for every mmap or
> brk.

If your goal is to keep the mr consistent then you only need to fault
and pin pages from the new mmap that intersect with pre-existing
memory registrations.

I chucked out 3 things to consider:
 - Pin and register all process memory (no swap!)
 - Keep the MR consistent by pinning and registering new mmaps
   that intersect with pre-existing memory registrations
 - Keep the MR consistent by preventing the kernel from returning
   new mmaps that overlap existing memory registrations.
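
The second and third options above both hinge on one test: does a newly mapped range intersect an existing registration? A minimal C sketch of that test follows. The struct mr_range type and the find_overlaps function are hypothetical names used only for illustration. In the proposal this check would live in the kernel's mmap path rather than in user code, and a real implementation would likely use an interval tree instead of a linear scan.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one registered region: [start, start + len). */
struct mr_range {
    uintptr_t start;
    size_t len;
};

/*
 * Store in out[] the indices of registrations that intersect the new
 * mapping [start, start + len), up to max_out entries. Returns the total
 * number of intersecting registrations (which may exceed max_out).
 */
size_t find_overlaps(const struct mr_range *regs, size_t nregs,
                     uintptr_t start, size_t len,
                     size_t *out, size_t max_out)
{
    size_t count = 0;
    uintptr_t end = start + len;

    for (size_t i = 0; i < nregs; i++) {
        uintptr_t r_end = regs[i].start + regs[i].len;

        /* Half-open intervals intersect iff each starts before the other ends. */
        if (regs[i].start < end && start < r_end) {
            if (count < max_out)
                out[count] = i;
            count++;
        }
    }
    return count;
}
```

With option two, each index returned would name a registration whose pages need to be faulted in and re-pinned. With option three, a non-zero return would mean the kernel has to pick a different address for the new mapping.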
Jason

From robert.j.woodruff at intel.com Wed Apr 29 15:52:22 2009
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Wed, 29 Apr 2009 15:52:22 -0700
Subject: [ofa-general] New proposal for memory management
In-Reply-To:
References: <382A478CAD40FA4FB46605CF81FE39F42B5C7C5C@orsmsx507.amr.corp.intel.com>
Message-ID: <382A478CAD40FA4FB46605CF81FE39F42B5C7D05@orsmsx507.amr.corp.intel.com>

Brian wrote,

>I think there are other problems with the verbs interface that would still
>make MPI implementers twitch (some of which are in the slides Jeff sent out
>to begin this discussion). But I certainly wouldn't say no to a real set of
>tag matching primitives. Of course, that opens a whole can of worms that
>I'm not sure OFED is ready to deal with.

>It also may or may not solve the memory registration problem. If the memory
>in the matching verb still had to be registered, we haven't solved the
>problem that started this discussion. So the verb would have to also handle
>memory registration, which seems to go against the general "OFA way".

I think if we did such a thing, we could implement a set of tag-matching
primitives (similar to MX or PSM) that are kind of a separate library from
the OFA RDMA verbs, just like PSM for Qlogic is a separate library and not
part of the OFA verbs. Just like with MX and PSM, I think the registration
can be done by the tag-matching driver (like PSM or MX do) and not require
MPI to do it. Think of this as "the MPI tag-matching interface" library
for OFA.

However, this would only completely solve your problem and complexity
of using the OFA RDMA verbs if all the hardware vendors implemented
tag-matching in their NICs. Seems like if they want to better support MPIs,
that is what they would do and then MPIs would only have to use the simple
tag-matching primitives and would not have to worry about things like memory
registration caches and such.
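
To make the idea of tag-matching primitives a little more concrete, here is a minimal C sketch of the core matching step: an incoming message tag is compared against a list of posted receives, each carrying a tag and a mask of the bits that must match. All names here (posted_recv, match_recv) are hypothetical. This is not the MX or PSM API, only an illustration of the general technique, and it ignores buffer delivery and unexpected-message queues entirely.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical posted-receive entry; mask bits set to 0 act as wildcards. */
struct posted_recv {
    uint64_t tag;
    uint64_t mask;
    void *buf;
    size_t len;
    int in_use;
};

/*
 * Find the first posted receive (in posting order) whose masked tag equals
 * the masked tag of the incoming message. On a match the entry is consumed
 * and its index returned; otherwise -1 is returned.
 */
int match_recv(struct posted_recv *q, size_t n, uint64_t msg_tag)
{
    for (size_t i = 0; i < n; i++) {
        if (q[i].in_use && ((msg_tag ^ q[i].tag) & q[i].mask) == 0) {
            q[i].in_use = 0;
            return (int)i;
        }
    }
    return -1;
}
```

Scanning in posting order mirrors the MPI rule that receives are matched in the order they were posted. A library or NIC that owns this step could also register or pin q[i].buf itself at match time, which is the point made above about keeping registration out of MPI.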
Anyway, I think it is an interesting idea worth pursuing with the IHVs as the
long term solution to most of the issues that Jeff raised in Sonoma.

woody

From weiny2 at llnl.gov Wed Apr 29 16:04:38 2009
From: weiny2 at llnl.gov (Ira Weiny)
Date: Wed, 29 Apr 2009 16:04:38 -0700
Subject: [ofa-general] Re: Issues with combined routing in smpquery
In-Reply-To: <20090429145355.704fb2f5.weiny2@llnl.gov>
References: <20090428202736.0ff049e5.weiny2@llnl.gov> <20090428205525.4ffdd778.weiny2@llnl.gov> <20090429145355.704fb2f5.weiny2@llnl.gov>
Message-ID: <20090429160438.db62cde1.weiny2@llnl.gov>

On Wed, 29 Apr 2009 14:53:55 -0700 Ira Weiny wrote:

> I have traced this down a bit more.
>
> The drslid and drdlid have been encoded in the MAD reversed!
>
> This has happened somewhere between version 1.5.0 and 1.5.1.

I know what changed but there appears to be a discrepancy between ib_mad_f
and the spec.

Commit 2dbb8b95d9dc27423a6fdb85d88ef385ecee0005
"libibmad: remove c99 definitions within the ib_mad_f structure" removed the
designated initializers from ib_mad_f. Applying the patch below aligns the
MAD_FIELDS with ib_mad_f.

However, if you look at the offsets specified in ib_mad_f they are wrong.
According to 14.2.1.2, DrSLID is at offset 32 bytes (256 bits). ib_mad_f
places the offset at 272. I have verified the bytes using a debugger and
byte 32 is the DrSLID.

I hesitate to say there is a bug in mad_set_field; however, there does appear
to be something amiss. :-/

Ira

15:03:15 > git diff
diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
index 1aaaa1b..2b89193 100644
--- a/libibmad/include/infiniband/mad.h
+++ b/libibmad/include/infiniband/mad.h
@@ -246,8 +246,8 @@ enum MAD_FIELDS {
 	IB_MAD_MKEY_F,
 
 	/* word 9 (32-37 bytes) */
-	IB_DRSMP_DRSLID_F,
 	IB_DRSMP_DRDLID_F,
+	IB_DRSMP_DRSLID_F,
 
 	/* word 10,11 (36-43 bytes) */
 	IB_SA_MKEY_F,

From john.gregor at qlogic.com Wed Apr 29 18:28:06 2009
From: john.gregor at qlogic.com (John A.
Gregor)
Date: Wed, 29 Apr 2009 18:28:06 -0700
Subject: [ofa-general] Re: New proposal for memory management
In-Reply-To: <6B60B4BD-FB97-45DD-94DB-51883E05F14F@cisco.com>
References: <3E350D1F-BADB-4B7B-8307-BE0DA085DBF3@cisco.com> <1240956668.3403.324.camel@chromite.mv.qlogic.com> <1241039058.3403.369.camel@chromite.mv.qlogic.com> <6B60B4BD-FB97-45DD-94DB-51883E05F14F@cisco.com>
Message-ID: <49f8fea6.LDLCEFIRlnHmS++4%john.gregor@qlogic.com>

Jeff Squyres wrote:
> On Apr 29, 2009, at 5:04 PM, Ralph Campbell wrote:
> > It seems to me that this is mostly an issue for rendezvous sends.
> > Eager sends can use a pool of preregistered memory which are
> > reused as data is copied from the buffer and ibv_post_recv()'ed.
>
> Yes, exactly!

Another MPI/Verbs neophyte chiming in...

So, for a rendezvous, I imagine there's an exchange that looks vaguely
like:

    A               B
    |               |
    +-- RTS ------->|
    |               |
    |<---------CTS--+
    |               |
    +-- DATA ------>|
    +-- DATA ------>|
    +-- DATA ------>|
    +-- DATA ------>|
    :               :

And the first critical path for B is to receive the RTS and turn it
around into a CTS as quickly as possible.

It seems like all you need at the time of the CTS is a physical address
(or set of them) to program into your hardware and set up the mapping
from memory region to chip resources.

While the CTS is flying back to the requester, there is time for playing
with mappings and other tricks - anything that doesn't invalidate the
physical mappings in the hardware.

So, how about this:

Maintain a pool of pre-pinned pages.

When an RTS comes in, use one of the pre-pinned buffers as the place the
DATA will land. Set up the remaining hw context to enable receipt into
the page(s) and fire back your CTS.

While the CTS is in flight and the DATA is streaming back (and you
therefore have a couple microseconds to play with), remap the virt-to-phys
mapping of the application so that the original virtual address now
points at the pre-pinned page.
If the transfer didn't completely fill a page, provide an option to copy
into the new page any memory that wasn't overwritten by the transfer.

Add the original physical page into the pool of pages available as a
buffer. The app goes its merry way using the new physical page.

Of course, this does presuppose a system call that looks like
phys_swap(void *a, void *b) that would atomically swap the physical
pages backing virtual address a and b.

And I know some architectures have funny page-coloring issues wrt
what virtual addresses can map to what physical addresses. So it
might have to be a pool per color for those.

Anyway, just a thought.

-John Gregor

From sfr at canb.auug.org.au Wed Apr 29 20:22:58 2009
From: sfr at canb.auug.org.au (Stephen Rothwell)
Date: Thu, 30 Apr 2009 13:22:58 +1000
Subject: [ofa-general] Re: linux-next: infiniband tree build failure
In-Reply-To:
References: <20090429140101.d9c7467c.sfr@canb.auug.org.au>
Message-ID: <20090430132258.2febce7b.sfr@canb.auug.org.au>

Hi Roland,

On Wed, 29 Apr 2009 09:51:47 -0700 Roland Dreier wrote:
>
> Thanks for pointing this out. I rolled the below into the patch in
> question, which should fix this.

Thanks.

--
Cheers,
Stephen Rothwell sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

From aafabbri at cisco.com Wed Apr 29 22:30:20 2009
From: aafabbri at cisco.com (Aaron Fabbri)
Date: Thu, 30 Apr 2009 05:30:20 +0000 (UTC)
Subject: [ofa-general] New proposal for memory management
References:
Message-ID:

Jeff Squyres cisco.com> writes:
...
>
> Introduction:
> =============
>
> MPI has long had a problem maintaining its own verbs memory
> registration cache in userspace.
The main issue is that user > applications are responsible for allocating/freeing their own data > buffers -- the MPI layer does not (usually) have visibility when > application buffers are allocated or freed. I'm late to the debate, so sorry if you've already covered this... Have you considered changing the MPI API to require applications to use MPI to allocate any/all buffers that may be used for network I/O? That is, instead of calling malloc() et al., call a new mpi_malloc() which allocates from pre- registered memory. I'm sure it is hard to just up and change MPI, but it seems like the right thing to do. (While you're at it, change the sockets interface too.) -AF From monis at Voltaire.COM Wed Apr 29 23:29:28 2009 From: monis at Voltaire.COM (Moni Shoua) Date: Thu, 30 Apr 2009 09:29:28 +0300 Subject: [ofa-general] ***SPAM*** Re: IB Bonding errors with recent kernel In-Reply-To: <52436c7f0904281404ycc353f2k95fb6d8168e28276@mail.gmail.com> References: <52436c7f0904200421s53d65a31vdcbb26babd9be196@mail.gmail.com> <49EC6505.40406@Voltaire.COM> <52436c7f0904200629q447d969cwbe12ce8bb606584b@mail.gmail.com> <49EDB644.8040604@Voltaire.COM> <52436c7f0904281404ycc353f2k95fb6d8168e28276@mail.gmail.com> Message-ID: <49F94548.7080505@Voltaire.COM> Dennis Portello wrote: > Hi Moni, > > Thank you for looking into this. The discovery of multicast not working > with bonding caused a major course correction in my project, I haven't > checked emails from the list in a few days. I expect to verify if > bonding works as you described later this week. > Please let me know how it goes > Unfortunately, bonding as you described will not work in my situation > since we use Ethernet bonding as well. I'm not sure I understand why bonding won't work. The only limitation for bonding is that you can't enslave slaves of different types under the same master. However, you can have several masters, each with a different types. 
The issue of multicast that isn't working for you has an easy workaround. > > I hope to revisit IPoIB at a later time. > > Thanks again, > Dennis P. > > On Tue, Apr 21, 2009 at 8:04 AM, Moni Shoua > wrote: > > Dennis Portello wrote: > > I can confirm that this issue exists beyond Redhat 4, I'm using Ubuntu > > 8.10 (2.6.27). > > > > I'm using ib-bond and I've also tried adding he bonds directly with > > > > echo +bond0 > /sys/class/net/bonding_masters > > echo 1 > /sys/class/net/bond0/bonding/mode > > echo 100 > /sys/class/net/bond0/bonding/miimon > > echo +ib0 > /sys/class/net/bond0/bonding/slaves > > echo +ib1 > /sys/class/net/bond0/bonding/slaves > > ifconfig bond0 192.168.47.102/24 > > > route add -net 224.0.0.0/3 > gw 192.168.47.100 > > > I guess that what you see is a result of 2 issues. > First, a garbage multicast addresses that is passed to ib0 by bond0 > The second, a garbage mulicast address in the list of mcast > addresses of interface ib0 prevents other legal addresses from > joining the mcast group.
> > To avoid this (at least as a workaround) you should make sure that > interface bond0 won't be up before it has ib slaves > or in other words, bond0 was never up between 'modprobe bonding' and > 'echo +ib0 > /sys/class/net/bond0/bonding/slaves' > > Let me know if this helps > > > > ------------------------------------------------------------------------ > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

From ogerlitz at voltaire.com Thu Apr 30 00:15:16 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 30 Apr 2009 10:15:16 +0300
Subject: [ofa-general] Re: adding a purely software based RDMA driver
In-Reply-To:
References:
Message-ID: <49F95004.7050608@voltaire.com>

Philip Frey1 wrote:
> as announced earlier on this channel as well as at the Sonoma
> Workshop, we are adding a purely software based RDMA driver to OFED.

You should have posted your code for review and merge into the mainline
(upstream) Linux kernel.

Or.

From ogerlitz at voltaire.com Thu Apr 30 00:43:10 2009
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 30 Apr 2009 10:43:10 +0300
Subject: [ofa-general] IPoIB performance numbers?
In-Reply-To:
References:
Message-ID: <49F9568E.6080000@voltaire.com>

Arlin Davis wrote:
> Does anyone have IPoIB performance numbers comparing connected versus unconnected modes?

From my experience using connected mode one can almost saturate a DDR link
(1.9 GB/s). I never saw datagram mode yielding such results, but when I did
the tests LRO was not yet available, so it's possible things got better.
Going beyond DDR bandwidth with connected mode may be possible as well, e.g.
with the suggested set of optimized sysctl calls.

Or.
From arlin.r.davis at intel.com Thu Apr 30 01:10:36 2009 From: arlin.r.davis at intel.com (Arlin Davis) Date: Thu, 30 Apr 2009 01:10:36 -0700 Subject: [ofa-general] [PATCH] uDAPL v2: dtest: add flush EVD call after data transfer errors Message-ID: <47DF6DF314FC4E0B95759FCD47D5C483@amr.corp.intel.com> Flush and print entries on async, request, and receive queues after any data transfer error. Will help identify failing operation during operations without completion events requested. Fix -B0 so burst size of 0 works. Signed-off-by: Arlin Davis --- test/dtest/dtest.c | 61 ++++++++++++++++++++++++++++++++++++++++----------- 1 files changed, 48 insertions(+), 13 deletions(-) diff --git a/test/dtest/dtest.c b/test/dtest/dtest.c index d099c95..6ff7798 100755 --- a/test/dtest/dtest.c +++ b/test/dtest/dtest.c @@ -61,7 +61,7 @@ #define ntohll _byteswap_uint64 #define htonll _byteswap_uint64 -#else // _WIN32 || _WIN64 +#else // _WIN32 || _WIN64 #include #include @@ -89,7 +89,7 @@ #define ntohll(x) bswap_64(x) #endif -#endif // _WIN32 || _WIN64 +#endif // _WIN32 || _WIN64 /* Debug: 1 == connect & close only, otherwise full-meal deal */ #define CONNECT_ONLY 0 @@ -229,6 +229,37 @@ DAT_RETURN do_ping_pong_msg(void); #define LOGPRINTF if (verbose) printf +void flush_evds(void) +{ + DAT_EVENT event; + + /* Flush async error queue */ + printf("%d ERR: Checking ASYNC EVD...\n", getpid()); + while (dat_evd_dequeue(h_async_evd, &event) == DAT_SUCCESS) { + printf(" ASYNC EVD ENTRY: handle=%p reason=%d\n", + event.event_data.asynch_error_event_data.dat_handle, + event.event_data.asynch_error_event_data.reason); + } + /* Flush receive queue */ + printf("%d ERR: Checking RECEIVE EVD...\n", getpid()); + while (dat_evd_dequeue(h_dto_rcv_evd, &event) == DAT_SUCCESS) { + printf(" RCV EVD ENTRY: op=%d stat=%d ln=%d ck="F64x"\n", + event.event_data.dto_completion_event_data.operation, + event.event_data.dto_completion_event_data.status, + 
event.event_data.dto_completion_event_data.transfered_length, + event.event_data.dto_completion_event_data.user_cookie.as_64); + } + /* Flush request queue */ + printf("%d ERR: Checking REQUEST EVD...\n", getpid()); + while (dat_evd_dequeue(h_dto_req_evd, &event) == DAT_SUCCESS) { + printf(" REQ EVD ENTRY: op=%d stat=%d ln=%d ck="F64x"\n", + event.event_data.dto_completion_event_data.operation, + event.event_data.dto_completion_event_data.status, + event.event_data.dto_completion_event_data.transfered_length, + event.event_data.dto_completion_event_data.user_cookie.as_64); + } +} + int main(int argc, char **argv) { int i, c; @@ -305,8 +336,8 @@ int main(int argc, char **argv) fflush(stdout); /* allocate send and receive buffers */ - if (((rbuf = malloc(buf_len * burst)) == NULL) || - ((sbuf = malloc(buf_len * burst)) == NULL)) { + if (((rbuf = malloc(buf_len * (burst+1))) == NULL) || + ((sbuf = malloc(buf_len * (burst+1))) == NULL)) { perror("malloc"); exit(1); } @@ -446,7 +477,7 @@ int main(int argc, char **argv) goto cleanup; #endif - /*********** RDMA write data *************/ + /*********** RDMA write data *************/ ret = do_rdma_write_with_msg(); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error do_rdma_write_with_msg: %s\n", @@ -455,7 +486,7 @@ int main(int argc, char **argv) } else LOGPRINTF("%d do_rdma_write_with_msg complete\n", getpid()); - /*********** RDMA read data *************/ + /*********** RDMA read data *************/ ret = do_rdma_read_with_msg(); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error do_rdma_read_with_msg: %s\n", @@ -464,7 +495,7 @@ int main(int argc, char **argv) } else LOGPRINTF("%d do_rdma_read_with_msg complete\n", getpid()); - /*********** PING PING messages ************/ + /*********** PING PING messages ************/ ret = do_ping_pong_msg(); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error do_ping_pong_msg: %s\n", @@ -475,9 +506,10 @@ int main(int argc, char **argv) goto complete; } - cleanup: +cleanup: + 
flush_evds(); failed++; - complete: +complete: /* disconnect and free EP resources */ if (h_ep != DAT_HANDLE_NULL) { @@ -541,7 +573,6 @@ int main(int argc, char **argv) if (ret != DAT_SUCCESS) { fprintf(stderr, "%d: Error Adaptor close: %s\n", getpid(), DT_RetToString(ret)); - exit(1); } else LOGPRINTF("%d Closed Interface Adaptor\n", getpid()); @@ -552,6 +583,9 @@ int main(int argc, char **argv) printf("\n%d: DAPL Test Complete. %s\n\n", getpid(), failed ? "FAILED" : "PASSED"); + fflush(stderr); + fflush(stdout); + if (!performance_times) exit(0); @@ -1751,7 +1785,7 @@ DAT_RETURN register_rdma_memory(void) ret = dat_lmr_create(h_ia, DAT_MEM_TYPE_VIRTUAL, region, - buf_len * burst, + buf_len * (burst+1), h_pz, DAT_MEM_PRIV_ALL_FLAG, DAT_VA_TYPE_VA, @@ -1778,7 +1812,7 @@ DAT_RETURN register_rdma_memory(void) ret = dat_lmr_create(h_ia, DAT_MEM_TYPE_VIRTUAL, region, - buf_len * burst, + buf_len * (burst + 1), h_pz, DAT_MEM_PRIV_ALL_FLAG, DAT_VA_TYPE_VA, @@ -1917,7 +1951,7 @@ DAT_RETURN create_events(void) /* create dto RCV EVD, with CNO if use_cno was set */ ret = dat_evd_create(h_ia, - MSG_BUF_COUNT, + MSG_BUF_COUNT + burst, h_dto_cno, DAT_EVD_DTO_FLAG, &h_dto_rcv_evd); if (ret != DAT_SUCCESS) { fprintf(stderr, "%d Error dat_evd_create RCV: %s\n", @@ -2110,3 +2144,4 @@ void print_usage(void) printf("P: provider name (default = OpenIB-cma)\n"); printf("\n"); } + -- 1.5.2.5 From arlin.r.davis at intel.com Thu Apr 30 01:10:41 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Thu, 30 Apr 2009 01:10:41 -0700 Subject: [ofa-general] [PATCH] uDAPL v2: openib_scm, cma: use direct SGE mappings from dat_lmr_triplet to ibv_sge Message-ID: no need to rebuild scatter gather list given that DAT v2.0 is now aligned with verbs ibv_sge. Fix ib_send_op_type_t typedef. 
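Editorial sketch, not part of the patch: casting a DAT_LMR_TRIPLET array directly to ibv_sge is only safe if the two structures have the same field order and widths. The check below illustrates that precondition with two stand-in structs. `sge_like` follows the addr/length/lkey layout of struct ibv_sge, while `triplet_like` is an assumed mirror of the DAT v2.0 triplet (virtual_address, segment_length, lmr_context). The names `sge_like`, `triplet_like` and `layouts_match` are hypothetical and do not appear in dapl or libibverbs.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in with the same field order and widths as struct ibv_sge. */
struct sge_like {
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};

/* ASSUMPTION: mirrors the DAT v2.0 DAT_LMR_TRIPLET field order and widths. */
struct triplet_like {
    uint64_t virtual_address;
    uint32_t segment_length;
    uint32_t lmr_context;
};

/* Returns 1 when both structs have the same size and matching field offsets. */
static int layouts_match(void)
{
    return sizeof(struct sge_like) == sizeof(struct triplet_like) &&
           offsetof(struct sge_like, addr) ==
               offsetof(struct triplet_like, virtual_address) &&
           offsetof(struct sge_like, length) ==
               offsetof(struct triplet_like, segment_length) &&
           offsetof(struct sge_like, lkey) ==
               offsetof(struct triplet_like, lmr_context);
}
```

A real build could run the same comparison against the actual headers as a compile-time assertion, so that a future layout change fails early instead of corrupting work requests.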
Signed-off-by: Arlin Davis --- dapl/openib_cma/dapl_ib_dto.h | 156 ++++++++++------------------------------ dapl/openib_cma/dapl_ib_util.h | 2 +- dapl/openib_scm/dapl_ib_dto.h | 155 ++++++++++----------------------------- dapl/openib_scm/dapl_ib_util.h | 2 +- 4 files changed, 79 insertions(+), 236 deletions(-) diff --git a/dapl/openib_cma/dapl_ib_dto.h b/dapl/openib_cma/dapl_ib_dto.h index feaba6e..d97c26b 100644 --- a/dapl/openib_cma/dapl_ib_dto.h +++ b/dapl/openib_cma/dapl_ib_dto.h @@ -54,8 +54,6 @@ #include #endif -#define DEFAULT_DS_ENTRIES 8 - STATIC _INLINE_ int dapls_cqe_opcode(ib_work_completion_t *cqe_p); /* @@ -70,10 +68,9 @@ dapls_ib_post_recv ( IN DAT_COUNT segments, IN DAT_LMR_TRIPLET *local_iov ) { - ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES]; - ib_data_segment_t *ds_array_p, *ds_array_start_p = NULL; struct ibv_recv_wr wr; struct ibv_recv_wr *bad_wr; + ib_data_segment_t *ds = (ib_data_segment_t *)local_iov; DAT_COUNT i, total_len; int ret; @@ -81,48 +78,26 @@ dapls_ib_post_recv ( " post_rcv: ep %p cookie %p segs %d l_iov %p\n", ep_ptr, cookie, segments, local_iov); - if (segments <= DEFAULT_DS_ENTRIES) - ds_array_p = ds_array; - else - ds_array_start_p = ds_array_p = - dapl_os_alloc(segments * sizeof(ib_data_segment_t)); - - if (NULL == ds_array_p) - return (DAT_INSUFFICIENT_RESOURCES); - /* setup work request */ total_len = 0; wr.next = 0; - wr.num_sge = 0; + wr.num_sge = segments; wr.wr_id = (uint64_t)(uintptr_t)cookie; - wr.sg_list = ds_array_p; - - for (i = 0; i < segments; i++) { - if (!local_iov[i].segment_length) - continue; - - ds_array_p->addr = (uint64_t) local_iov[i].virtual_address; - ds_array_p->length = local_iov[i].segment_length; - ds_array_p->lkey = local_iov[i].lmr_context; - - dapl_dbg_log(DAPL_DBG_TYPE_EP, - " post_rcv: l_key 0x%x va %p len %d\n", - ds_array_p->lkey, ds_array_p->addr, - ds_array_p->length ); - - total_len += ds_array_p->length; - wr.num_sge++; - ds_array_p++; - } - - if (cookie != NULL) + wr.sg_list = ds; + + 
if (cookie != NULL) { + for (i = 0; i < segments; i++) { + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_rcv: l_key 0x%x va %p len %d\n", + ds->lkey, ds->addr, ds->length); + total_len += ds->length; + ds++; + } cookie->val.dto.size = total_len; + } ret = ibv_post_recv(ep_ptr->qp_handle->cm_id->qp, &wr, &bad_wr); - if (ds_array_start_p != NULL) - dapl_os_free(ds_array_start_p, segments * sizeof(ib_data_segment_t)); - if (ret) return( dapl_convert_errno(errno,"ibv_recv") ); @@ -147,10 +122,9 @@ dapls_ib_post_send ( IN const DAT_RMR_TRIPLET *remote_iov, IN DAT_COMPLETION_FLAGS completion_flags) { - ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES]; - ib_data_segment_t *ds_array_p, *ds_array_start_p = NULL; struct ibv_send_wr wr; struct ibv_send_wr *bad_wr; + ib_data_segment_t *ds = (ib_data_segment_t *)local_iov; ib_hca_transport_t *ibt_ptr = &ep_ptr->header.owner_ia->hca_ptr->ib_trans; DAT_COUNT i, total_len; @@ -162,48 +136,25 @@ dapls_ib_post_send ( ep_ptr, op_type, cookie, segments, local_iov, remote_iov, completion_flags); - dapl_dbg_log(DAPL_DBG_TYPE_EP, - " post_snd: ep %p cookie %p segs %d l_iov %p\n", - ep_ptr, cookie, segments, local_iov); - - if(segments <= DEFAULT_DS_ENTRIES) - ds_array_p = ds_array; - else - ds_array_start_p = ds_array_p = - dapl_os_alloc(segments * sizeof(ib_data_segment_t)); - - if (NULL == ds_array_p) - return (DAT_INSUFFICIENT_RESOURCES); - /* setup the work request */ wr.next = 0; wr.opcode = op_type; - wr.num_sge = 0; + wr.num_sge = segments; wr.send_flags = 0; wr.wr_id = (uint64_t)(uintptr_t)cookie; - wr.sg_list = ds_array_p; + wr.sg_list = ds; total_len = 0; - for (i = 0; i < segments; i++ ) { - if ( !local_iov[i].segment_length ) - continue; - - ds_array_p->addr = (uint64_t) local_iov[i].virtual_address; - ds_array_p->length = local_iov[i].segment_length; - ds_array_p->lkey = local_iov[i].lmr_context; - - dapl_dbg_log(DAPL_DBG_TYPE_EP, - " post_snd: lkey 0x%x va %p len %d\n", - ds_array_p->lkey, ds_array_p->addr, - ds_array_p->length ); 
- - total_len += ds_array_p->length; - wr.num_sge++; - ds_array_p++; - } - - if (cookie != NULL) + if (cookie != NULL) { + for (i = 0; i < segments; i++ ) { + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: lkey 0x%x va %p len %d\n", + ds->lkey, ds->addr, ds->length ); + total_len += ds->length; + ds++; + } cookie->val.dto.size = total_len; + } if (wr.num_sge && (op_type == OP_RDMA_WRITE || op_type == OP_RDMA_READ)) { @@ -214,7 +165,6 @@ dapls_ib_post_send ( wr.wr.rdma.rkey, wr.wr.rdma.remote_addr); } - /* inline data for send or write ops */ if ((total_len <= ibt_ptr->max_inline_send) && ((op_type == OP_SEND) || (op_type == OP_RDMA_WRITE))) @@ -234,9 +184,6 @@ dapls_ib_post_send ( ret = ibv_post_send(ep_ptr->qp_handle->cm_id->qp, &wr, &bad_wr); - if (ds_array_start_p != NULL) - dapl_os_free(ds_array_start_p, segments * sizeof(ib_data_segment_t)); - if (ret) return( dapl_convert_errno(errno,"ibv_send") ); @@ -319,61 +266,37 @@ dapls_ib_post_ext_send ( IN DAT_UINT64 swap, IN DAT_COMPLETION_FLAGS completion_flags) { - ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES]; - ib_data_segment_t *ds_array_p, *ds_array_start_p = NULL; struct ibv_send_wr wr; struct ibv_send_wr *bad_wr; + ib_data_segment_t *ds = (ib_data_segment_t *)local_iov; DAT_COUNT i, total_len; int ret; dapl_dbg_log(DAPL_DBG_TYPE_EP, - " post_snd: ep %p op %d ck %p sgs", + " post_ext_snd: ep %p op %d ck %p sgs", "%d l_iov %p r_iov %p f %d\n", ep_ptr, op_type, cookie, segments, local_iov, remote_iov, completion_flags); - dapl_dbg_log(DAPL_DBG_TYPE_EP, - " post_snd: ep %p cookie %p segs %d l_iov %p\n", - ep_ptr, cookie, segments, local_iov); - - if(segments <= DEFAULT_DS_ENTRIES) - ds_array_p = ds_array; - else - ds_array_start_p = ds_array_p = - dapl_os_alloc(segments * sizeof(ib_data_segment_t)); - - if (NULL == ds_array_p) - return (DAT_INSUFFICIENT_RESOURCES); - /* setup the work request */ wr.next = 0; wr.opcode = op_type; - wr.num_sge = 0; + wr.num_sge = segments; wr.send_flags = 0; wr.wr_id = 
(uint64_t)(uintptr_t)cookie; - wr.sg_list = ds_array_p; + wr.sg_list = ds; total_len = 0; - for (i = 0; i < segments; i++ ) { - if ( !local_iov[i].segment_length ) - continue; - - ds_array_p->addr = (uint64_t) local_iov[i].virtual_address; - ds_array_p->length = local_iov[i].segment_length; - ds_array_p->lkey = local_iov[i].lmr_context; - - dapl_dbg_log(DAPL_DBG_TYPE_EP, - " post_snd: lkey 0x%x va %p len %d\n", - ds_array_p->lkey, ds_array_p->addr, - ds_array_p->length ); - - total_len += ds_array_p->length; - wr.num_sge++; - ds_array_p++; - } - - if (cookie != NULL) + if (cookie != NULL) { + for (i = 0; i < segments; i++ ) { + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_ext_snd: lkey 0x%x va %p ln %d\n", + ds->lkey, ds->addr, ds->length); + total_len += ds->length; + ds++; + } cookie->val.dto.size = total_len; + } switch (op_type) { case OP_RDMA_WRITE_IMM: @@ -433,9 +356,6 @@ dapls_ib_post_ext_send ( ret = ibv_post_send(ep_ptr->qp_handle->cm_id->qp, &wr, &bad_wr); - if (segments > DEFAULT_DS_ENTRIES) - dapl_os_free(ds_array_start_p, segments * sizeof(ib_data_segment_t)); - if (ret) return( dapl_convert_errno(errno,"ibv_send") ); diff --git a/dapl/openib_cma/dapl_ib_util.h b/dapl/openib_cma/dapl_ib_util.h index 93635ef..46c9b35 100755 --- a/dapl/openib_cma/dapl_ib_util.h +++ b/dapl/openib_cma/dapl_ib_util.h @@ -177,7 +177,7 @@ typedef struct dapl_cm_id *dp_ib_cm_handle_t; typedef struct dapl_cm_id *ib_cm_srvc_handle_t; /* Operation and state mappings */ -typedef enum ibv_send_flags ib_send_op_type_t; +typedef int ib_send_op_type_t; typedef struct ibv_sge ib_data_segment_t; typedef enum ibv_qp_state ib_qp_state_t; typedef enum ibv_event_type ib_async_event_type; diff --git a/dapl/openib_scm/dapl_ib_dto.h b/dapl/openib_scm/dapl_ib_dto.h index ff338fc..9118b2e 100644 --- a/dapl/openib_scm/dapl_ib_dto.h +++ b/dapl/openib_scm/dapl_ib_dto.h @@ -73,10 +73,9 @@ dapls_ib_post_recv ( IN DAT_COUNT segments, IN DAT_LMR_TRIPLET *local_iov ) { - ib_data_segment_t 
ds_array[DEFAULT_DS_ENTRIES]; - ib_data_segment_t *ds_array_p, *ds_array_start_p = NULL; struct ibv_recv_wr wr; struct ibv_recv_wr *bad_wr; + ib_data_segment_t *ds = (ib_data_segment_t *)local_iov; DAT_COUNT i, total_len; int ret; @@ -84,50 +83,28 @@ dapls_ib_post_recv ( " post_rcv: ep %p cookie %p segs %d l_iov %p\n", ep_ptr, cookie, segments, local_iov); - if (segments <= DEFAULT_DS_ENTRIES) - ds_array_p = ds_array; - else - ds_array_start_p = ds_array_p = - dapl_os_alloc(segments * sizeof(ib_data_segment_t)); - - if (NULL == ds_array_p) - return (DAT_INSUFFICIENT_RESOURCES); - /* setup work request */ total_len = 0; wr.next = 0; - wr.num_sge = 0; + wr.num_sge = segments; wr.wr_id = (uint64_t)(uintptr_t)cookie; - wr.sg_list = ds_array_p; - - for (i = 0; i < segments; i++) { - if (!local_iov[i].segment_length) - continue; - - ds_array_p->addr = (uint64_t) local_iov[i].virtual_address; - ds_array_p->length = local_iov[i].segment_length; - ds_array_p->lkey = local_iov[i].lmr_context; - - dapl_dbg_log(DAPL_DBG_TYPE_EP, - " post_rcv: l_key 0x%x va %p len %d\n", - ds_array_p->lkey, ds_array_p->addr, - ds_array_p->length ); - - total_len += ds_array_p->length; - wr.num_sge++; - ds_array_p++; - } - - if (cookie != NULL) + wr.sg_list = ds; + + if (cookie != NULL) { + for (i = 0; i < segments; i++) { + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_rcv: l_key 0x%x va %p len %d\n", + ds->lkey, ds->addr, ds->length ); + total_len += ds->length; + ds++; + } cookie->val.dto.size = total_len; + } ret = ibv_post_recv(ep_ptr->qp_handle, &wr, &bad_wr); - if (ds_array_start_p != NULL) - dapl_os_free(ds_array_start_p, segments * sizeof(ib_data_segment_t)); - if (ret) - return( dapl_convert_errno(errno,"ibv_recv") ); + return(dapl_convert_errno(errno,"ibv_recv")); DAPL_CNTR(ep_ptr, DCNT_EP_POST_RECV); DAPL_CNTR_DATA(ep_ptr, DCNT_EP_POST_RECV_DATA, total_len); @@ -150,10 +127,9 @@ dapls_ib_post_send ( IN const DAT_RMR_TRIPLET *remote_iov, IN DAT_COMPLETION_FLAGS completion_flags) { - 
ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES]; - ib_data_segment_t *ds_array_p, *ds_array_start_p = NULL; struct ibv_send_wr wr; struct ibv_send_wr *bad_wr; + ib_data_segment_t *ds = (ib_data_segment_t *)local_iov; ib_hca_transport_t *ibt_ptr = &ep_ptr->header.owner_ia->hca_ptr->ib_trans; DAT_COUNT i, total_len; @@ -165,19 +141,6 @@ dapls_ib_post_send ( ep_ptr, op_type, cookie, segments, local_iov, remote_iov, completion_flags); - dapl_dbg_log(DAPL_DBG_TYPE_EP, - " post_snd: ep %p cookie %p segs %d l_iov %p\n", - ep_ptr, cookie, segments, local_iov); - - if(segments <= DEFAULT_DS_ENTRIES) - ds_array_p = ds_array; - else - ds_array_start_p = ds_array_p = - dapl_os_alloc(segments * sizeof(ib_data_segment_t)); - - if (NULL == ds_array_p) - return (DAT_INSUFFICIENT_RESOURCES); - #ifdef DAT_EXTENSIONS if (ep_ptr->qp_handle->qp_type != IBV_QPT_RC) return(DAT_ERROR(DAT_INVALID_HANDLE, DAT_INVALID_HANDLE_EP)); @@ -185,32 +148,22 @@ dapls_ib_post_send ( /* setup the work request */ wr.next = 0; wr.opcode = op_type; - wr.num_sge = 0; + wr.num_sge = segments; wr.send_flags = 0; wr.wr_id = (uint64_t)(uintptr_t)cookie; - wr.sg_list = ds_array_p; + wr.sg_list = ds; total_len = 0; - for (i = 0; i < segments; i++ ) { - if ( !local_iov[i].segment_length ) - continue; - - ds_array_p->addr = (uint64_t) local_iov[i].virtual_address; - ds_array_p->length = local_iov[i].segment_length; - ds_array_p->lkey = local_iov[i].lmr_context; - - dapl_dbg_log(DAPL_DBG_TYPE_EP, - " post_snd: lkey 0x%x va %p len %d\n", - ds_array_p->lkey, ds_array_p->addr, - ds_array_p->length ); - - total_len += ds_array_p->length; - wr.num_sge++; - ds_array_p++; - } - - if (cookie != NULL) + if (cookie != NULL) { + for (i = 0; i < segments; i++ ) { + dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: lkey 0x%x va %p len %d\n", + ds->lkey, ds->addr, ds->length ); + total_len += ds->length; + ds++; + } cookie->val.dto.size = total_len; + } if (wr.num_sge && (op_type == OP_RDMA_WRITE || op_type == OP_RDMA_READ)) { @@ 
-241,11 +194,8 @@ dapls_ib_post_send ( ret = ibv_post_send(ep_ptr->qp_handle, &wr, &bad_wr); - if (ds_array_start_p != NULL) - dapl_os_free(ds_array_start_p, segments * sizeof(ib_data_segment_t)); - if (ret) - return( dapl_convert_errno(errno,"ibv_send") ); + return(dapl_convert_errno(errno,"ibv_send")); #ifdef DAPL_COUNTERS switch (op_type) { @@ -339,10 +289,9 @@ dapls_ib_post_ext_send ( IN DAT_COMPLETION_FLAGS completion_flags, IN DAT_IB_ADDR_HANDLE *remote_ah) { - ib_data_segment_t ds_array[DEFAULT_DS_ENTRIES]; - ib_data_segment_t *ds_array_p, *ds_array_start_p = NULL; struct ibv_send_wr wr; struct ibv_send_wr *bad_wr; + ib_data_segment_t *ds = (ib_data_segment_t *)local_iov; DAT_COUNT i, total_len; int ret; @@ -352,48 +301,25 @@ dapls_ib_post_ext_send ( ep_ptr, op_type, cookie, segments, local_iov, remote_iov, completion_flags, remote_ah); - dapl_dbg_log(DAPL_DBG_TYPE_EP, - " post_snd: ep %p cookie %p segs %d l_iov %p\n", - ep_ptr, cookie, segments, local_iov); - - if(segments <= DEFAULT_DS_ENTRIES) - ds_array_p = ds_array; - else - ds_array_start_p = ds_array_p = - dapl_os_alloc(segments * sizeof(ib_data_segment_t)); - - if (NULL == ds_array_p) - return (DAT_INSUFFICIENT_RESOURCES); - /* setup the work request */ wr.next = 0; wr.opcode = op_type; - wr.num_sge = 0; + wr.num_sge = segments; wr.send_flags = 0; wr.wr_id = (uint64_t)(uintptr_t)cookie; - wr.sg_list = ds_array_p; + wr.sg_list = ds; total_len = 0; - for (i = 0; i < segments; i++ ) { - if ( !local_iov[i].segment_length ) - continue; - - ds_array_p->addr = (uint64_t) local_iov[i].virtual_address; - ds_array_p->length = local_iov[i].segment_length; - ds_array_p->lkey = local_iov[i].lmr_context; - - dapl_dbg_log(DAPL_DBG_TYPE_EP, - " post_snd: lkey 0x%x va %p len %d\n", - ds_array_p->lkey, ds_array_p->addr, - ds_array_p->length ); - - total_len += ds_array_p->length; - wr.num_sge++; - ds_array_p++; - } - - if (cookie != NULL) + if (cookie != NULL) { + for (i = 0; i < segments; i++ ) { + 
dapl_dbg_log(DAPL_DBG_TYPE_EP, + " post_snd: lkey 0x%x va %p len %d\n", + ds->lkey, ds->addr, ds->length ); + total_len += ds->length; + ds++; + } cookie->val.dto.size = total_len; + } switch (op_type) { case OP_RDMA_WRITE_IMM: @@ -468,9 +394,6 @@ dapls_ib_post_ext_send ( ret = ibv_post_send(ep_ptr->qp_handle, &wr, &bad_wr); - if (segments > DEFAULT_DS_ENTRIES) - dapl_os_free(ds_array_start_p, segments * sizeof(ib_data_segment_t)); - if (ret) return( dapl_convert_errno(errno,"ibv_send") ); diff --git a/dapl/openib_scm/dapl_ib_util.h b/dapl/openib_scm/dapl_ib_util.h index 7011e7c..9ddf231 100644 --- a/dapl/openib_scm/dapl_ib_util.h +++ b/dapl/openib_scm/dapl_ib_util.h @@ -138,7 +138,7 @@ typedef enum } ib_cm_events_t; /* Operation and state mappings */ -typedef enum ibv_send_flags ib_send_op_type_t; +typedef int ib_send_op_type_t; typedef struct ibv_sge ib_data_segment_t; typedef enum ibv_qp_state ib_qp_state_t; typedef enum ibv_event_type ib_async_event_type; -- 1.5.2.5 From arlin.r.davis at intel.com Thu Apr 30 01:11:22 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Thu, 30 Apr 2009 01:11:22 -0700 Subject: [ofa-general] [PATCH] uDAPL v2: scm, cma: dat max_lmr_block_size is 32 bit, verbs max_mr_size is 64 bit Message-ID: mismatch of device attribute size restricts max_lmr_block_size to 32 bit value. Add check, if larger then limit to 4G-1 until DAT v2 spec changes. Consumers will need check max_lmr_virtual_address for actual max registration block size until attribute interface changes. 
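Editorial sketch, not part of the patch: the saturating conversion described above can be written as a small helper. It assumes only what the commit message states, namely a 64-bit source value and a 32-bit destination field; the helper name `clamp_mr_size_to_u32` is hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: squeeze a 64-bit maximum region size into a
 * 32-bit attribute. Values that do not fit saturate at 4G-1 instead
 * of silently wrapping around.
 */
static uint32_t clamp_mr_size_to_u32(uint64_t max_mr_size)
{
    return (max_mr_size >> 32) ? UINT32_MAX : (uint32_t)max_mr_size;
}
```

Without the clamp, a device reporting exactly 4G (0x100000000) would be truncated to 0, which is why saturating is the safer choice until the attribute is widened.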
Signed-off-by: Arlin Davis --- dapl/openib_cma/dapl_ib_util.c | 9 ++++++--- dapl/openib_scm/dapl_ib_util.c | 13 ++++++++----- 2 files changed, 14 insertions(+), 8 deletions(-) diff --git a/dapl/openib_cma/dapl_ib_util.c b/dapl/openib_cma/dapl_ib_util.c index 1545d78..3b83ab8 100755 --- a/dapl/openib_cma/dapl_ib_util.c +++ b/dapl/openib_cma/dapl_ib_util.c @@ -553,7 +553,9 @@ DAT_RETURN dapls_ib_query_hca(IN DAPL_HCA * hca_ptr, ia_attr->max_evd_qlen = dev_attr.max_cqe; ia_attr->max_iov_segments_per_dto = dev_attr.max_sge; ia_attr->max_lmrs = dev_attr.max_mr; - ia_attr->max_lmr_block_size = dev_attr.max_mr_size; + /* 32bit attribute from 64bit, 4G-1 limit, DAT v2 needs fix */ + ia_attr->max_lmr_block_size = + (dev_attr.max_mr_size >> 32) ? ~0 : dev_attr.max_mr_size; ia_attr->max_rmrs = dev_attr.max_mw; ia_attr->max_lmr_virtual_address = dev_attr.max_mr_size; ia_attr->max_rmr_target_address = dev_attr.max_mr_size; @@ -583,10 +585,11 @@ DAT_RETURN dapls_ib_query_hca(IN DAPL_HCA * hca_ptr, #endif dapl_log(DAPL_DBG_TYPE_UTIL, "dapl_query_hca: (ver=%x) ep's %d ep_q %d" - " evd's %d evd_q %d\n", + " evd's %d evd_q %d mr %u\n", ia_attr->hardware_version_major, ia_attr->max_eps, ia_attr->max_dto_per_ep, - ia_attr->max_evds, ia_attr->max_evd_qlen); + ia_attr->max_evds, ia_attr->max_evd_qlen, + ia_attr->max_lmr_block_size); dapl_log(DAPL_DBG_TYPE_UTIL, "dapl_query_hca: msg %llu rdma %llu iov's %d" " lmr %d rmr %d rd_in,out %d,%d inline=%d\n", diff --git a/dapl/openib_scm/dapl_ib_util.c b/dapl/openib_scm/dapl_ib_util.c index 13c07c9..c95b0c2 100644 --- a/dapl/openib_scm/dapl_ib_util.c +++ b/dapl/openib_scm/dapl_ib_util.c @@ -546,7 +546,9 @@ DAT_RETURN dapls_ib_query_hca(IN DAPL_HCA * hca_ptr, ia_attr->max_evd_qlen = dev_attr.max_cqe; ia_attr->max_iov_segments_per_dto = dev_attr.max_sge; ia_attr->max_lmrs = dev_attr.max_mr; - ia_attr->max_lmr_block_size = dev_attr.max_mr_size; + /* 32bit attribute from 64bit, 4G-1 limit, DAT v2 needs fix */ + ia_attr->max_lmr_block_size = + 
(dev_attr.max_mr_size >> 32) ? ~0 : dev_attr.max_mr_size; ia_attr->max_rmrs = dev_attr.max_mw; ia_attr->max_lmr_virtual_address = dev_attr.max_mr_size; ia_attr->max_rmr_target_address = dev_attr.max_mr_size; @@ -574,7 +576,7 @@ DAT_RETURN dapls_ib_query_hca(IN DAPL_HCA * hca_ptr, hca_ptr->ib_trans.named_attr.value = dapl_ib_mtu_str(hca_ptr->ib_trans.mtu); - dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + dapl_log(DAPL_DBG_TYPE_UTIL, " query_hca: (%x.%x) ep %d ep_q %d evd %d" " evd_q %d mtu %d\n", ia_attr->hardware_version_major, @@ -583,13 +585,14 @@ DAT_RETURN dapls_ib_query_hca(IN DAPL_HCA * hca_ptr, ia_attr->max_evds, ia_attr->max_evd_qlen, 128 << hca_ptr->ib_trans.mtu); - dapl_dbg_log(DAPL_DBG_TYPE_UTIL, + dapl_log(DAPL_DBG_TYPE_UTIL, " query_hca: msg %llu rdma %llu iov %d lmr %d rmr %d" - " ack_time %d\n", + " ack_time %d mr %u\n", ia_attr->max_message_size, ia_attr->max_rdma_size, ia_attr->max_iov_segments_per_dto, ia_attr->max_lmrs, ia_attr->max_rmrs, - hca_ptr->ib_trans.ack_timer); + hca_ptr->ib_trans.ack_timer, + ia_attr->max_lmr_block_size); } if (ep_attr != NULL) { -- 1.5.2.5 From arlin.r.davis at intel.com Thu Apr 30 01:53:54 2009 From: arlin.r.davis at intel.com (Davis, Arlin R) Date: Thu, 30 Apr 2009 01:53:54 -0700 Subject: [ofa-general] [ANNOUNCE] uDAPL v2.0 - dapl-2.0.19 release Message-ID: New release for uDAPL 2.0 available on the OFA download page and in my git tree. md5sum: c45cf419ee137555fb74f55d175517bb dapl-2.0.19.tar.gz Summary of changes: - scm, cma: dat max_lmr_block_size is 32 bit, verbs max_mr_size is 64 bit (bug#1613) - openib_scm, cma: use direct SGE mappings from dat_lmr_triplet to ibv_sge - dtest: add flush EVD call after data transfer errors - scm: change default mtu size to 2048 Vlad, please pull v2 package into OFED 1.4.1 RC4 and install the following: compat-dapl-1.2.14-1 compat-dapl-devel-1.2.14-1 dapl-2.0.19-1 dapl-utils-2.0.19-1 dapl-devel-2.0.19-1 dapl-debuginfo-2.0.19-1 See http://www.openfabrics.org/downloads/dapl/ more details. 
-arlin From BMT at zurich.ibm.com Thu Apr 30 03:08:54 2009 From: BMT at zurich.ibm.com (Bernard Metzler) Date: Thu, 30 Apr 2009 12:08:54 +0200 Subject: [ofa-general] Re: adding a purely software based RDMA driver Message-ID: Roland, Or, absolutely right. We neither intend nor are willing or entitled to dump a 'softiwarp product' here; we are still in the internal process of getting something open sourced soon, to be reviewed by the community and eventually added to the mainline Linux kernel when appropriate. Philip was asking because building an rpm for the OFED installation procedure might be helpful for our internal usage at this point in time, but it is not a high priority. So, let's discuss the code once it is here. Many thanks, bernard. From vlad at dev.mellanox.co.il Thu Apr 30 03:11:43 2009 From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky) Date: Thu, 30 Apr 2009 13:11:43 +0300 Subject: [ofa-general] Re: [ANNOUNCE] uDAPL v2.0 - dapl-2.0.19 release In-Reply-To: References: Message-ID: <49F9795F.70508@dev.mellanox.co.il> Davis, Arlin R wrote: > New release for uDAPL 2.0 available on the OFA download page and in my git tree. > > md5sum: c45cf419ee137555fb74f55d175517bb dapl-2.0.19.tar.gz > > Summary of changes: > > - scm, cma: dat max_lmr_block_size is 32 bit, verbs max_mr_size is 64 bit (bug#1613) > - openib_scm, cma: use direct SGE mappings from dat_lmr_triplet to ibv_sge > - dtest: add flush EVD call after data transfer errors > - scm: change default mtu size to 2048 > > Vlad, please pull v2 package into OFED 1.4.1 RC4 and install the following: > > compat-dapl-1.2.14-1 > compat-dapl-devel-1.2.14-1 > dapl-2.0.19-1 > dapl-utils-2.0.19-1 > dapl-devel-2.0.19-1 > dapl-debuginfo-2.0.19-1 > > See http://www.openfabrics.org/downloads/dapl/ more details.
> > -arlin > > Done, Regards, Vladimir From vlad at lists.openfabrics.org Thu Apr 30 03:22:28 2009 From: vlad at lists.openfabrics.org (Vladimir Sokolovsky Mellanox) Date: Thu, 30 Apr 2009 03:22:28 -0700 (PDT) Subject: [ofa-general] ofa_1_4_kernel 20090430-0200 daily build status Message-ID: <20090430102228.DC17AE615AF@openfabrics.org> This email was generated automatically, please do not reply git_url: git://git.openfabrics.org/ofed_1_4/linux-2.6.git git_branch: ofed_kernel Common build parameters: Passed: Passed on i686 with linux-2.6.16 Passed on i686 with linux-2.6.17 Passed on i686 with linux-2.6.18 Passed on i686 with linux-2.6.19 Passed on i686 with linux-2.6.21.1 Passed on i686 with linux-2.6.22 Passed on i686 with linux-2.6.24 Passed on i686 with linux-2.6.26 Passed on i686 with linux-2.6.27 Passed on x86_64 with linux-2.6.16 Passed on x86_64 with linux-2.6.16.21-0.8-smp Passed on x86_64 with linux-2.6.16.43-0.3-smp Passed on x86_64 with linux-2.6.18 Passed on x86_64 with linux-2.6.17 Passed on x86_64 with linux-2.6.16.60-0.21-smp Passed on x86_64 with linux-2.6.18-1.2798.fc6 Passed on x86_64 with linux-2.6.18-128.el5 Passed on x86_64 with linux-2.6.18-53.el5 Passed on x86_64 with linux-2.6.19 Passed on x86_64 with linux-2.6.18-8.el5 Passed on x86_64 with linux-2.6.18-93.el5 Passed on x86_64 with linux-2.6.22 Passed on x86_64 with linux-2.6.21.1 Passed on x86_64 with linux-2.6.20 Passed on x86_64 with linux-2.6.25 Passed on x86_64 with linux-2.6.24 Passed on x86_64 with linux-2.6.22.5-31-default Passed on x86_64 with linux-2.6.9-42.ELsmp Passed on x86_64 with linux-2.6.26 Passed on x86_64 with linux-2.6.27 Passed on x86_64 with linux-2.6.9-55.ELsmp Passed on x86_64 with linux-2.6.9-78.ELsmp Passed on x86_64 with linux-2.6.9-67.ELsmp Passed on ia64 with linux-2.6.16 Passed on ia64 with linux-2.6.16.21-0.8-default Passed on ia64 with linux-2.6.19 Passed on ia64 with linux-2.6.18 Passed on ia64 with linux-2.6.17 Passed on ia64 with linux-2.6.22 Passed on ia64 
with linux-2.6.23 Passed on ia64 with linux-2.6.21.1 Passed on ia64 with linux-2.6.24 Passed on ia64 with linux-2.6.25 Passed on ia64 with linux-2.6.26 Passed on ppc64 with linux-2.6.16 Passed on ppc64 with linux-2.6.17 Passed on ppc64 with linux-2.6.18 Passed on ppc64 with linux-2.6.18-8.el5 Passed on ppc64 with linux-2.6.19 Failed: From truelove at array.ca Thu Apr 30 05:49:30 2009 From: truelove at array.ca (Steven Truelove) Date: Thu, 30 Apr 2009 08:49:30 -0400 Subject: [ofa-general] Re: New proposal for memory management In-Reply-To: <49f8fea6.LDLCEFIRlnHmS++4%john.gregor@qlogic.com> References: <3E350D1F-BADB-4B7B-8307-BE0DA085DBF3@cisco.com> <1240956668.3403.324.camel@chromite.mv.qlogic.com> <1241039058.3403.369.camel@chromite.mv.qlogic.com> <6B60B4BD-FB97-45DD-94DB-51883E05F14F@cisco.com> <49f8fea6.LDLCEFIRlnHmS++4%john.gregor@qlogic.com> Message-ID: <49F99E5A.6030207@array.ca> John A. Gregor wrote: > So, how about this: > > Maintain a pool of pre-pinned pages. > > When an RTS comes in, use one of the pre-pinned buffers as the place the > DATA will land. Set up the remaining hw context to enable receipt into > the page(s) and fire back your CTS. > > While the CTS is in flight and the DATA is streaming back (and you > therefore have a couple microseconds to play with), remap the virt-to-phys > mapping of the application so that the original virtual address now > points at the pre-pinned page. A big part of the performance improvement associated with RDMA is avoiding constant page remappings and data copies. If pinning the physical/virtual memory mapping was cheap enough to do this for each message, MPI applications could simply pin and register the mapping when sending/receiving each message and then unmap when the operation was complete. MPI implementations maintain a cache of what memory has been registered because it is too expensive to map/unmap/remap memory constantly. 
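Editorial sketch, not from the original message: the registration cache idea mentioned above can be illustrated with a toy lookup table. This is a minimal model under stated assumptions, not how any particular MPI implementation does it. `fake_register` stands in for an expensive pin-and-register call, and every name here (`reg_cache`, `reg_entry`, `reg_cache_get`, `REG_CACHE_SLOTS`) is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define REG_CACHE_SLOTS 16

struct reg_entry {
    uintptr_t start;  /* first byte covered by the registration */
    size_t    len;    /* number of bytes covered */
    int       in_use; /* nonzero once the slot holds a registration */
};

struct reg_cache {
    struct reg_entry slot[REG_CACHE_SLOTS];
    unsigned registrations; /* how many times the slow path ran */
};

/* Stand-in for an expensive pin-and-register operation. */
static void fake_register(struct reg_cache *c)
{
    c->registrations++;
}

/*
 * Returns 1 on a cache hit (buffer already covered by a registration),
 * 0 if a new registration had to be made, -1 if the cache is full.
 */
static int reg_cache_get(struct reg_cache *c, const void *buf, size_t len)
{
    uintptr_t start = (uintptr_t)buf;
    int free_slot = -1;

    for (int i = 0; i < REG_CACHE_SLOTS; i++) {
        struct reg_entry *e = &c->slot[i];

        if (!e->in_use) {
            if (free_slot < 0)
                free_slot = i;
            continue;
        }
        /* Hit only if [start, start+len) lies inside the cached range. */
        if (start >= e->start &&
            start - e->start <= e->len &&
            len <= e->len - (start - e->start))
            return 1;
    }
    if (free_slot < 0)
        return -1;

    fake_register(c);
    c->slot[free_slot] = (struct reg_entry){ start, len, 1 };
    return 0;
}
```

The sketch deliberately omits eviction and invalidation. Invalidation is the hard part discussed in this thread: a cached entry becomes stale as soon as the virtual range is returned to the OS and later mapped to different physical pages.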
Copying parts of the page(s) not involved in the transfer would also raise overhead quite a bit for smaller RDMAs. It is quite easy to see a 5 or 6K message requiring a 2-3K copy to fix the rest of a page. And heaven help those systems with huge pages, ~1MB, in such a case. I have seen this problem in our own MPI application. The 'simple' solution I have seen used in at least one MPI implementation for this problem is to prevent the malloc/free implementation being used from ever returning memory to the OS. The virtual/physical mapping can only become invalid if virtual addresses are given back to the OS, then returned with different physical pages. Under Linux, at least, it is quite easy to tell libc to never return memory to the OS. In this case free() and similar functions will simply retain the memory for use with future malloc (and similar) calls. Because the memory is never unpinned and never given back to the OS, the physical/virtual mapping is consistent forever. I don't know if other OSes make this as easy, or even what systems most MPI implementors want their software to run on. The obvious downside to this is that a process with highly irregular memory demand will always have the memory usage of its previous peak. And because the memory is pinned, it will not even be swapped out, and will count against the memory pinning ulimit. For many MPI applications that is not a problem -- they often have quite fixed memory usage and wouldn't be returning much if any memory to the OS anyway. This is the case for our application. I imagine someone out there has some job that doesn't behave so neatly, of course.
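Editorial sketch, not from the original message: on glibc, the usual way to ask malloc to hold on to memory is a pair of mallopt() hints. `M_TRIM_THRESHOLD` and `M_MMAP_MAX` are glibc parameters; the wrapper name `disable_heap_release` is hypothetical, and whether these are the exact settings the author had in mind is an assumption. A reply later in this thread points out that glibc can still return memory to the OS on some code paths even with such hints, so treat this as best effort only.

```c
#include <malloc.h>

/*
 * Hypothetical wrapper around two glibc mallopt() hints.
 * Returns 1 if glibc accepted both settings, 0 otherwise.
 */
static int disable_heap_release(void)
{
    /* -1 disables trimming of free memory at the top of the heap. */
    int ok_trim = mallopt(M_TRIM_THRESHOLD, -1);
    /* 0 stops malloc from using mmap() for large requests, whose
     * pages would otherwise be unmapped again on free(). */
    int ok_mmap = mallopt(M_MMAP_MAX, 0);

    return ok_trim && ok_mmap;
}
```

The call would need to happen early, before the first large allocation, and it is specific to glibc; other C libraries may not provide mallopt() at all.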
Steven Truelove From jsquyres at cisco.com Thu Apr 30 06:19:06 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 30 Apr 2009 09:19:06 -0400 Subject: [ofa-general] New proposal for memory management In-Reply-To: References: Message-ID: <48644923-A2EF-4C4F-8F9A-B1658BAA36FE@cisco.com> On Apr 30, 2009, at 1:30 AM, Aaron Fabbri (aafabbri) wrote: > Have you considered changing the MPI API to require applications to > use MPI to > allocate any/all buffers that may be used for network I/O? That is, > instead of > calling malloc() et al., call a new mpi_malloc() which allocates > from pre- > registered memory. Yes, MPI_ALLOC_MEM / MPI_FREE_MEM calls have been around for a long time (~10 years?). Using them does avoid many of the problems that have been discussed. Most (all?) MPI's either support ALLOC_MEM / FREE_MEM by registering at allocation time and unregistering at free time, or some variation of that. But unfortunately, very few MPI apps use these calls; they use malloc() and friends instead. Or they're written in Fortran, where such concepts are not easily mapped (don't underestimate how much Fortran MPI code runs on verbs!). Indeed, in some layered scenarios, it's not easy to use these calls (e.g., if an MPI-enabled computational library may re-use user-provided buffers because they're so large, etc.). 
-- Jeff Squyres Cisco Systems From arkady.kanevsky at gmail.com Thu Apr 30 06:24:51 2009 From: arkady.kanevsky at gmail.com (arkady kanevsky) Date: Thu, 30 Apr 2009 09:24:51 -0400 Subject: [ofa-general] Re: New proposal for memory management In-Reply-To: <49F99E5A.6030207@array.ca> References: <3E350D1F-BADB-4B7B-8307-BE0DA085DBF3@cisco.com> <1240956668.3403.324.camel@chromite.mv.qlogic.com> <1241039058.3403.369.camel@chromite.mv.qlogic.com> <6B60B4BD-FB97-45DD-94DB-51883E05F14F@cisco.com> <49f8fea6.LDLCEFIRlnHmS++4%john.gregor@qlogic.com> <49F99E5A.6030207@array.ca> Message-ID: <517c62fb0904300624w16c530b6ib8cc197c68fffa2a@mail.gmail.com> Jeff, have you considered the notion of buffers and buffer iteration introduced by MPI/RT (The Real-Time Message Passing Interface Standard, in Concurrency and Computation: Practice and Experience, Volume 16, No. S1, pp. S1-S332, Dec 2004; see Chapter 5)? It basically sets up a contract of buffer (and underlying memory) ownership between the MPI implementation and the user. Arkady On Thu, Apr 30, 2009 at 8:49 AM, Steven Truelove wrote: > > > John A. Gregor wrote: > >> So, how about this: >> >> Maintain a pool of pre-pinned pages. >> >> When an RTS comes in, use one of the pre-pinned buffers as the place the >> DATA will land. Set up the remaining hw context to enable receipt into >> the page(s) and fire back your CTS. >> >> While the CTS is in flight and the DATA is streaming back (and you >> therefore have a couple microseconds to play with), remap the virt-to-phys >> mapping of the application so that the original virtual address now >> points at the pre-pinned page. >> > > A big part of the performance improvement associated with RDMA is avoiding > constant page remappings and data copies. If pinning the physical/virtual > memory mapping was cheap enough to do this for each message, MPI > applications could simply pin and register the mapping when > sending/receiving each message and then unmap when the operation was > complete.
MPI implementations maintain a cache of what memory has been > registered because it is too expensive to map/unmap/remap memory constantly. > > Copying parts of the page(s) not involved in the transfer would also raise > overhead quite a bit for smaller RDMAs. It is quite easy to see a 5 or 6K > message requiring a 2-3K copy to fix the rest of a page. And heaven help > those systems with huge pages, ~1MB, in such a case. > > I have seen this problem in our own MPI application. The 'simple' solution > I have seen used in at least one MPI implementation for this problem is to > prevent the malloc/free implementation being used from ever returning memory > to the OS. The virtual/physical mapping can only become invalid if virtual > addresses are given back to the OS, then returned with different physical > pages. Under Linux with at least, it is quite easy to tell libc to never > return memory to the OS. In this case free() and similar functions will > simply retain the memory for use with future malloc (and similar) calls. > Because the memory is never unpinned and never given back to the OS, the > physical virtual mapping is consistent forever. I don't if other OSes make > this as easy, or even what systems most MPI implementors want their software > to run on. > > The obvious downside to this is that a process with highly irregular memory > demand will always have the memory usage of its previous peak. And because > the memory is pinned, it will not even be swapped out, and will count > against the memory pinning ulimit. For many MPI applications that is not a > problem -- they often have quite fixed memory usage and wouldn't be > returning much if any memory to the OS anyway. This is the case for our > application. I imagine someone out there has some job that doesn't behave > so neatly, of course. 
> > > Steven Truelove > > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -- Cheers, Arkady Kanevsky From ogerlitz at voltaire.com Thu Apr 30 06:27:05 2009 From: ogerlitz at voltaire.com (Or Gerlitz) Date: Thu, 30 Apr 2009 16:27:05 +0300 Subject: [ofa-general] Re: [PATCH v2] rdma_cm: Add debugfs entries to monitor rdma_cm connections In-Reply-To: <20090427162349.GI4431@obsidianresearch.com> References: <49F05AAE.4020606@Voltaire.COM> <90A07B5E8FAD49C187EAE583A976A3CD@amr.corp.intel.com> <49F42D40.5000200@Voltaire.COM> <49F5A2EC.3050807@Voltaire.com> <49F5AED6.4070208@Voltaire.COM> <49F5AFEA.5090003@voltaire.com> <20090427162349.GI4431@obsidianresearch.com> Message-ID: <49F9A729.3090904@voltaire.com> Jason Gunthorpe wrote: > including a PID is not best, you should include enough information to figure out the pid(s) from proc/xx/fd, and vice versa. Maybe it's not the best solution, but it seems good enough to me. Or. From jsquyres at cisco.com Thu Apr 30 06:38:55 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 30 Apr 2009 09:38:55 -0400 Subject: [ofa-general] Re: New proposal for memory management In-Reply-To: <49F99E5A.6030207@array.ca> References: <3E350D1F-BADB-4B7B-8307-BE0DA085DBF3@cisco.com> <1240956668.3403.324.camel@chromite.mv.qlogic.com> <1241039058.3403.369.camel@chromite.mv.qlogic.com> <6B60B4BD-FB97-45DD-94DB-51883E05F14F@cisco.com> <49f8fea6.LDLCEFIRlnHmS++4%john.gregor@qlogic.com> <49F99E5A.6030207@array.ca> Message-ID: <1D6E9A08-DA80-4F13-B23C-243A273E2936@cisco.com> On Apr 30, 2009, at 8:49 AM, Steven Truelove wrote: > I have seen this problem in our own MPI application.
The 'simple' > solution I have seen used in at least one MPI implementation for this > problem is to prevent the malloc/free implementation being used from > ever returning memory to the OS. The virtual/physical mapping can > only > become invalid if virtual addresses are given back to the OS, then > returned with different physical pages. Under Linux at least, it > is quite easy to tell libc to never return memory to the OS. > Unfortunately, this is a false assumption. There are definitely code paths in glibc where, even if you use the mallopt() hints, memory *can* (will) be returned to the OS. This led Open MPI to change its memory allocation / intercept scheme in 1.3.2. See: http://www.open-mpi.org/community/lists/announce/2009/03/0029.php -- Jeff Squyres Cisco Systems From jsquyres at cisco.com Thu Apr 30 06:45:54 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 30 Apr 2009 09:45:54 -0400 Subject: [ofa-general] New proposal for memory management In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F42B5C7D05@orsmsx507.amr.corp.intel.com> References: <382A478CAD40FA4FB46605CF81FE39F42B5C7C5C@orsmsx507.amr.corp.intel.com> <382A478CAD40FA4FB46605CF81FE39F42B5C7D05@orsmsx507.amr.corp.intel.com> Message-ID: <48EA395C-FCFF-4BB7-8575-93B6DB7DCC52@cisco.com> On Apr 29, 2009, at 6:52 PM, Woodruff, Robert J wrote: > I think if we did such a thing, we could implement a set of tag- > matching > primitives (similar to MX or PSM) that are kind of a separate library > from the OFA RDMA verbs, just like PSM for Qlogic is a separate > library and > not part of the OFA verbs. Just like with MX and PSM, I think the > registration > can be done by the tag-matching driver (like PSM or MX do) and > not require MPI to do it. Think of this as "the MPI tag-matching > interface" library > for OFA. > I would be extremely hesitant to have an OpenFabrics-provided library do this.
MPI implementations spend a *lot* of time and effort on this section of code because it is *the* heart of the MPI message passing engine. To be blunt: there is not enough MPI expertise in the current set of OpenFabrics developers to build such a library. I doubt that the academic and proprietary MPI implementations would want to contribute resources to make one, either (it's their secret sauce!). Indeed, to make such a proposal work, there would, by definition, have to be new hardware capabilities, and therefore new verbs to support those hardware capabilities. So this might just end up as new verbs anyway -- not a new middleware library. -- Jeff Squyres Cisco Systems From jsquyres at cisco.com Thu Apr 30 06:52:32 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 30 Apr 2009 09:52:32 -0400 Subject: [ofa-general] Re: New proposal for memory management In-Reply-To: <20090429224411.GC32114@obsidianresearch.com> References: <20090429215508.GW4431@obsidianresearch.com> <20090429222125.GX4431@obsidianresearch.com> <1241044080.3403.374.camel@chromite.mv.qlogic.com> <20090429224411.GC32114@obsidianresearch.com> Message-ID: <23635E11-F18E-4799-9B6E-C3163000A3A3@cisco.com> I think Jason is the only one who is remaining at least somewhat on-topic here. My goal for this thread was to open discussion about solving the broken memory management model for OpenFabrics. Specifically, I'm looking for a software solution for today's hardware (both IB and iWARP). While MPI is currently the biggest victim, this broken memory management model is also an enormous roadblock for any other application or ULP to write to verbs. On-demand paging, while the desired end result sounds great, will definitely require new hardware and will likely be a very complex design conversation that takes a long, long time. That's great if my proposal [finally] seriously inspires people to tackle this problem (and/or other hardware-assisted ideas), but I consider those to be longer-term solutions.
I'm looking for a) a much shorter-term solution that is b) workable on current hardware. Thanks. On Apr 29, 2009, at 6:44 PM, Jason Gunthorpe wrote: > On Wed, Apr 29, 2009 at 03:28:00PM -0700, Ralph Campbell wrote: > > > Besides, mmap() only allocates a virtual address range in the user's > > address space. It doesn't fault in all the pages into physical > memory. > > That happens when the application tries to read or write memory in > > the VA range of the mmap. The IB memory registrations need physical > > addresses and it would be impractical to do this for every mmap or > > brk. > > If your goal is to keep the mr consistent then you only need to fault > and pin pages from the new mmap that intersect with pre-existing > memory registrations. > > I chucked out 3 things to consider: > - Pin and register all process memory (no swap!) > - Keep the MR consistent by pinning and registering new mmaps > that intersect with pre-existing memory registrations > - Keep the MR consistent by preventing the kernel from returning > new mmaps that overlap existing memory registrations. > > Jason > -- Jeff Squyres Cisco Systems From pashash at gmail.com Thu Apr 30 07:35:05 2009 From: pashash at gmail.com (Pavel Shamis (Pasha)) Date: Thu, 30 Apr 2009 17:35:05 +0300 Subject: [ofa-general] New proposal for memory management In-Reply-To: References: Message-ID: <49F9B719.30406@dev.mellanox.co.il> Barrett, Brian W wrote: > Jeff and I talked for a while today, and we're pretty sure that as long as > the byte set by the kernel notifier is written before the pages are returned > into the unallocated list, there isn't actually a race condition. It does > mean that every time the page cache is searched, we also have to check the > byte (and likely take a cache miss), but that's not too evil. > > However, there's still then the problem with the notifier concept of how the > kernel passes which pages were given back to the kernel. 
It has to pass a > (potentially very large) amount of data back to the user, so the memory > ownership issues with kernel/user space are interesting. It also has to > somewhat atomically prepare the list and unset the notifier byte, which is > also problematic. But probably workable. > It sounds like we will have another 5k lines of code in MPI that will try to resolve the kernel/user notification issue :-) IMHO, let's avoid all these tricks and move the registration cache to the kernel. Pasha From jsquyres at cisco.com Thu Apr 30 07:39:19 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 30 Apr 2009 10:39:19 -0400 Subject: [ofa-general] New proposal for memory management In-Reply-To: References: Message-ID: On Apr 29, 2009, at 4:45 PM, Barrett, Brian W wrote: > If you think this sounds like a hassle, think about what it looks > like from > the point of view of the MPI implementer (or any other developer > writing > libraries which sit between user data and OFED, like GASNet). > If you don't care about what pain MPI implementors have to go through (and you probably don't ;-) ) -- consider that this is a major roadblock to most *anyone* who wants to write to user verbs. I heard lots of variations of "Why isn't OFED more popular?" in Sonoma this year. This is at least one big reason why: no normal (non-superhuman) programmer can write verbs code (IMHO). MPIs *have* to support OpenFabrics -- HPC customers demand it. But non-HPC customers have a clear alternative: they'll just write sockets code. And the price/performance for using sockets over IB/iWARP may or may not be attractive depending on the customer's buying capacity. Hence -- they just buy gigE (10gigE, when the price drops low enough). Doesn't OpenFabrics want to grow beyond MPI? Woody said that verbs is designed to support a billion different things -- outside of MPI and a few storage protocols (none of which are widely adopted), how much is OFED used?
> Jeff and I talked for a while today, and we're pretty sure that as > long as > the byte set by the kernel notifier is written before the pages are > returned > into the unallocated list, there isn't actually a race condition. > [snip] > > However, there's still then the problem with the notifier concept of > how the > kernel passes which pages were given back to the kernel. It has to > pass a > (potentially very large) amount of data back to the user, so the > memory > ownership issues with kernel/user space are interesting. It also > has to > somewhat atomically prepare the list and unset the notifier byte, > which is > also problematic. But probably workable. > I feel compelled to amend this: this notifier concept *may be workable*, but it's still quite complex for the reasons Brian cited. The goal here is to *reduce* complexity, especially for applications/ULPs using the verbs stack. If we put the registration cache in the network stack, application/ULP complexity will be reduced significantly. My $0.02 is that using a notifier solution is still fairly complex and introduces a new set of problems. FWIW: Putting the registration cache in the userspace verbs stack means that verbs will now have to do the horrid malloc/mmap/etc. intercept tricks that MPI implementations currently do. Take it from us -- this is not a business you want to be in. Such intercepts break tools like valgrind and other memory-checking debuggers. Even the best intercept hooks available today can still be subverted. Open MPI (and MX!) has to insert a pre-main hook to set up these intercepts, and then check later to ensure that no one else subverted our hooks. Yuck. It's memory management. And that belongs in the kernel.
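To make the registration-cache discussion above concrete, here is a minimal Python sketch of the bookkeeping involved. It is a toy model under stated assumptions, not any MPI implementation's or OFED's code: `register` and `deregister` are hypothetical stand-ins for an expensive pin-and-register call and its inverse, and `notify_unmapped` is a hypothetical entry point for whatever mechanism (intercept hook or kernel notifier) reports that an address range went back to the OS.

```python
class RegistrationCache:
    """Toy registration cache; all names here are illustrative assumptions."""

    def __init__(self, register, deregister):
        self._register = register      # assumed expensive: pins pages, programs the adapter
        self._deregister = deregister  # assumed inverse of register
        self._entries = {}             # (base, size) -> registration handle
        self.hits = 0
        self.misses = 0

    def lookup(self, addr, length):
        """Return a handle covering [addr, addr + length), registering on a miss."""
        for (base, size), handle in self._entries.items():
            if base <= addr and addr + length <= base + size:
                self.hits += 1
                return handle
        self.misses += 1
        handle = self._register(addr, length)
        self._entries[(addr, length)] = handle
        return handle

    def notify_unmapped(self, addr, length):
        """Drop every cached registration that overlaps the released range."""
        stale = [key for key in self._entries
                 if key[0] < addr + length and addr < key[0] + key[1]]
        for key in stale:
            self._deregister(self._entries.pop(key))
        return len(stale)


# Fake registration functions so the sketch runs without any RDMA hardware.
registered = []
released = []


def fake_register(addr, length):
    handle = ("mr", len(registered))
    registered.append((addr, length))
    return handle


cache = RegistrationCache(fake_register, released.append)
h1 = cache.lookup(0x1000, 4096)      # miss: first use of this buffer
h2 = cache.lookup(0x1800, 1024)      # hit: lies inside the cached range
cache.notify_unmapped(0x1000, 4096)  # the range went back to the OS
h3 = cache.lookup(0x1000, 4096)      # miss again: must re-register
```

If `notify_unmapped` is never called, for example because an intercept hook was subverted, the last `lookup` would hand back the stale handle even though the virtual range may now map to different physical pages. That is the failure mode this thread is about, wherever the cache ends up living.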
-- Jeff Squyres Cisco Systems From alexander.supalov at intel.com Thu Apr 30 08:03:13 2009 From: alexander.supalov at intel.com (Supalov, Alexander) Date: Thu, 30 Apr 2009 16:03:13 +0100 Subject: [ofa-general] New proposal for memory management In-Reply-To: References: Message-ID: <928CFBE8E7CB0040959E56B4EA41A77E9BB66D10@irsmsx504.ger.corp.intel.com> Hi, Memory registration caching has a direct effect on application performance. Can we guarantee, when putting the caching into the kernel, that the algorithms used will be good for all apps? How will one control their parameters at runtime? Will one be able to change the algorithm if necessary? Best regards. Alexander From robert.j.woodruff at intel.com Thu Apr 30 10:09:19 2009 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 30 Apr 2009 10:09:19 -0700 Subject: [ofa-general] New proposal for memory management In-Reply-To: <48EA395C-FCFF-4BB7-8575-93B6DB7DCC52@cisco.com> References: <382A478CAD40FA4FB46605CF81FE39F42B5C7C5C@orsmsx507.amr.corp.intel.com> <382A478CAD40FA4FB46605CF81FE39F42B5C7D05@orsmsx507.amr.corp.intel.com> <48EA395C-FCFF-4BB7-8575-93B6DB7DCC52@cisco.com> Message-ID: <382A478CAD40FA4FB46605CF81FE39F42B5C8285@orsmsx507.amr.corp.intel.com> Jeff wrote, >I would be extremely hesitant to have an OpenFabrics-provided library >do this. MPI implementations spend a *lot* of time and effort on this >section of code because it is *the* heart of the MPI message passing >engine. To be blunt: there is not enough MPI expertise in the current >set of OpenFabrics developers to build such a library. I doubt that >the academic and proprietary MPI implementations would want to >contribute resources to make one, either (it's their secret sauce!).
Interesting that you would want the OFA developers to implement a memory registration cache and think they could manage the registration of MPI memory better than MPI can, but then say that tag-matching drivers in MPI are their secret sauce. Seems like registration caching is also part of the various MPIs' secret sauce. >Indeed, to make such a proposal work, there would, by definition, have >to be new hardware capabilities, and therefore new verbs to support >those hardware capabilities. So this might just end up as new verbs >anyway -- not a new middleware library. Yes, new hardware capabilities would be needed for this, and it is always hard to get new hardware features added, but if they were added to some future IBTA or iWARP spec, I think it would be good for MPIs, as we have seen that this is the way other interconnects like Myrinet can achieve good performance for MPI applications. Anyway, just thought I would bring it up as a possibility for solving some of the issues that you raised at Sonoma. woody From bwbarre at sandia.gov Thu Apr 30 08:08:30 2009 From: bwbarre at sandia.gov (Barrett, Brian W) Date: Thu, 30 Apr 2009 09:08:30 -0600 Subject: [ofa-general] New proposal for memory management In-Reply-To: <49F9B719.30406@dev.mellanox.co.il> Message-ID: On 4/30/09 8:35 , "Pavel Shamis (Pasha)" wrote: > Barrett, Brian W wrote: >> Jeff and I talked for a while today, and we're pretty sure that as long as >> the byte set by the kernel notifier is written before the pages are returned >> into the unallocated list, there isn't actually a race condition. It does >> mean that every time the page cache is searched, we also have to check the >> byte (and likely take a cache miss), but that's not too evil. >> >> However, there's still then the problem with the notifier concept of how the >> kernel passes which pages were given back to the kernel.
It has to pass a >> (potentially very large) amount of data back to the user, so the memory >> ownership issues with kernel/user space are interesting. It also has to >> somewhat atomically prepare the list and unset the notifier byte, which is >> also problematic. But probably workable. >> > It sounds like we will have another 5k lines of code in MPI that will > try to resolve > the kernel/user notification issue :-) > IMHO, let's avoid all these tricks and move the registration cache to the kernel. I don't disagree - this is a problem best solved in kernel space, preferably using the approach Jeff originally proposed. I think the complexity of handling the notifier callback will be fairly high, but it is still an improvement over where we are today. Today's user-space memory hooks need to go - they cause too many problems for both the MPI and the application. Brian -- Brian W. Barrett Dept. 1423: Scalable System Software Sandia National Laboratories From bwbarre at sandia.gov Thu Apr 30 10:24:18 2009 From: bwbarre at sandia.gov (Barrett, Brian W) Date: Thu, 30 Apr 2009 11:24:18 -0600 Subject: [ofa-general] New proposal for memory management In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F42B5C8285@orsmsx507.amr.corp.intel.com> Message-ID: On 4/30/09 11:09 , "Woodruff, Robert J" wrote: > Jeff wrote, >> I would be extremely hesitant to have an OpenFabrics-provided library >> do this. MPI implementations spend a *lot* of time and effort on this >> section of code because it is *the* heart of the MPI message passing >> engine. To be blunt: there is not enough MPI expertise in the current >> set of OpenFabrics developers to build such a library. I doubt that >> the academic and proprietary MPI implementations would want to >> contribute resources to make one, either (it's their secret sauce!).
> > Interesting that you would want the OFA developers to implement a > memory registration cache and think they could manage the registration > of MPI memory better than MPI can, but then say that tag-matching drivers > in MPI are their secret sauce. Seems like registration caching is also > some of various MPI's secret sauce. I somewhat disagree with Jeff - I'd love to see OFA implement tag-matching, as we MPI implementors can (optionally) use it to save development time and then pound on the hardware guys until they actually implement proper tag matching and offload in hardware. But my goals are driven by a slightly different market than the rest of the planet (ie, huge machines that actually work when running a single 10k-20k process job). The registration caching isn't really secret sauce. It's more like the residue that forms around the cap to the secret sauce bottle. We have to do it to get good performance, it doesn't work reliably, and we can't fix it. I have all the information I need to do tag matching properly on the main processor. I don't have all the information I need to write a registration cache. I can't reliably know when memory is going back to the OS (because there still isn't a 100% foolproof way of intercepting when memory is given back to the OS). It's also used in long messages, where a couple hundred nanoseconds of added latency aren't critical, as opposed to tag matching, which is in the critical path of short messages and a couple extra tens of nanoseconds is a deal breaker. >> Indeed, to make such a proposal work, there would, by definition, have >> to be new hardware capabilities, and therefore new verbs to support >> those hardware capabilities. So this might just end up as new verbs >> anyway -- not a new middleware library. 
> > Yes, new hardware capabilities would be needed for this and it is always > hard to get new hardware features added, but if they were added to some > future IBTA or iWarp spec, I think it would be good for MPIs, as we have > seen that this is the way other interconnects like myrinet can achieve good > performance for MPI applications. > > Anyway, just thought I would bring it up as a possibility for solving > some of the issues that you raised at Sonoma. I don't think it actually solves any of the problems. Assuming it's like other verbs, you still have to deal with memory registration caches. There's still registered memory somewhere, so fork() is still going to be problematic. While it might not require an RC QP, it's going to require some kind of QP, so CM setup is still a problem. API Portability will still be a problem. It might solve the reliable connectionless problem, since I could envision a new QP type to support the tag matching. And there's still the problem of how unexpected receives are handled and how much space it takes. In short, while I'd love to see tag matching, I'd rather make sure all the other issues get solved properly first. Otherwise, we've just added another interface that drives me up the wall. Brian -- Brian W. Barrett Dept. 1423: Scalable System Software Sandia National Laboratories From robert.j.woodruff at intel.com Thu Apr 30 10:37:52 2009 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 30 Apr 2009 10:37:52 -0700 Subject: [ofa-general] New proposal for memory management In-Reply-To: References: <382A478CAD40FA4FB46605CF81FE39F42B5C8285@orsmsx507.amr.corp.intel.com> Message-ID: <382A478CAD40FA4FB46605CF81FE39F42B5C8301@orsmsx507.amr.corp.intel.com> Brian wrote, >In short, while I'd love to see tag matching, I'd rather make sure all the >other issues get solved properly first. Otherwise, we've just added another >interface that drives me up the wall. 
Well, I agree that the other issues will need to be solved in any case, as even if we were able to get tag-matching into all the hardware, the lead time for this would be very long, so MPIs are going to have to deal with using the current verbs for the foreseeable future. I am still not sure that having the OFA kernel developers create and manage a registration cache in the kernel is a good idea. As Alexander pointed out, I am not sure that a one-size-fits-all memory registration cache can be developed that meets the needs of all applications. Of course, if memory registration were not so expensive and could simply be done for each operation without a big performance penalty, then we would not need caching at all. So another way to fix this is to just fix the hardware so memory registration can be done in the speed path, but this is unfortunately something that will likely not happen either :-( woody From Bill.Boas at openfabrics.org Thu Apr 30 10:39:03 2009 From: Bill.Boas at openfabrics.org (Bill Boas) Date: Thu, 30 Apr 2009 10:39:03 -0700 Subject: [ofa-general] RE: RDMA tutorial and OFA In-Reply-To: <49F8B149.2050904@oracle.com> References: <49F8A59C.3070001@oracle.com> <49F8B149.2050904@oracle.com> Message-ID: <11F9001990984F12B589A718D8976045@BillGWAYLAPTOP> Richard, Andy, Thanks for copying me, Richard. I had not seen Andy's email on the general list. Figuring out how to get tutorial and other documentation created and published is on the list of things for me to get done in 2009 in my part-time role as Exec. Dir. There is no funding set up for this at the moment but I believe there will be in about 30 days. That's because I'm thinking that we can get funding for this by making it part of the funding for a new marketing plan for OFA that, with Wayne Augsburger and Jim Ryan, we are preparing for the OFA Board to vote on at the next con-call meeting which is on May 20 at 9:00 AM PDT.
Would you be willing to work with me and create a small team from others within OFA who have the same interest to prepare a description by May 20 of what the tutorial would look like, who would contribute to it, how to get it "polished up" for web and/or book style publication, what the overall costs would be, etc.? My thoughts, which could be a starting point for the team's work, are that we would make the creation a collective effort. The tutorial would have several sections, for example a general intro, benefits of RDMA, applicability in HPC and Enterprise, networking background, etc. Members of the Marketing Working Group would be responsible for this. The "meat" would be sections for kernel level things (verbs etc.), then user space things (verbs etc.), then APIs like MPI, SDP, EDS etc. - each section overseen by the technical leaders/maintainers of the code within OFA for that section (for example Tom Talpey for NFSoRDMA, or you, Richard, for RDS). Finally, the tutorial would have sections about the Interoperability Testing that OFA/IOL does but also what customers can do on their own systems - Arkady and Rupert and IOL have put in an SC09 tutorial proposal that we could leverage in this section. To all readers of this email:- If you have read this far, please give us all some feedback. If you have material you'd like to contribute please say so. If there's a better way, tell us what you think it is! Thanks, Bill. Bill Boas Executive Director and Vice Chair OpenFabrics Alliance 510-375-8840 Bill.Boas at openfabrics.org www.openfabrics.org -----Original Message----- From: Richard Frank [mailto:richard.frank at oracle.com] Sent: Wednesday, April 29, 2009 12:58 PM To: Andy Grover Cc: Bill Boas; Sumanta Chatterjee Subject: Re: RDMA tutorial and OFA Andy, I saw your postings to ofa-general on this and I agree it would be great to have this documentation. As OpenFabrics is really about RDMA... we need to make it simpler for folks to pick up and run with RDMA concepts ...vs..
digging thru the IB specs and code examples, etc. Let's see what Bill Boas thinks...perhaps OFA has a writer on board that can help us do this..? I can also help provide input for a new OFA RDMA tutorial doc.. Rick Andy Grover wrote: > Hi Rick, > > Are you around for a brief chat this afternoon? I have a crazy idea that > involves OFA doing something (or putting up $$) and I wanted to see what > you thought, since you're Oracle's OFA rep, right? > > -- Andy > > From bwbarre at sandia.gov Thu Apr 30 10:51:32 2009 From: bwbarre at sandia.gov (Barrett, Brian W) Date: Thu, 30 Apr 2009 11:51:32 -0600 Subject: [ofa-general] New proposal for memory management In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F42B5C8301@orsmsx507.amr.corp.intel.com> Message-ID: On 4/30/09 11:37 , "Woodruff, Robert J" wrote: > Brian wrote, >> In short, while I'd love to see tag matching, I'd rather make sure all the >> other issues get solved properly first. Otherwise, we've just added another >> interface that drives me up the wall. > > Well, I agree that the other issues will need to be solved in any case, as > even if we were able to get tag-matching into all the hardware, the lead > time for this would be very long, so MPIs are going to have to deal with > using the current verbs for the foreseeable future. > > I am still not sure that having the OFA kernel developers create and manage > a registration cache in the kernel is a good idea. As Alexander pointed out, I > am not sure > that a one-size-fits-all memory registration cache can be developed that meets > the needs of all applications. > > Of course, if memory registration were not so expensive and could simply be done > for each operation without a big performance penalty, then we would not > need caching at all. So another way to fix this is to just fix the hardware so > memory registration can be done in the speed path, but this is unfortunately > something that will likely not happen either :-( Well, here's the situation today.
Every MPI implementation out there has a registration cache to be competitive on performance. And every MPI implementation uses one of a small number of hacks to figure out when memory is given back to the OS, all of which are broken in at least one well-known, subtle way. So today every MPI implementation (which, if marketing folk are to be believed, is a large portion of IB's business) is doing dangerous things to compete on performance. All we want is *SOMETHING* we can do that we know is safe. The registration cache seems like the safest, but the notifier would be better than where we are today if we can get the important race conditions out of it. To answer Alexander's question, if the kernel cache doesn't work for some particular application, an MPI implementation can make the choice to go back to the unsafe practices and run the cache entirely within the MPI implementation. Personally, I don't think many would, but the option would not have been removed. The current state of the field is unacceptably stupid and I'm actually amazed there's any resistance to fixing the problem. We need both short-term and long-term solutions (which might not be the same, if hardware people want to tackle the problem properly), and they need to be usable by a variety of ULPs. Brian -- Brian W. Barrett Dept. 1423: Scalable System Software Sandia National Laboratories From gus at ldeo.columbia.edu Thu Apr 30 11:06:30 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 30 Apr 2009 14:06:30 -0400 Subject: [ofa-general] RE: RDMA tutorial and OFA In-Reply-To: <11F9001990984F12B589A718D8976045@BillGWAYLAPTOP> References: <49F8A59C.3070001@oracle.com> <49F8B149.2050904@oracle.com> <11F9001990984F12B589A718D8976045@BillGWAYLAPTOP> Message-ID: <49F9E8A6.5030807@ldeo.columbia.edu> Bill Boas wrote: > Richard, Andy, > > Thanks for copying me, Richard. I had not seen Andy's email on the general > list.
> > Figuring out how to get tutorial and other documentation created and > published in the list of things to get done in 2009 for me in my part-time > role as Exec. Dir. > > There is no funding set up for this at the moment but I believe there will > be in about 30 days. > > That's because I'm thinking that we can get funding for this by making it > part of the funding for a new marketing plan for OFA that, with Wayne > Augsburger and Jim Ryan, we are preparing for the OFA Board to vote on at > the next con-call meeting which is on May 20 at 9.00AM PDT. > > Would you be willing to work with me and create a small team from others > within OFA who have the same interest to prepare a description by May 20 of > what the tutorial would look like, who would contribute to it, how to get it > "polished up" for web and/or book style publication, what the overall costs > would be, etc. > > My thoughts, that could be a starting point for the team's work, are that we > would make the creation a collective effort. > > The tutorial would have several sections for example general intro, benefits > of RDMA, applicability in HPC and Enterprise, networking background etc. > Members of the Marketing Working Group would be responsible for this. > > The "meat" would be sections for kernel level things (verbs etc.), then user > space things (verbs etc.), then APIs like MPI, SDP, EDS etc. - each section > overseen by the technical leaders/maintainers of the code within OFA for > that section (for Example Tom Talpey for NFSoRDMA, or you Richard for RDS) > > Finally the tutorial would have sections about Interoperability Testing that > OFA/IOL does but also what customers can do on there own systems - Arkady > and Rupert and IOL have put in an SC09 tutorial proposal that we could > leverage in this section. > > To all readers of this email:- > If you have read this far, please give us all some feedback. If you have > material you'd like to contribute please say so. 
If there's a better way, > tell us what you think it is! > Hi All For newbies to IB/RDMA/OFA like me, a good set of FAQs would be helpful. (The OpenMPI FAQs are a good example of informal but informative documentation to look at.) Maybe an FAQ list can be put together and made available with less effort or funding, and hopefully right away, while the full tutorial is being worked on. Thank you. Gus Correa. --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- > Thanks, > > Bill. > > Bill Boas > Executive Director and Vice Chair > OpenFabrics Alliance > 510-375-8840 > Bill.Boas at openfabrics.org > www.openfabrics.org > > -----Original Message----- > From: Richard Frank [mailto:richard.frank at oracle.com] > Sent: Wednesday, April 29, 2009 12:58 PM > To: Andy Grover > Cc: Bill Boas; Sumanta Chatterjee > Subject: Re: RDMA tutorial and OFA > > Andy, I saw your postings to ofa-general on this and I agree it would be > great to have this documentation. > > As OpenFabrics is really about RDMA... we need to make it simpler > for folks to pick up and run with RDMA concepts ...vs.. digging thru the IB > specs and code examples, etc. > > Let's see what Bill Boas thinks...perhaps OFA has a writer on board that > can help us do this..? > > I can also help provide input for a new OFA RDMA tutorial doc.. > > Rick > > Andy Grover wrote: >> Hi Rick, >> >> Are you around for a brief chat this afternoon? I have a crazy idea that >> involves OFA doing something (or putting up $$) and I wanted to see what >> you thought, since you're Oracle's OFA rep, right? 
>> >> -- Andy >> >> > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From tmtalpey at gmail.com Thu Apr 30 11:24:47 2009 From: tmtalpey at gmail.com (Tom Talpey) Date: Thu, 30 Apr 2009 14:24:47 -0400 Subject: [ofa-general] Re: New proposal for memory management In-Reply-To: References: <20090429215508.GW4431@obsidianresearch.com> Message-ID: <49f9ecf6.02c3f10a.1bb5.ffff959e@mx.google.com> At 06:11 PM 4/29/2009, Barrett, Brian W wrote: >On 4/29/09 15:55 , "Jason Gunthorpe" >wrote: > >>> The problem is that MPI needs to be aware of the application doing >>> the free() and unregister or flush its MR cache for that virtual >>> address range. Of course it would be difficult for OpenMPI to have >>> callbacks or hooks into every way memory could be allocated/freed >>> that an application might use. >> >> There are only three calls that affect the way VM memory maps to >> physical and thus would invalidate the mr cache: mmap, munmap and brk. > >There's also System V shared memory, which at least one scientific code out >there uses. Don't forget fork, vfork, clone and exec, and also don't forget any copy-on-write mappings that result. Oh, and those pesky stack pages. I think the point is that making any guarantees that memory remains fixed and present will inevitably lead to nontransparent API requirements on the applications. Been there, done that, got plenty of t-shirts. It's a hard road, because APIs are forever. Tom. > >> Specifically what must be happening is the app registers memory, calls >> munmap on it, then gets the same VA back from mmap and the kernel >> level mr is still pointing to the original mmap: >> >> foo = mmap(...); >> ibv_reg_mr(mr,foo) >> munmap(foo..) >> mmap(...) 
== foo; // By chance due to VA randomization >> // Ooops, mr no longer matches proc/self/maps >> >> Actually, maybe that is the simple answer here - have the kernel fixup >> the mr before returning from the 2nd mmap. Then the cache in user >> space is still correct to assume that VA XX is registered and working. > >Yeah, although that could get really nasty as there's generally not one call >to ibv_reg_mr per call to mmap. It's usually a couple of calls to >ibv_reg_mr for different segments of the same mmap buffer (think sending >faces of a 3-d block of space to the nearest neighbors in a physics >simulation). > >> Removing entries from the registration cache would have to be done in >> some other way (age?). > >Brian > >-- > Brian W. Barrett > Dept. 1423: Scalable System Software > Sandia National Laboratories > >_______________________________________________ >general mailing list >general at lists.openfabrics.org >http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > >To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general From arkady.kanevsky at gmail.com Thu Apr 30 11:26:55 2009 From: arkady.kanevsky at gmail.com (arkady kanevsky) Date: Thu, 30 Apr 2009 14:26:55 -0400 Subject: [ofa-general] Re: RDMA tutorial and OFA In-Reply-To: <11F9001990984F12B589A718D8976045@BillGWAYLAPTOP> References: <49F8A59C.3070001@oracle.com> <49F8B149.2050904@oracle.com> <11F9001990984F12B589A718D8976045@BillGWAYLAPTOP> Message-ID: <517c62fb0904301126n277faae7ye816eb496d985945@mail.gmail.com> Keep me in the loop.I am interested to do it also. Thanks, Arkady On Thu, Apr 30, 2009 at 1:39 PM, Bill Boas wrote: > Richard, Andy, > > Thanks for copying me Richard. I had not seen Andy's email on the general > list. > > Figuring out how to get tutorial and other documentation created and > published in the list of things to get done in 2009 for me in my part-time > role as Exec. Dir. 
> > There is no funding set up for this at the moment but I believe there will > be in about 30 days. > > That's because I'm thinking that we can get funding for this by making it > part of the funding for a new marketing plan for OFA that, with Wayne > Augsburger and Jim Ryan, we are preparing for the OFA Board to vote on at > the next con-call meeting which is on May 20 at 9.00AM PDT. > > Would you be willing to work with me and create a small team from others > within OFA who have the same interest to prepare a description by May 20 of > what the tutorial would look like, who would contribute to it, how to get > it > "polished up" for web and/or book style publication, what the overall costs > would be, etc. > > My thoughts, that could be a starting point for the team's work, are that > we > would make the creation a collective effort. > > The tutorial would have several sections for example general intro, > benefits > of RDMA, applicability in HPC and Enterprise, networking background etc. > Members of the Marketing Working Group would be responsible for this. > > The "meat" would be sections for kernel level things (verbs etc.), then > user > space things (verbs etc.), then APIs like MPI, SDP, EDS etc. - each section > overseen by the technical leaders/maintainers of the code within OFA for > that section (for Example Tom Talpey for NFSoRDMA, or you Richard for RDS) > > Finally the tutorial would have sections about Interoperability Testing > that > OFA/IOL does but also what customers can do on there own systems - Arkady > and Rupert and IOL have put in an SC09 tutorial proposal that we could > leverage in this section. > > To all readers of this email:- > If you have read this far, please give us all some feedback. If you have > material you'd like to contribute please say so. If there's a better way, > tell us what you think it is! > > Thanks, > > Bill. 
> > Bill Boas > Executive Director and Vice Chair > OpenFabrics Alliance > 510-375-8840 > Bill.Boas at openfabrics.org > www.openfabrics.org > > -----Original Message----- > From: Richard Frank [mailto:richard.frank at oracle.com] > Sent: Wednesday, April 29, 2009 12:58 PM > To: Andy Grover > Cc: Bill Boas; Sumanta Chatterjee > Subject: Re: RDMA tutorial and OFA > > Andy, I saw your postings to ofa-general on this and I agree it would be > great to have this documentation. > > As OpenFabrics is really about RDMA... we need to make it simpler > for folks to pick up and run with RDMA concepts ...vs.. digging thru the IB > specs and code examples, etc. > > Let's see what Bill Boas thinks...perhaps OFA has a writer on board that > can help us do this..? > > I can also help provide input for a new OFA RDMA tutorial doc.. > > Rick > > Andy Grover wrote: > > Hi Rick, > > > > Are you around for a brief chat this afternoon? I have a crazy idea that > > involves OFA doing something (or putting up $$) and I wanted to see what > > you thought, since you're Oracle's OFA rep, right? > > > > -- Andy > > > > > > -- Cheers, Arkady Kanevsky -------------- next part -------------- An HTML attachment was scrubbed... URL: From tmtalpey at gmail.com Thu Apr 30 11:28:13 2009 From: tmtalpey at gmail.com (Tom Talpey) Date: Thu, 30 Apr 2009 14:28:13 -0400 Subject: [ofa-general] Re: adding a purely software based RDMA driver In-Reply-To: References: Message-ID: <49f9edc5.48c3f10a.3de2.ffffb115@mx.google.com> At 06:08 AM 4/30/2009, Bernard Metzler wrote: > >Roland, Or, > >absolutely right. we do not intend nor are willing or entitled to dump >a 'softiwarp product' here but are still in the internal process of >getting something open sourced soon - to be reviewed by the community >and eventually addded to mainline Linux kernel when appropriate. 
>Philip was asking, since building an rpm for the OFED installation >procedure might be helpful for our internal usage at this point in >time - but it is not at high priority. so, lets discuss the code, if >it is here. I think this is a wonderful place to discuss and review any such code! Same as for other RDMA providers, of any stripe. Do you expect any major delays in being able to release it? I, for one, am eagerly awaiting it. ;-) Tom. From jgunthorpe at obsidianresearch.com Thu Apr 30 11:36:11 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 30 Apr 2009 12:36:11 -0600 Subject: [ofa-general] Re: New proposal for memory management In-Reply-To: <49f9ecf6.02c3f10a.1bb5.ffff959e@mx.google.com> References: <20090429215508.GW4431@obsidianresearch.com> <49f9ecf6.02c3f10a.1bb5.ffff959e@mx.google.com> Message-ID: <20090430183611.GA3475@obsidianresearch.com> On Thu, Apr 30, 2009 at 02:24:47PM -0400, Tom Talpey wrote: > At 06:11 PM 4/29/2009, Barrett, Brian W wrote: > >On 4/29/09 15:55 , "Jason Gunthorpe" > >wrote: > > > >>> The problem is that MPI needs to be aware of the application doing > >>> the free() and unregister or flush its MR cache for that virtual > >>> address range. Of course it would be difficult for OpenMPI to have > >>> callbacks or hooks into every way memory could be allocated/freed > >>> that an application might use. > >> > >> There are only three calls that affect the way VM memory maps to > >> physical and thus would invalidate the mr cache: mmap, munmap and brk. > > > >There's also System V shared memory, which at least one scientific code out > >there uses. > > Don't forget fork, vfork, clone and exec, and also don't forget any > copy-on-write mappings that result. Oh, and those pesky stack pages. Verbs just plain can't inherit a QP and MR across fork and all the related calls. There is no way to split a QP and an MR across two processes and have things still make sense. One process gets it, so there isn't really a problem.
Stack pages are mostly fixed in VM address space, particularly if you give up MAP_GROWSDOWN, which I think most threading libraries do these days? Anyway, it doesn't matter: if the kernel has a syscall to enable registering all VM in a process, then the kernel is perfectly able to capture all the weird cases and fix them up. The API to userspace is very simple and sane. > I think the point is that making any guarantees that memory remains fixed > and present will inevitably lead to nontransparent API requirements on the > applications. Been there, done that, got plenty of t-shirts. It's a hard road, > because APIs are forever. That's why we are here: the MPI spec says all memory in a process is valid to use with a network operation - that is utterly incompatible with verbs' notion of memory registration. Jason From richard.frank at oracle.com Thu Apr 30 11:41:35 2009 From: richard.frank at oracle.com (Richard Frank) Date: Thu, 30 Apr 2009 14:41:35 -0400 Subject: [ofa-general] Re: adding a purely software based RDMA driver In-Reply-To: <49f9edc5.48c3f10a.3de2.ffffb115@mx.google.com> References: <49f9edc5.48c3f10a.3de2.ffffb115@mx.google.com> Message-ID: <49F9F0DF.7060603@oracle.com> How does this soft iWARP solution compare to the work done at Ohio Supercomputer Center? Is that stack also available? www.osc.edu/research/network_file/projects/*iwarp*/papers/dalessandro_ipdps2006_cac.pdf > At 06:08 AM 4/30/2009, Bernard Metzler wrote: > >> Roland, Or, >> >> absolutely right. we do not intend nor are willing or entitled to dump >> a 'softiwarp product' here but are still in the internal process of >> getting something open sourced soon - to be reviewed by the community >> and eventually addded to mainline Linux kernel when appropriate. >> Philip was asking, since building an rpm for the OFED installation >> procedure might be helpful for our internal usage at this point in >> time - but it is not at high priority. so, lets discuss the code, if >> it is here.
>> > > I think this is a wonderful place to discuss and review any such code! > Same as for other RDMA providers, of any stripe. > > Do you expect any major delays in being able to release it? I, for one, > am eagerly awaiting it. ;-) > > Tom. > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > From tmtalpey at gmail.com Thu Apr 30 12:21:42 2009 From: tmtalpey at gmail.com (Tom Talpey) Date: Thu, 30 Apr 2009 15:21:42 -0400 Subject: [ofa-general] Re: New proposal for memory management In-Reply-To: <20090430183611.GA3475@obsidianresearch.com> References: <20090429215508.GW4431@obsidianresearch.com> <49f9ecf6.02c3f10a.1bb5.ffff959e@mx.google.com> <20090430183611.GA3475@obsidianresearch.com> Message-ID: <49f9fa69.15045a0a.62e8.ffffdd68@mx.google.com> At 02:36 PM 4/30/2009, Jason Gunthorpe wrote: >On Thu, Apr 30, 2009 at 02:24:47PM -0400, Tom Talpey wrote: >> At 06:11 PM 4/29/2009, Barrett, Brian W wrote: >> >On 4/29/09 15:55 , "Jason Gunthorpe" >> >wrote: >> > >> >>> The problem is that MPI needs to be aware of the application doing >> >>> the free() and unregister or flush its MR cache for that virtual >> >>> address range. Of course it would be difficult for OpenMPI to have >> >>> callbacks or hooks into every way memory could be allocated/freed >> >>> that an application might use. >> >> >> >> There are only three calls that affect the way VM memory maps to >> >> physical and thus would invalidate the mr cache: mmap, munmap and brk. >> > >> >There's also System V shared memory, which at least one scientific code out >> >there uses. >> >> Don't forget fork, vfork, clone and exec, and also don't forget any >> copy-on-write mappings that result. Oh, and those pesky stack pages. > >verbs just plain can't inherit QP and MR across fork and all the >related. 
There is no way to split a QP and a MR across to processes >and have things still make sense. One process gets it, so there isn't >really a problem. > >Stack pages are mostly fixed in VM address space, particularly if you >give up MAP_GROWSDOWN which I think most threading libraries do these >days? > >Anyway, it doesn't matter, if the kernel has a syscall to enable >registering all VM in a process then the kernel is perfectly able to >capture all the wierd cases and fix them up. The API to userspace is >very simple and sane. Not sure I agree with the "perfectly able" part. What if the process' stack grows after the syscall? Do the extra pages magically become registered? What if the adapter's page table required remapping to cover the extra pages, maybe changing the memory handle? Also, does this get harder or easier if there's an IOMMU in the loop? With direct access to I/O mapping, there are additional degrees of freedom to rearrange pages. OTOH, there's an additional layer to manage far beyond the page-wiring and mapping of userspace. > >> I think the point is that making any guarantees that memory remains fixed >> and present will inevitably lead to nontransparent API requirements on the >> applications. Been there, done that, got plenty of t-shirts. It's a >hard road, >> because APIs are forever. > >That's why we are here, the MPI spec says all memory in a process is >valid to use with a network operation - that is utterly incompatible >with verb's notion of memory registration. Of course I agree. I am however questioning the goal of making this 100% transparent. To do so, both sides of the interface will need to behave in specific ways (e.g. no forking), and that can be very, very difficult to achieve. I'm not trying to argue against this, btw. But I do think it's prohibitively hard, without making requirements on the applications, which in turn are very difficult to change. Choose the requirements carefully, IOW. Tom. 
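The hazard debated in this thread can be illustrated with a toy model. What follows is only a sketch under stated assumptions, not how Open MPI, any other MPI implementation, or the proposed kernel interface actually works. It makes no verbs calls at all: the integer `handle` merely stands in for whatever a real registration would return, and every name in it (`reg_cache`, `cache_register`, `cache_lookup`, `cache_invalidate`) is hypothetical. It shows why a cache keyed on virtual address keeps answering with an old handle unless something tells it that the range went away.

```c
/* Toy model of a user-space memory-registration cache.
 * Assumption: no real verbs calls are made; "handle" is a stand-in for
 * a memory-region handle. All names here are hypothetical. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 8

struct cache_entry {
    uintptr_t addr;   /* start of the registered virtual range */
    size_t    len;    /* length of the range in bytes */
    int       handle; /* stand-in for a real registration handle */
    int       valid;  /* non-zero while the entry is in use */
};

struct reg_cache {
    struct cache_entry slot[CACHE_SLOTS];
    int next_handle;  /* next handle value to hand out */
};

static void cache_init(struct reg_cache *c)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        c->slot[i].valid = 0;
    c->next_handle = 1;
}

/* Return the cached handle covering [addr, addr+len), or -1 on a miss. */
static int cache_lookup(const struct reg_cache *c, uintptr_t addr, size_t len)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        const struct cache_entry *e = &c->slot[i];
        if (e->valid && addr >= e->addr && addr + len <= e->addr + e->len)
            return e->handle;
    }
    return -1;
}

/* "Register" a range: reuse a cached handle if one covers it, otherwise
 * take a free slot and issue a new handle. Returns -1 if the cache is full. */
static int cache_register(struct reg_cache *c, uintptr_t addr, size_t len)
{
    assert(len > 0);
    int h = cache_lookup(c, addr, len);
    if (h != -1)
        return h;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        struct cache_entry *e = &c->slot[i];
        if (!e->valid) {
            e->addr = addr;
            e->len = len;
            e->handle = c->next_handle++;
            e->valid = 1;
            return e->handle;
        }
    }
    return -1;
}

/* Drop every entry that overlaps [addr, addr+len). This is the step that
 * an munmap hook or a kernel notifier would have to trigger.
 * Returns the number of entries dropped. */
static int cache_invalidate(struct reg_cache *c, uintptr_t addr, size_t len)
{
    int dropped = 0;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        struct cache_entry *e = &c->slot[i];
        if (e->valid && addr < e->addr + e->len && e->addr < addr + len) {
            e->valid = 0;
            dropped++;
        }
    }
    return dropped;
}
```

In this model, registering a range once and then looking it up again always hits, even if the application has since unmapped the range and been handed the same virtual address for a different mapping; that hit is the stale entry Jason's `mmap`/`munmap` sequence describes. Only a call to `cache_invalidate` for the unmapped range forces the next `cache_register` to issue a fresh handle. Who is in a position to make that call reliably (the MPI library through allocator hooks, or the kernel) is the question the thread is arguing about.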
From arkady.kanevsky at gmail.com Thu Apr 30 12:47:00 2009 From: arkady.kanevsky at gmail.com (arkady kanevsky) Date: Thu, 30 Apr 2009 15:47:00 -0400 Subject: [ofa-general] Re: RDMA tutorial and OFA In-Reply-To: <004301c9c9cb$b714af60$253e0e20$@com> References: <49F8A59C.3070001@oracle.com> <49F8B149.2050904@oracle.com> <11F9001990984F12B589A718D8976045@BillGWAYLAPTOP> <004301c9c9cb$b714af60$253e0e20$@com> Message-ID: <517c62fb0904301247n1946ed25uc6b4801d857c90ac@mail.gmail.com> I suspect more like a book along the lines of VIPL one from Intel. On Thu, Apr 30, 2009 at 3:41 PM, Paul Grun wrote: > In any event, I would be happy to volunteer to serve as an editor. Since > I'm a hardware guy I'm probably not a good choice to volunteer to author > any > of the technical sections. > > Did you really mean a tutorial, which to me implies a > presentation/classroom > format, or did you mean tutorial in the more colloquial sense of capturing > the RDMA folklore in written form? > > -Paul > > Paul Grun > Chief Scientist > System Fabric Works, Inc > Office: (503) 620-8757 > Cell : (503) 703-5382 > > Fabric Computing that works > > > > -----Original Message----- > From: Bill Boas [mailto:Bill.Boas at openfabrics.org] > Sent: Thursday, April 30, 2009 10:39 AM > To: 'Richard Frank'; 'Andy Grover' > Cc: 'Sumanta Chatterjee'; general at lists.openfabrics.org; 'Wayne > Augsburger'; > 'Ryan, Jim'; 'Paul Gray'; 'Paul Grun'; bobs at voltaire.com; Scott Friedman; > 'Jeff Squyres'; 'Roland Dreier'; asafs`@voltaire.com; 'Rupert Dance'; > Mikkel > Hagen; 'arkady kanevsky'; 'OFA Marketing Working Group'; > iwg at lists.openfabrics.org > Subject: RE: RDMA tutorial and OFA > > Richard, Andy, > > Thanks for copying me Richard. I had not seen Andy's email on the general > list. > > Figuring out how to get tutorial and other documentation created and > published in the list of things to get done in 2009 for me in my part-time > role as Exec. Dir. 
> > There is no funding set up for this at the moment but I believe there will > be in about 30 days. > > That's because I'm thinking that we can get funding for this by making it > part of the funding for a new marketing plan for OFA that, with Wayne > Augsburger and Jim Ryan, we are preparing for the OFA Board to vote on at > the next con-call meeting which is on May 20 at 9.00AM PDT. > > Would you be willing to work with me and create a small team from others > within OFA who have the same interest to prepare a description by May 20 of > what the tutorial would look like, who would contribute to it, how to get > it > "polished up" for web and/or book style publication, what the overall costs > would be, etc. > > My thoughts, that could be a starting point for the team's work, are that > we > would make the creation a collective effort. > > The tutorial would have several sections for example general intro, > benefits > of RDMA, applicability in HPC and Enterprise, networking background etc. > Members of the Marketing Working Group would be responsible for this. > > The "meat" would be sections for kernel level things (verbs etc.), then > user > space things (verbs etc.), then APIs like MPI, SDP, EDS etc. - each section > overseen by the technical leaders/maintainers of the code within OFA for > that section (for Example Tom Talpey for NFSoRDMA, or you Richard for RDS) > > Finally the tutorial would have sections about Interoperability Testing > that > OFA/IOL does but also what customers can do on there own systems - Arkady > and Rupert and IOL have put in an SC09 tutorial proposal that we could > leverage in this section. > > To all readers of this email:- > If you have read this far, please give us all some feedback. If you have > material you'd like to contribute please say so. If there's a better way, > tell us what you think it is! > > Thanks, > > Bill. 
> > Bill Boas > Executive Director and Vice Chair > OpenFabrics Alliance > 510-375-8840 > Bill.Boas at openfabrics.org > www.openfabrics.org > > -----Original Message----- > From: Richard Frank [mailto:richard.frank at oracle.com] > Sent: Wednesday, April 29, 2009 12:58 PM > To: Andy Grover > Cc: Bill Boas; Sumanta Chatterjee > Subject: Re: RDMA tutorial and OFA > > Andy, I saw your postings to ofa-general on this and I agree it would be > great to have this documentation. > > As OpenFabrics is really about RDMA... we need to make it simpler > for folks to pick up and run with RDMA concepts ...vs.. digging thru the IB > specs and code examples, etc. > > Let's see what Bill Boas thinks...perhaps OFA has a writer on board that > can help us do this..? > > I can also help provide input for a new OFA RDMA tutorial doc.. > > Rick > > Andy Grover wrote: > > Hi Rick, > > > > Are you around for a brief chat this afternoon? I have a crazy idea that > > involves OFA doing something (or putting up $$) and I wanted to see what > > you thought, since you're Oracle's OFA rep, right? > > > > -- Andy > > > > > -- Cheers, Arkady Kanevsky
From jsquyres at cisco.com Thu Apr 30 12:56:21 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 30 Apr 2009 15:56:21 -0400 Subject: [ofa-general] Re: New proposal for memory management In-Reply-To: <49f9fa69.15045a0a.62e8.ffffdd68@mx.google.com> References: <20090429215508.GW4431@obsidianresearch.com><49f9ecf6.02c3f10a.1bb5.ffff959e@mx.google.com><20090430183611.GA3475@obsidianresearch.com> <49f9fa69.15045a0a.62e8.ffffdd68@mx.google.com> Message-ID: <7AFB1325-2D10-4EB9-ACD3-0B50546FCF4B@cisco.com> On Apr 30, 2009, at 3:21 PM, Tom Talpey wrote: > Of course I agree. I am however questioning the goal of making this > 100% transparent. To do so, both sides of the interface will need to > behave in specific ways (e.g. no forking), and that can be very, very > difficult to achieve. > There's no goal of making this 100% transparent. Sure, that'd be nice (perhaps as a long-term goal), but that's not what I started this thread for. We're not even trying to be lazy; MPI implementations have done a *lot* of work to make MPI function properly over OpenFabrics. But as Brian said, we just need *something* that is safe and sane. Right now, we don't have that.
-- Jeff Squyres Cisco Systems From jsquyres at cisco.com Thu Apr 30 13:03:18 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 30 Apr 2009 16:03:18 -0400 Subject: [ofa-general] New proposal for memory management In-Reply-To: References: Message-ID: <659DF081-1112-47B6-9CB2-45B8D5C5E8C2@cisco.com> On Apr 30, 2009, at 1:51 PM, Barrett, Brian W wrote: > All we want is *SOMETHING* we can do that we know is safe. > This is the most important thing. +1 > The registration cache seems like the safest, > +100 > but the notifier would be better than where we are today if we can > get the important race conditions out of it. > +0.1. Yes, it would work, and I agree that it would be marginally better than what we have today. IMHO, it seems like fixing this Right would be better than "let's do a possibly simpler but sub-optimal solution." > The current state of the field is unacceptably stupid and I'm actually > amazed there's any resistance to fixing the problem. > I'm a little amazed that it's gone this long without being fixed (I know I spoke about this exact issue at Sonoma 3 years ago!). -- Jeff Squyres Cisco Systems From pashash at gmail.com Thu Apr 30 13:20:44 2009 From: pashash at gmail.com (Pavel Shamis (Pasha)) Date: Thu, 30 Apr 2009 23:20:44 +0300 Subject: [ofa-general] New proposal for memory management In-Reply-To: <928CFBE8E7CB0040959E56B4EA41A77E9BB66D10@irsmsx504.ger.corp.intel.com> References: <928CFBE8E7CB0040959E56B4EA41A77E9BB66D10@irsmsx504.ger.corp.intel.com> Message-ID: <49FA081C.50801@dev.mellanox.co.il> Supalov, Alexander wrote: > Hi, > > Mem reg caching has direct relation to the apps performance. Can we guarantee, while putting the caching into the kernel, that the algorithms used will be good for all apps? How will one control their parameters at runtime?
Will one be able to change the algorithm if necessary? > The proposal talks about interface extension and not replacement. So if an application wants to implement some exotic caching algorithm, it can do so on top of the current interface (as it actually does today). Pasha. From jsquyres at cisco.com Thu Apr 30 13:25:33 2009 From: jsquyres at cisco.com (Jeff Squyres) Date: Thu, 30 Apr 2009 16:25:33 -0400 Subject: [ofa-general] Re: [mwg] Re: RDMA tutorial and OFA In-Reply-To: <35AAF1E4A771E142979F27B51793A4885D9B5E36A0@AVEXMB1.qlogic.org> References: <49F8A59C.3070001@oracle.com> <49F8B149.2050904@oracle.com><11F9001990984F12B589A718D8976045@BillGWAYLAPTOP> <517c62fb0904301126n277faae7ye816eb496d985945@mail.gmail.com> <35AAF1E4A771E142979F27B51793A4885D9B5E36A0@AVEXMB1.qlogic.org> Message-ID: <77C0E887-0C04-4ACE-ADFD-73864D589700@cisco.com> Any information is good information. The lack of information about verbs / RDMA is definitely a hindrance. It would also be good to scrub all the old / useless / incorrect information that is out there. I believe there's a ton of outdated information on the wiki with no real indication that it's old / outdated (I see "Install OpenIB for Ammasso1100" right on the front wiki page). As Gus mentioned, even a low-hanging-fruit FAQ (that is actively maintained -- that's the killer) would be useful. Screencasts are good; those are an easy way for developers to broadcast info to the world. Programming guides are noticeably missing; you probably want several, ranging from "explaining the basics" to "advanced programming". Videotaping humans giving tutorials is good, too. Those can be easily web-hosted somewhere. Like I said, any information is good information. But to do it right, you really need one or more full-time people doing this stuff (IMHO). And no, I'm not volunteering. :-) On Apr 30, 2009, at 4:17 PM, Lloyd Dickman wrote: > I support the idea of the RDMA tutorial.
Beyond the “meat” as > described below, I would encourage the tutorial to include a “how to > program RDMA” section. While OFA Verbs provides a rich set of > mechanisms, it is difficult for the average programmer to get a > solid handle on how to use the capabilities, register memory, … > Some cookbook examples, or perhaps development of several > programming “patterns” can go a long way to having RDMA become a > much more mainstream application programming paradigm. > > Lloyd > > From: mwg-bounces at lists.openfabrics.org [mailto:mwg-bounces at lists.openfabrics.org > ] On Behalf Of arkady kanevsky > Sent: Thursday, April 30, 2009 11:27 AM > To: bill.boas at openfabrics.org > Cc: iwg at lists.openfabrics.org; Paul Grun; Paul Gray; OFA Marketing > Working Group; Wayne Augsburger; Andy Grover; Richard Frank; asafs`@voltaire.com > ; Jeff Squyres; Mikkel Hagen;general at lists.openfabrics.org; Scott > Friedman; bobs at voltaire.com; Sumanta Chatterjee; Roland Dreier > Subject: [mwg] Re: RDMA tutorial and OFA > > Keep me in the loop. > I am interested to do it also. > Thanks, > Arkady > On Thu, Apr 30, 2009 at 1:39 PM, Bill Boas > wrote: > Richard, Andy, > > Thanks for copying me Richard. I had not seen Andy's email on the > general > list. > > Figuring out how to get tutorial and other documentation created and > published in the list of things to get done in 2009 for me in my > part-time > role as Exec. Dir. > > There is no funding set up for this at the moment but I believe > there will > be in about 30 days. > > That's because I'm thinking that we can get funding for this by > making it > part of the funding for a new marketing plan for OFA that, with Wayne > Augsburger and Jim Ryan, we are preparing for the OFA Board to vote > on at > the next con-call meeting which is on May 20 at 9.00AM PDT. 
> > Would you be willing to work with me and create a small team from > others > within OFA who have the same interest to prepare a description by > May 20 of > what the tutorial would look like, who would contribute to it, how > to get it > "polished up" for web and/or book style publication, what the > overall costs > would be, etc. > > My thoughts, that could be a starting point for the team's work, are > that we > would make the creation a collective effort. > > The tutorial would have several sections for example general intro, > benefits > of RDMA, applicability in HPC and Enterprise, networking background > etc. > Members of the Marketing Working Group would be responsible for this. > > The "meat" would be sections for kernel level things (verbs etc.), > then user > space things (verbs etc.), then APIs like MPI, SDP, EDS etc. - each > section > overseen by the technical leaders/maintainers of the code within OFA > for > that section (for Example Tom Talpey for NFSoRDMA, or you Richard > for RDS) > > Finally the tutorial would have sections about Interoperability > Testing that > OFA/IOL does but also what customers can do on there own systems - > Arkady > and Rupert and IOL have put in an SC09 tutorial proposal that we could > leverage in this section. > > To all readers of this email:- > If you have read this far, please give us all some feedback. If you > have > material you'd like to contribute please say so. If there's a better > way, > tell us what you think it is! > > Thanks, > > Bill. 
> > Bill Boas > Executive Director and Vice Chair > OpenFabrics Alliance > 510-375-8840 > Bill.Boas at openfabrics.org > www.openfabrics.org > > -----Original Message----- > From: Richard Frank [mailto:richard.frank at oracle.com] > Sent: Wednesday, April 29, 2009 12:58 PM > To: Andy Grover > Cc: Bill Boas; Sumanta Chatterjee > Subject: Re: RDMA tutorial and OFA > > Andy, I saw your postings to ofa-general on this and I agree it > would be > great to have this documentation. > > As OpenFabrics is really about RDMA... we need to make it simpler > for folks to pick up and run with RDMA concepts ...vs.. digging thru > the IB > specs and code examples, etc. > > Let's see what Bill Boas thinks...perhaps OFA has a writer on board > that > can help us do this..? > > I can also help provide input for a new OFA RDMA tutorial doc.. > > Rick > > Andy Grover wrote: > > Hi Rick, > > > > Are you around for a brief chat this afternoon? I have a crazy > idea that > > involves OFA doing something (or putting up $$) and I wanted to > see what > > you thought, since you're Oracle's OFA rep, right? > > > > -- Andy > > > > > > > > -- > Cheers, > Arkady Kanevsky -- Jeff Squyres Cisco Systems From robert.j.woodruff at intel.com Thu Apr 30 13:45:33 2009 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 30 Apr 2009 13:45:33 -0700 Subject: [ofa-general] New proposal for memory management In-Reply-To: <659DF081-1112-47B6-9CB2-45B8D5C5E8C2@cisco.com> References: <659DF081-1112-47B6-9CB2-45B8D5C5E8C2@cisco.com> Message-ID: <382A478CAD40FA4FB46605CF81FE39F42B5C8631@orsmsx507.amr.corp.intel.com> Jeff wrote, >I'm a little amazed that it's gone this long without being fixed (I >know I spoke about this exact issue at Sonoma 3 years ago!). If MPIs are so broken over OFA, then can you give me the list of applications that are failing ? I did not hear from any customer at Sonoma that all the MPIs are totally broken and do not work. 
As far as I know, most if not all of the applications are running just fine with the MPIs as they are today. At least I am not aware of any applications that are failing when using Intel MPI. Sure, you need to hook malloc and such and that is a bit tricky, but it seems like you have figured out how to do it and it works. If a notifier capability would make this easier, then perhaps that should be added, but adding a memory registration cache to the kernel, that may or may not even meet the needs of all MPIs does not seem like the right approach, it will just lead to kernel bloat. Sure you have to manage your own cache, but you have chosen to do a cache to get better performance. You did not have to do a cache at all. You chose to do it to make your MPI better. To me, all this sounds like a lot of whining.... Why can't the OS fix all my problems. From pashash at gmail.com Thu Apr 30 13:49:21 2009 From: pashash at gmail.com (Pavel Shamis (Pasha)) Date: Thu, 30 Apr 2009 23:49:21 +0300 Subject: [ofa-general] New proposal for memory management In-Reply-To: <659DF081-1112-47B6-9CB2-45B8D5C5E8C2@cisco.com> References: <659DF081-1112-47B6-9CB2-45B8D5C5E8C2@cisco.com> Message-ID: <49FA0ED1.3080405@dev.mellanox.co.il> > >> The registration cache seems like the safest, >> > > +100 I absolutely agree here; it is a full solution that totally resolves the issue for the HW that we have now. > >> but the notifier would be better than where we are today if we can >> get the important race conditions out of it. >> > > +0.1. Yes, it would work, and I agree that it would be marginally > better than what we have today. IMHO, it seems like fixing this Right > would be better than "let's do a possibly simpler but sub-optimal > solution." Sounds like a solution that may work, but we would definitely introduce much more complexity here than we have now (to all MPI implementations!)... and I guess the idea was to make it functional and simple.
Pasha From swise at opengridcomputing.com Thu Apr 30 13:55:10 2009 From: swise at opengridcomputing.com (Steve Wise) Date: Thu, 30 Apr 2009 15:55:10 -0500 Subject: [ofa-general] Re: adding a purely software based RDMA driver In-Reply-To: <49F9F0DF.7060603@oracle.com> References: <49f9edc5.48c3f10a.3de2.ffffb115@mx.google.com> <49F9F0DF.7060603@oracle.com> Message-ID: <49FA102E.5070904@opengridcomputing.com> FYI: The OSC code isn't complete nor does it support full linux kernel and user iWARP verbs interfaces and semantics. Richard Frank wrote: > Who does this soft iWARP solution compare to the work done at Ohio > Supercomputer Center ? > > Is that stack also available ? > > www.osc.edu/research/network_file/projects/*iwarp*/papers/dalessandro_ipdps2006_cac.pdf > > > >> At 06:08 AM 4/30/2009, Bernard Metzler wrote: >> >>> Roland, Or, >>> >>> absolutely right. we do not intend nor are willing or entitled to dump >>> a 'softiwarp product' here but are still in the internal process of >>> getting something open sourced soon - to be reviewed by the community >>> and eventually addded to mainline Linux kernel when appropriate. >>> Philip was asking, since building an rpm for the OFED installation >>> procedure might be helpful for our internal usage at this point in >>> time - but it is not at high priority. so, lets discuss the code, if >>> it is here. >>> >> >> I think this is a wonderful place to discuss and review any such code! >> Same as for other RDMA providers, of any stripe. >> >> Do you expect any major delays in being able to release it? I, for one, >> am eagerly awaiting it. ;-) >> >> Tom. 
>> >> _______________________________________________ >> general mailing list >> general at lists.openfabrics.org >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general >> >> To unsubscribe, please visit >> http://openib.org/mailman/listinfo/openib-general >> > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general From rdreier at cisco.com Thu Apr 30 14:01:48 2009 From: rdreier at cisco.com (Roland Dreier) Date: Thu, 30 Apr 2009 14:01:48 -0700 Subject: [ofa-general] Re: adding a purely software based RDMA driver In-Reply-To: <49F9F0DF.7060603@oracle.com> (Richard Frank's message of "Thu, 30 Apr 2009 14:41:35 -0400") References: <49f9edc5.48c3f10a.3de2.ffffb115@mx.google.com> <49F9F0DF.7060603@oracle.com> Message-ID: > Is that stack also available ? I think so, just follow the links under "Download the source" from http://www.osc.edu/research/network_file/projects/iwarp/iwarp_main.shtml - R. From arkady.kanevsky at gmail.com Thu Apr 30 14:25:56 2009 From: arkady.kanevsky at gmail.com (arkady kanevsky) Date: Thu, 30 Apr 2009 17:25:56 -0400 Subject: [ofa-general] New proposal for memory management In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F42B5C8631@orsmsx507.amr.corp.intel.com> References: <659DF081-1112-47B6-9CB2-45B8D5C5E8C2@cisco.com> <382A478CAD40FA4FB46605CF81FE39F42B5C8631@orsmsx507.amr.corp.intel.com> Message-ID: <517c62fb0904301425y2bb7b468qfd2cadd7d41f15d1@mail.gmail.com> Jeff,are the MPI applications that are broken are the ones which is malloc/free instead of MPI_ALLOC calls? Arkady On Thu, Apr 30, 2009 at 4:45 PM, Woodruff, Robert J < robert.j.woodruff at intel.com> wrote: > Jeff wrote, > > >I'm a little amazed that it's gone this long without being fixed (I > >know I spoke about this exact issue at Sonoma 3 years ago!). 
> > If MPIs are so broken over OFA, then can you give me the list of > applications > that are failing ? I did not hear from any customer at Sonoma that all > the MPIs are totally broken and do not work. > As far as I know, most if not all of the applications are running > just fine with the MPIs as they are today. At least I am not aware of > any applications that are failing when using Intel MPI. Sure, you need to > hook > malloc and such and that is a bit tricky, but it seems like you > have figured out how to do it and it works. If a notifier capability would > make this easier, > then perhaps that should be added, but adding a memory registration cache > to the kernel, > that may or may not even meet the needs of all MPIs does not seem > like the right approach, it will just lead to kernel bloat. > > Sure you have to manage your own cache, but you have chosen to > do a cache to get better performance. You did not have to do a cache at > all. You chose to do it to make your MPI better. > > To me, all this sounds like a lot of whining.... > Why can't the OS fix all my problems. > > _______________________________________________ > general mailing list > general at lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > -- Cheers, Arkady Kanevsky -------------- next part -------------- An HTML attachment was scrubbed... URL: From weiny2 at llnl.gov Thu Apr 30 14:29:41 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 30 Apr 2009 14:29:41 -0700 Subject: [ofa-general] [PATCH 0/3] Implement partial scan of fabric when tools request a single switch. (WAS Re: [PATCH 8/8] Convert ibqueryerrors.pl to C and use new ibnetdisc library.) 
In-Reply-To: <20090427145026.7e074ffc.weiny2@llnl.gov> References: <20090423133120.acf0af63.weiny2@llnl.gov> <20090425155441.GE28604@sk> <20090427145026.7e074ffc.weiny2@llnl.gov> Message-ID: <20090430142941.4ab8eba4.weiny2@llnl.gov> Hey Sasha, See below: On Mon, 27 Apr 2009 14:50:26 -0700 Ira Weiny wrote: > On Sat, 25 Apr 2009 18:54:41 +0300 > Sasha Khapyorsky wrote: > > > On 13:31 Thu 23 Apr , Ira Weiny wrote: [snip] > > > + } > > > + > > > + report_suppressed(); > > > + > > > + if (switch_guid) { > > > + ibnd_node_t *node = ibnd_find_node_guid(fabric, switch_guid); > > > + print_node(node, NULL); > > > + } else if (dr_path) { > > > + ibnd_node_t *node = ibnd_find_node_dr(fabric, dr_path); > > > + print_node(node, NULL); > > > > When GUID or DR Path are specified we don't need to discover whole > > fabric, but can try to resolve LID using SA or querying PortInfo. > > > > Although when in GUID is specified and SA is not responsive there is > > probably no other choice than discover. > > > > :-( good point. Discovering only part of the fabric was a huge speed > improvement but if the resolve does not succeed I should do a full discover. > > I will work up a separate patch. Right now you are correct if the SA is > unresponsive the "-S" option will fail. iblinkinfo does the full scan every > time. But that slows down the query for a single switch to the same O(n) > query that a full system scan requires. I would rather have that query be > O(1). So I implemented ibqueryerrors in this manner with the intent of going > back and "fixing" iblinkinfo. I think having a fall back on a full system > scan is a good idea. Patch for both tools will follow... :-D > Following up on what I said I would do above. This turned out to be a 3 patch series. Summary of patches (to pq/ibn4) are below. 1/3) I had to fix libibmad DRSLID/DRDLID fields. (See my 1 man thread from a couple of days ago... 
;-) 2/3) ibnetdiscover (and therefore libibnetdisc) was not designed to use combined DR routing for discovery. Add combined routing support to libibnetdisc. 3/3) Make iblinkinfo resolve a portid from GUID. Then change iblinkinfo and ibqueryerrors to attempt a single hop discovery around that portid. If either resolving the GUID or discover fails they both attempt a full discover. Then they print the single switch asked for on the command line. Patches follow, Ira From weiny2 at llnl.gov Thu Apr 30 14:29:50 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 30 Apr 2009 14:29:50 -0700 Subject: [ofa-general] [PATCH 1/3] Fix reversal of DRSLID and DRDLID in MAD_FIELDS enum Message-ID: <20090430142950.85ef6368.weiny2@llnl.gov> From: Ira Weiny Date: Thu, 30 Apr 2009 11:19:26 -0700 Subject: [PATCH] Fix reversal of DRSLID and DRDLID in MAD_FIELDS enum. Signed-off-by: Ira Weiny --- libibmad/include/infiniband/mad.h | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h index b11f778..4492414 100644 --- a/libibmad/include/infiniband/mad.h +++ b/libibmad/include/infiniband/mad.h @@ -247,8 +247,8 @@ enum MAD_FIELDS { IB_MAD_MKEY_F, /* word 9 (32-37 bytes) */ - IB_DRSMP_DRSLID_F, IB_DRSMP_DRDLID_F, + IB_DRSMP_DRSLID_F, /* word 10,11 (36-43 bytes) */ IB_SA_MKEY_F, -- 1.5.4.5 From weiny2 at llnl.gov Thu Apr 30 14:29:58 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 30 Apr 2009 14:29:58 -0700 Subject: [ofa-general] [PATCH 2/3] Add combined routing support to libibnetdisc Message-ID: <20090430142958.5811218f.weiny2@llnl.gov> From: Ira Weiny Date: Wed, 29 Apr 2009 10:15:55 -0700 Subject: [PATCH] Add combined routing support to libibnetdisc Also allow a scan to start at a switch. 
Signed-off-by: Ira Weiny --- infiniband-diags/libibnetdisc/src/ibnetdisc.c | 28 ++++++++++++++++++------ 1 files changed, 21 insertions(+), 7 deletions(-) diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c index 0ff5134..fc19633 100644 --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c @@ -177,11 +177,26 @@ add_port_to_dpath(ib_dr_path_t *path, int nextport) } static int -extend_dpath(struct ibnd_fabric *f, ib_dr_path_t *path, int nextport) +extend_dpath(struct ibnd_fabric *f, ib_portid_t *portid, int nextport) { - int rc = add_port_to_dpath(path, nextport); - if ((rc != -1) && (path->cnt > f->fabric.maxhops_discovered)) - f->fabric.maxhops_discovered = path->cnt; + int rc = 0; + + if (portid->lid && !portid->drpath.drslid) { + /* If we were LID routed + * AND have not done so already + * we need to set up the drslid + */ + ib_portid_t selfportid = { 0 }; + if (ib_resolve_self_via(&selfportid, NULL, NULL, f->fabric.ibmad_port) < 0) + return -1; + portid->drpath.drslid = selfportid.lid; + portid->drpath.drdlid = 0xFFFF; + } + + rc = add_port_to_dpath(&portid->drpath, nextport); + + if ((rc != -1) && (portid->drpath.cnt > f->fabric.maxhops_discovered)) + f->fabric.maxhops_discovered = portid->drpath.cnt; return (rc); } @@ -447,7 +462,7 @@ get_remote_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_ != IB_PORT_PHYS_STATE_LINKUP) return -1; - if (extend_dpath(fabric, &path->drpath, portnum) < 0) + if (extend_dpath(fabric, path, portnum) < 0) return -1; if (query_node(fabric, &node_buf, &port_buf, path)) { @@ -546,8 +561,7 @@ ibnd_discover_fabric(struct ibmad_port *ibmad_port, int timeout_ms, if (!port) IBPANIC("out of memory"); - if (node->node.type != IB_NODE_SWITCH && - get_remote_node(fabric, node, port, from, + if(get_remote_node(fabric, node, port, from, mad_get_field(node->node.info, 0, IB_NODE_LOCAL_PORT_F), 0) < 0) return 
((ibnd_fabric_t *)fabric); -- 1.5.4.5 From weiny2 at llnl.gov Thu Apr 30 14:30:02 2009 From: weiny2 at llnl.gov (Ira Weiny) Date: Thu, 30 Apr 2009 14:30:02 -0700 Subject: [ofa-general] [PATCH 3/3] Modify '-S' option of iblinkinfo and ibqueryerrors to do a limited scan of the fabric first and then fall back to a full scan which searches for the GUID. Message-ID: <20090430143002.89262384.weiny2@llnl.gov> From: Ira Weiny Date: Tue, 28 Apr 2009 16:38:38 -0700 Subject: [PATCH] Modify '-S' option of iblinkinfo and ibqueryerrors to do a limited scan of the fabric first and then fall back to a full scan which searches for the GUID. Signed-off-by: Ira Weiny --- infiniband-diags/src/iblinkinfo.c | 24 ++++++++++++++++++------ infiniband-diags/src/ibqueryerrors.c | 20 +++++++------------- 2 files changed, 25 insertions(+), 19 deletions(-) diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c index a8a93de..2454bf2 100644 --- a/infiniband-diags/src/iblinkinfo.c +++ b/infiniband-diags/src/iblinkinfo.c @@ -262,13 +262,14 @@ main(int argc, char **argv) int ca_port = 0; ibnd_fabric_t *fabric = NULL; uint64_t guid = 0; + char *guid_str = NULL; char *dr_path = NULL; char *from = NULL; int hops = 0; - ib_portid_t port_id; + ib_portid_t port_id = {0}; struct ibmad_port *ibmad_port; - int mgmt_classes[2] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS}; + int mgmt_classes[3] = {IB_SMI_CLASS, IB_SMI_DIRECT_CLASS, IB_SA_CLASS}; static char const str_opts[] = "S:D:n:C:P:t:sldgphuf:R"; static const struct option long_opts[] = { @@ -339,7 +340,8 @@ main(int argc, char **argv) print_port_guids = 1; break; case 'S': - guid = (uint64_t)strtoull(optarg, 0, 0); + guid_str = optarg; + guid = (uint64_t)strtoull(guid_str, 0, 0); break; case 'p': add_sw_settings = 1; @@ -358,7 +360,7 @@ main(int argc, char **argv) if (argc && !(f = fopen(argv[0], "w"))) fprintf(stderr, "can't open file %s for writing", argv[0]); - ibmad_port = mad_rpc_open_port(ca, ca_port, mgmt_classes, 2); + 
ibmad_port = mad_rpc_open_port(ca, ca_port, mgmt_classes, 3); if (!ibmad_port) { fprintf(stderr, "Failed to open %s port %d", ca, ca_port); exit(1); @@ -375,13 +377,23 @@ main(int argc, char **argv) goto close_port; } guid = 0; - } else { + } else if (guid_str) { + if (ib_resolve_portid_str_via(&port_id, guid_str, IB_DEST_GUID, + NULL, ibmad_port) >= 0) { + if ((fabric = ibnd_discover_fabric(ibmad_port, + timeout_ms, &port_id, 1)) == NULL) + IBWARN("Single node discover failed; attempting full scan\n"); + } else + IBWARN("Failed to resolve %s; attempting full scan\n", + guid_str); + } + + if (!fabric) /* do a full scan */ if ((fabric = ibnd_discover_fabric(ibmad_port, timeout_ms, NULL, -1)) == NULL) { fprintf(stderr, "discover failed\n"); rc = 1; goto close_port; } - } if (guid) { ibnd_node_t *sw = ibnd_find_node_guid(fabric, guid); diff --git a/infiniband-diags/src/ibqueryerrors.c b/infiniband-diags/src/ibqueryerrors.c index 70c3d48..525af70 100644 --- a/infiniband-diags/src/ibqueryerrors.c +++ b/infiniband-diags/src/ibqueryerrors.c @@ -427,25 +427,19 @@ main(int argc, char **argv) ib_portid_t portid = {0}; if (ib_resolve_portid_str_via(&portid, switch_guid_str, IB_DEST_GUID, - ibd_sm_id, ibmad_port) < 0) { - fprintf(stderr, "can't resolve destination port %s %p\n", - switch_guid_str, ibd_sm_id); - rc = 1; - goto close_port; - } + ibd_sm_id, ibmad_port) >= 0) { + if ((fabric = ibnd_discover_fabric(ibmad_port, ibd_timeout, &portid, 1)) == NULL) + IBWARN("Single node discover failed; attempting full scan\n"); + } else + IBWARN("Failed to resolve %s; attempting full scan\n", switch_guid_str); + } - if ((fabric = ibnd_discover_fabric(ibmad_port, ibd_timeout, &portid, 1)) == NULL) { - fprintf(stderr, "discover failed\n"); - rc = 1; - goto close_port; - } - } else { + if (!fabric) /* do a full scan */ if ((fabric = ibnd_discover_fabric(ibmad_port, ibd_timeout, NULL, -1)) == NULL) { fprintf(stderr, "discover failed\n"); rc = 1; goto close_port; } - } report_suppressed(); 
-- 1.5.4.5 From robert.j.woodruff at intel.com Thu Apr 30 15:01:18 2009 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 30 Apr 2009 15:01:18 -0700 Subject: [ofa-general] New proposal for memory management In-Reply-To: References: <382A478CAD40FA4FB46605CF81FE39F42B5C8631@orsmsx507.amr.corp.intel.com> Message-ID: <382A478CAD40FA4FB46605CF81FE39F42B5C876E@orsmsx507.amr.corp.intel.com> Brian wrote, >There's an application at Sandia and at Los Alamos which both of which cause >problems for our linker tricks. This leads to such things as (proven) >silent data corruption. Perhaps your users are just getting silent data >corruption and not doing enough validation and verification to know it? Or >maybe Intel's just gotten lucky - the majority of our applications seem to >have no issue with the registration cache. But the outliers with proven >data corruption are the kind of things that keep me up at night. Have you tried these applications with any MPI other than OpenMPI ? i.e., does this corruption happen with Intel MPI and other MPIs as well? If it is specific to OpenMPI, then perhaps it is just a bug in OpenMPI that can be fixed. >We came with a real problem we're having with code development in real-world >applications, presented two solutions, and were essentially told to take a >hike. If this sounds like a lot of whining to the OFA community, than the >OFA community shouldn't be surprised that the VERBS adoption rate is as poor >as it is. Of the solutions that have been presented so far, I think the kernel notifier approach would be a better solution. Besides the kernel bloat and complexity of a memory registration cache in the kernel, I am not sure it would really be able to work the way you would want. For example, the kernel has no way of knowing when some application calls free(), i.e., free() may not call the kernel to release the memory back to the kernel. It often just puts the free'd memory block on a free list within libc in user-space. 
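A minimal sketch of the user-space side of this argument, assuming a fixed-size table and hypothetical names (reg_cache_insert, reg_cache_lookup, reg_cache_invalidate; none of these are libibverbs or kernel calls). It only illustrates the bookkeeping: a free() that stays inside libc triggers nothing, so an entry remains valid until something reports that the address range itself has changed, for example an munmap hook or a kernel notification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names here are hypothetical; this is not the verbs API. */
#define REG_CACHE_SLOTS 16

struct reg_entry {
    uintptr_t start; /* first byte of the registered range */
    size_t    len;   /* length in bytes; 0 marks an empty slot */
};

struct reg_cache {
    struct reg_entry slot[REG_CACHE_SLOTS];
};

/* Return 1 if [addr, addr + len) lies entirely inside a cached entry. */
int reg_cache_lookup(const struct reg_cache *c, uintptr_t addr, size_t len)
{
    assert(c != NULL);
    for (size_t i = 0; i < REG_CACHE_SLOTS; i++) {
        const struct reg_entry *e = &c->slot[i];
        if (e->len != 0 && addr >= e->start &&
            addr + len <= e->start + e->len)
            return 1;
    }
    return 0;
}

/* Record a registration. Return 0 on success, -1 if the table is full. */
int reg_cache_insert(struct reg_cache *c, uintptr_t addr, size_t len)
{
    assert(c != NULL && len > 0);
    for (size_t i = 0; i < REG_CACHE_SLOTS; i++) {
        if (c->slot[i].len == 0) {
            c->slot[i].start = addr;
            c->slot[i].len = len;
            return 0;
        }
    }
    return -1;
}

/*
 * Drop every entry that overlaps [addr, addr + len). Intended to be
 * called when the mapping changes, not on every free().
 * Returns the number of entries dropped.
 */
int reg_cache_invalidate(struct reg_cache *c, uintptr_t addr, size_t len)
{
    int dropped = 0;

    assert(c != NULL);
    for (size_t i = 0; i < REG_CACHE_SLOTS; i++) {
        struct reg_entry *e = &c->slot[i];
        if (e->len != 0 && addr < e->start + e->len &&
            e->start < addr + len) {
            e->len = 0;
            dropped++;
        }
    }
    return dropped;
}
```

In a real MPI the insert path would wrap the actual registration call and the invalidate path would also deregister the region; both are left out here because the thread does not settle on an interface.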
Thus, if we had a kernel memory registration cache, from the kernel's perspective, this block of memory would appear to still be in use and could not be evicted from the cache. Thus the cache could end up filling up with lots of registrations for memory that has already been free()'d in user-space but are sitting on some free list in libc. This is another reason why I think the caching should be done in user-space. If the hooks do not exist in libc to hook all of the appropriate routines, then perhaps you should ask the libc maintainers to add what you need with perhaps the addition of the kernel notifier design that Roland suggested. Anyway that would be my suggestion, woody From bwbarre at sandia.gov Thu Apr 30 15:10:28 2009 From: bwbarre at sandia.gov (Barrett, Brian W) Date: Thu, 30 Apr 2009 16:10:28 -0600 Subject: [ofa-general] New proposal for memory management In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F42B5C876E@orsmsx507.amr.corp.intel.com> Message-ID: On 4/30/09 16:01 , "Woodruff, Robert J" wrote: > Brian wrote, > >> There's an application at Sandia and at Los Alamos which both of which cause >> problems for our linker tricks. This leads to such things as (proven) >> silent data corruption. Perhaps your users are just getting silent data >> corruption and not doing enough validation and verification to know it? Or >> maybe Intel's just gotten lucky - the majority of our applications seem to >> have no issue with the registration cache. But the outliers with proven >> data corruption are the kind of things that keep me up at night. > > Have you tried these applications with any MPI other than OpenMPI ? > i.e., does this corruption happen with Intel MPI and other MPIs as well? > If it is specific to OpenMPI, then perhaps it is just a bug in OpenMPI that > can be fixed. Of course it's a bug in Open MPI - it's one we can't fix (and is in MVAPICH as well), but it's a bug in the MPI just the same.
I'm not in a position to run such an app against Intel MPI, nor do I know what tricks you play to compare it to MVAPICH and Open MPI. >> We came with a real problem we're having with code development in real-world >> applications, presented two solutions, and were essentially told to take a >> hike. If this sounds like a lot of whining to the OFA community, then the >> OFA community shouldn't be surprised that the VERBS adoption rate is as poor >> as it is. > > Of the solutions that have been presented so far, > I think the kernel notifier approach would be a better solution. > > Besides the kernel bloat and complexity of a memory registration cache > in the kernel, I am not sure it would really be able to work the > way you would want. For example, the kernel has no way of knowing > when some application calls free(), i.e., free() may not call the kernel > to release the memory back to the kernel. It often just puts the free'd > memory block on a free list within libc in user-space. > > Thus, if we had a kernel memory registration cache, from the kernel's > perspective, this block of memory would appear to still be in use and > could not be evicted from the cache. Thus the cache could end up filling > up with lots of registrations for memory that has already been > free()'d in user-space but are sitting on some free list in libc. I don't actually care about the user calling free(). I care about events that could change the virtual to physical mapping. Intercepting free() is actually really painful from a performance standpoint, as most apps (especially C++ apps) call malloc/free orders of magnitude more times than the app gives memory back to the kernel. > This is another reason why I think the caching should be done in > user-space. If the hooks do not exist in libc to hook > all of the appropriate routines, then perhaps you should ask the > libc maintainers to add what you need with perhaps the addition > of the kernel notifier design that Roland suggested.
Have you talked to the glibc malloc developers? They're worse than this group. And since I have to intercept munmap and the SysV calls as well, there are hooks in functions that really should be simple syscall wrappers, which even in my opinion isn't defensible to add. I'm done now. You don't want to fix your crap, that's fine. Just don't be surprised by the continued "why you shouldn't use IB" presentations from people who have to write applications to it. Brian -- Brian W. Barrett Dept. 1423: Scalable System Software Sandia National Laboratories From jim.ryan at intel.com Thu Apr 30 15:12:07 2009 From: jim.ryan at intel.com (Ryan, Jim) Date: Thu, 30 Apr 2009 15:12:07 -0700 Subject: [ofa-general] RE: [mwg] Re: RDMA tutorial and OFA In-Reply-To: <35AAF1E4A771E142979F27B51793A4885D9B5E36A0@AVEXMB1.qlogic.org> Message-ID: <3F6F638B8D880340AB536D29CD4C1E19450E0496@orsmsx501.amr.corp.intel.com> At the risk of piling on, I think what Lloyd is suggesting is very important. The objections I continue to hear about programming using RDMA are along the lines of "it's too hard" or "no one knows how to do it". It occurs to me if we could provide some concise instruction, that, coupled with the undeniable benefits of RDMA, could provide a compelling package for "RDMA for the masses" thanks, Jim ________________________________ From: mwg-bounces at lists.openfabrics.org [mailto:mwg-bounces at lists.openfabrics.org] On Behalf Of Lloyd Dickman Sent: Thursday, April 30, 2009 1:17 PM To: arkady kanevsky; bill.boas at openfabrics.org Cc: iwg at lists.openfabrics.org; Paul Grun; OFA at lists.openfabrics.org; Paul Gray; Working Group; Wayne Augsburger; Andy Grover; Richard Frank; Jeff at lists.openfabrics.org; Squyres; Mikkel Hagen; Scott at lists.openfabrics.org; general at lists.openfabrics.org; Friedman; bobs at voltaire.com; Sumanta Chatterjee; asafs`@voltaire.com; Roland Dreier Subject: RE: [mwg] Re: RDMA tutorial and OFA I support the idea of the RDMA tutorial. 
Beyond the "meat" as described below, I would encourage the tutorial to include a "how to program RDMA" section. While OFA Verbs provides a rich set of mechanisms, it is difficult for the average programmer to get a solid handle on how to use the capabilities, register memory, ... Some cookbook examples, or perhaps development of several programming "patterns" can go a long way to having RDMA become a much more mainstream application programming paradigm. Lloyd From: mwg-bounces at lists.openfabrics.org [mailto:mwg-bounces at lists.openfabrics.org] On Behalf Of arkady kanevsky Sent: Thursday, April 30, 2009 11:27 AM To: bill.boas at openfabrics.org Cc: iwg at lists.openfabrics.org; Paul Grun; Paul Gray; OFA Marketing Working Group; Wayne Augsburger; Andy Grover; Richard Frank; asafs`@voltaire.com; Jeff Squyres; Mikkel Hagen; general at lists.openfabrics.org; Scott Friedman; bobs at voltaire.com; Sumanta Chatterjee; Roland Dreier Subject: [mwg] Re: RDMA tutorial and OFA Keep me in the loop. I am interested in doing it also. Thanks, Arkady On Thu, Apr 30, 2009 at 1:39 PM, Bill Boas wrote: Richard, Andy, Thanks for copying me, Richard. I had not seen Andy's email on the general list. Figuring out how to get tutorial and other documentation created and published is on the list of things to get done in 2009 for me in my part-time role as Exec. Dir. There is no funding set up for this at the moment, but I believe there will be in about 30 days. That's because I'm thinking that we can get funding for this by making it part of the funding for a new marketing plan for OFA that, with Wayne Augsburger and Jim Ryan, we are preparing for the OFA Board to vote on at the next con-call meeting which is on May 20 at 9.00AM PDT.
Would you be willing to work with me and create a small team from others within OFA who have the same interest to prepare a description by May 20 of what the tutorial would look like, who would contribute to it, how to get it "polished up" for web and/or book style publication, what the overall costs would be, etc.? My thoughts, which could be a starting point for the team's work, are that we would make the creation a collective effort. The tutorial would have several sections, for example general intro, benefits of RDMA, applicability in HPC and Enterprise, networking background, etc. Members of the Marketing Working Group would be responsible for this. The "meat" would be sections for kernel level things (verbs etc.), then user space things (verbs etc.), then APIs like MPI, SDP, EDS etc. - each section overseen by the technical leaders/maintainers of the code within OFA for that section (for example, Tom Talpey for NFSoRDMA, or you, Richard, for RDS). Finally, the tutorial would have sections about the Interoperability Testing that OFA/IOL does but also what customers can do on their own systems - Arkady and Rupert and IOL have put in an SC09 tutorial proposal that we could leverage in this section. To all readers of this email: If you have read this far, please give us all some feedback. If you have material you'd like to contribute, please say so. If there's a better way, tell us what you think it is! Thanks, Bill. Bill Boas Executive Director and Vice Chair OpenFabrics Alliance 510-375-8840 Bill.Boas at openfabrics.org www.openfabrics.org -----Original Message----- From: Richard Frank [mailto:richard.frank at oracle.com] Sent: Wednesday, April 29, 2009 12:58 PM To: Andy Grover Cc: Bill Boas; Sumanta Chatterjee Subject: Re: RDMA tutorial and OFA Andy, I saw your postings to ofa-general on this and I agree it would be great to have this documentation. As OpenFabrics is really about RDMA... we need to make it simpler for folks to pick up and run with RDMA concepts ...vs..
digging thru the IB specs and code examples, etc. Let's see what Bill Boas thinks...perhaps OFA has a writer on board that can help us do this..? I can also help provide input for a new OFA RDMA tutorial doc.. Rick Andy Grover wrote: > Hi Rick, > > Are you around for a brief chat this afternoon? I have a crazy idea that > involves OFA doing something (or putting up $$) and I wanted to see what > you thought, since you're Oracle's OFA rep, right? > > -- Andy > > -- Cheers, Arkady Kanevsky From robert.j.woodruff at intel.com Thu Apr 30 15:19:18 2009 From: robert.j.woodruff at intel.com (Woodruff, Robert J) Date: Thu, 30 Apr 2009 15:19:18 -0700 Subject: [ofa-general] RPM / build environment In-Reply-To: References: Message-ID: <382A478CAD40FA4FB46605CF81FE39F42B5C87B7@orsmsx507.amr.corp.intel.com> Roland wrote, >Wouldn't posting patches for review and eventual merge to the upstream >kernel be a better first step, rather than worrying about the OFED build >scripts? > - R. Yes, in general we have agreed in the EWG that any code that goes into OFED should first be reviewed and accepted for upstream inclusion (or at least be accepted by Roland for a future kernel.org kernel.) before it goes into OFED. Given that the feature freeze for OFED 1.5 is coming up pretty soon, I would suggest starting this process as soon as you can if you want to try to get this driver into OFED 1.5. As for understanding the OFED build environment, Vlad from Mellanox is the maintainer of the install.pl script and build/install process.
woody From pgrun at systemfabricworks.com Thu Apr 30 12:41:51 2009 From: pgrun at systemfabricworks.com (Paul Grun) Date: Thu, 30 Apr 2009 12:41:51 -0700 Subject: [ofa-general] RE: RDMA tutorial and OFA In-Reply-To: <11F9001990984F12B589A718D8976045@BillGWAYLAPTOP> References: <49F8A59C.3070001@oracle.com> <49F8B149.2050904@oracle.com> <11F9001990984F12B589A718D8976045@BillGWAYLAPTOP> Message-ID: <004301c9c9cb$b714af60$253e0e20$@com> In any event, I would be happy to volunteer to serve as an editor. Since I'm a hardware guy I'm probably not a good choice to volunteer to author any of the technical sections. Did you really mean a tutorial, which to me implies a presentation/classroom format, or did you mean tutorial in the more colloquial sense of capturing the RDMA folklore in written form? -Paul Paul Grun Chief Scientist System Fabric Works, Inc Office: (503) 620-8757 Cell : (503) 703-5382 Fabric Computing that works -----Original Message----- From: Bill Boas [mailto:Bill.Boas at openfabrics.org] Sent: Thursday, April 30, 2009 10:39 AM To: 'Richard Frank'; 'Andy Grover' Cc: 'Sumanta Chatterjee'; general at lists.openfabrics.org; 'Wayne Augsburger'; 'Ryan, Jim'; 'Paul Gray'; 'Paul Grun'; bobs at voltaire.com; Scott Friedman; 'Jeff Squyres'; 'Roland Dreier'; asafs`@voltaire.com; 'Rupert Dance'; Mikkel Hagen; 'arkady kanevsky'; 'OFA Marketing Working Group'; iwg at lists.openfabrics.org Subject: RE: RDMA tutorial and OFA Richard, Andy, Thanks for copying me, Richard. I had not seen Andy's email on the general list. Figuring out how to get tutorial and other documentation created and published is on the list of things to get done in 2009 for me in my part-time role as Exec. Dir. There is no funding set up for this at the moment, but I believe there will be in about 30 days.
That's because I'm thinking that we can get funding for this by making it part of the funding for a new marketing plan for OFA that, with Wayne Augsburger and Jim Ryan, we are preparing for the OFA Board to vote on at the next con-call meeting which is on May 20 at 9.00AM PDT. Would you be willing to work with me and create a small team from others within OFA who have the same interest to prepare a description by May 20 of what the tutorial would look like, who would contribute to it, how to get it "polished up" for web and/or book style publication, what the overall costs would be, etc.? My thoughts, which could be a starting point for the team's work, are that we would make the creation a collective effort. The tutorial would have several sections, for example general intro, benefits of RDMA, applicability in HPC and Enterprise, networking background, etc. Members of the Marketing Working Group would be responsible for this. The "meat" would be sections for kernel level things (verbs etc.), then user space things (verbs etc.), then APIs like MPI, SDP, EDS etc. - each section overseen by the technical leaders/maintainers of the code within OFA for that section (for example, Tom Talpey for NFSoRDMA, or you, Richard, for RDS). Finally, the tutorial would have sections about the Interoperability Testing that OFA/IOL does but also what customers can do on their own systems - Arkady and Rupert and IOL have put in an SC09 tutorial proposal that we could leverage in this section. To all readers of this email: If you have read this far, please give us all some feedback. If you have material you'd like to contribute, please say so. If there's a better way, tell us what you think it is! Thanks, Bill.
Bill Boas Executive Director and Vice Chair OpenFabrics Alliance 510-375-8840 Bill.Boas at openfabrics.org www.openfabrics.org -----Original Message----- From: Richard Frank [mailto:richard.frank at oracle.com] Sent: Wednesday, April 29, 2009 12:58 PM To: Andy Grover Cc: Bill Boas; Sumanta Chatterjee Subject: Re: RDMA tutorial and OFA Andy, I saw your postings to ofa-general on this and I agree it would be great to have this documentation. As OpenFabrics is really about RDMA... we need to make it simpler for folks to pick up and run with RDMA concepts ...vs.. digging thru the IB specs and code examples, etc. Let's see what Bill Boas thinks...perhaps OFA has a writer on board that can help us do this..? I can also help provide input for a new OFA RDMA tutorial doc.. Rick Andy Grover wrote: > Hi Rick, > > Are you around for a brief chat this afternoon? I have a crazy idea that > involves OFA doing something (or putting up $$) and I wanted to see what > you thought, since you're Oracle's OFA rep, right? > > -- Andy > > From lloyd.dickman at qlogic.com Thu Apr 30 13:17:04 2009 From: lloyd.dickman at qlogic.com (Lloyd Dickman) Date: Thu, 30 Apr 2009 13:17:04 -0700 Subject: [ofa-general] RE: [mwg] Re: RDMA tutorial and OFA In-Reply-To: <517c62fb0904301126n277faae7ye816eb496d985945@mail.gmail.com> References: <49F8A59C.3070001@oracle.com> <49F8B149.2050904@oracle.com> <11F9001990984F12B589A718D8976045@BillGWAYLAPTOP> <517c62fb0904301126n277faae7ye816eb496d985945@mail.gmail.com> Message-ID: <35AAF1E4A771E142979F27B51793A4885D9B5E36A0@AVEXMB1.qlogic.org> I support the idea of the RDMA tutorial. Beyond the "meat" as described below, I would encourage the tutorial to include a "how to program RDMA" section. While OFA Verbs provides a rich set of mechanisms, it is difficult for the average programmer to get a solid handle on how to use the capabilities, register memory, ... 
Some cookbook examples, or perhaps development of several programming "patterns" can go a long way to having RDMA become a much more mainstream application programming paradigm. Lloyd From: mwg-bounces at lists.openfabrics.org [mailto:mwg-bounces at lists.openfabrics.org] On Behalf Of arkady kanevsky Sent: Thursday, April 30, 2009 11:27 AM To: bill.boas at openfabrics.org Cc: iwg at lists.openfabrics.org; Paul Grun; Paul Gray; OFA Marketing Working Group; Wayne Augsburger; Andy Grover; Richard Frank; asafs`@voltaire.com; Jeff Squyres; Mikkel Hagen; general at lists.openfabrics.org; Scott Friedman; bobs at voltaire.com; Sumanta Chatterjee; Roland Dreier Subject: [mwg] Re: RDMA tutorial and OFA Keep me in the loop. I am interested in doing it also. Thanks, Arkady On Thu, Apr 30, 2009 at 1:39 PM, Bill Boas wrote: Richard, Andy, Thanks for copying me, Richard. I had not seen Andy's email on the general list. Figuring out how to get tutorial and other documentation created and published is on the list of things to get done in 2009 for me in my part-time role as Exec. Dir. There is no funding set up for this at the moment, but I believe there will be in about 30 days. That's because I'm thinking that we can get funding for this by making it part of the funding for a new marketing plan for OFA that, with Wayne Augsburger and Jim Ryan, we are preparing for the OFA Board to vote on at the next con-call meeting which is on May 20 at 9.00AM PDT. Would you be willing to work with me and create a small team from others within OFA who have the same interest to prepare a description by May 20 of what the tutorial would look like, who would contribute to it, how to get it "polished up" for web and/or book style publication, what the overall costs would be, etc.? My thoughts, which could be a starting point for the team's work, are that we would make the creation a collective effort.
The tutorial would have several sections, for example general intro, benefits of RDMA, applicability in HPC and Enterprise, networking background, etc. Members of the Marketing Working Group would be responsible for this. The "meat" would be sections for kernel level things (verbs etc.), then user space things (verbs etc.), then APIs like MPI, SDP, EDS etc. - each section overseen by the technical leaders/maintainers of the code within OFA for that section (for example, Tom Talpey for NFSoRDMA, or you, Richard, for RDS). Finally, the tutorial would have sections about the Interoperability Testing that OFA/IOL does but also what customers can do on their own systems - Arkady and Rupert and IOL have put in an SC09 tutorial proposal that we could leverage in this section. To all readers of this email: If you have read this far, please give us all some feedback. If you have material you'd like to contribute, please say so. If there's a better way, tell us what you think it is! Thanks, Bill. Bill Boas Executive Director and Vice Chair OpenFabrics Alliance 510-375-8840 Bill.Boas at openfabrics.org www.openfabrics.org -----Original Message----- From: Richard Frank [mailto:richard.frank at oracle.com] Sent: Wednesday, April 29, 2009 12:58 PM To: Andy Grover Cc: Bill Boas; Sumanta Chatterjee Subject: Re: RDMA tutorial and OFA Andy, I saw your postings to ofa-general on this and I agree it would be great to have this documentation. As OpenFabrics is really about RDMA... we need to make it simpler for folks to pick up and run with RDMA concepts ...vs.. digging thru the IB specs and code examples, etc. Let's see what Bill Boas thinks...perhaps OFA has a writer on board that can help us do this..? I can also help provide input for a new OFA RDMA tutorial doc.. Rick Andy Grover wrote: > Hi Rick, > > Are you around for a brief chat this afternoon?
I have a crazy idea that > involves OFA doing something (or putting up $$) and I wanted to see what > you thought, since you're Oracle's OFA rep, right? > > -- Andy > > -- Cheers, Arkady Kanevsky From jgunthorpe at obsidianresearch.com Thu Apr 30 15:22:30 2009 From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe) Date: Thu, 30 Apr 2009 16:22:30 -0600 Subject: [ofa-general] Re: New proposal for memory management In-Reply-To: <23635E11-F18E-4799-9B6E-C3163000A3A3@cisco.com> References: <20090429215508.GW4431@obsidianresearch.com> <20090429222125.GX4431@obsidianresearch.com> <1241044080.3403.374.camel@chromite.mv.qlogic.com> <20090429224411.GC32114@obsidianresearch.com> <23635E11-F18E-4799-9B6E-C3163000A3A3@cisco.com> Message-ID: <20090430222230.GF32114@obsidianresearch.com> On Thu, Apr 30, 2009 at 09:52:32AM -0400, Jeff Squyres wrote: > I think Jason is the only one who is remaining at least somewhat on-topic > here. Thanks, but I have no stake in this, it is just interesting :) After reading all the postings, I think my idea to fix the verbs API to not, essentially, corrupt an existing registration when the virtual address space changes is the best bet. This slightly changes the semantics of the verbs MR to refer to virtual address space within the process, not the underlying object(s) that happen to be mapped there when the registration is made. The data corruption problem would be immediately solved by doing this and no changes at all would be needed in any MPIs. Surely a good thing? MPIs can choose to continue to hook malloc/free/etc or not, it doesn't really matter from a correctness perspective. Minimizing the amount of VM that is registered is perhaps important, perhaps not, I suspect, depending on the job..
Certainly there are fairly fast ways to do periodic garbage collection on the registration cache (i.e., by inspecting /proc/self/maps, or the glibc free block list, or whatever). Notifiers are going to be very troublesome: every time any sort of synchronous-to-user-space notifier has been proposed or implemented in the kernel it has been a disaster. I would not hold out much hope for this.. > While MPI is currently the biggest victim, this broken memory management > model is also an enormous roadblock for any other application or ULP to > write to verbs. I'm not sure this is true.. The purpose-built verbs apps I've worked on don't have a problem like MPI, and managing the memory registration was not hard. My main complaint would be the lack of FMR and FRMR in userspace - mostly because of the performance of the registration functions.. Jason From aafabbri at cisco.com Thu Apr 30 16:20:26 2009 From: aafabbri at cisco.com (Aaron Fabbri (aafabbri)) Date: Thu, 30 Apr 2009 16:20:26 -0700 Subject: [ofa-general] New proposal for memory management In-Reply-To: <48644923-A2EF-4C4F-8F9A-B1658BAA36FE@cisco.com> References: <48644923-A2EF-4C4F-8F9A-B1658BAA36FE@cisco.com> Message-ID: <3B6F5C9068D5864096EB291236C3386F058EDB8C@xmb-sjc-21d.amer.cisco.com> > -----Original Message----- > From: Jeff Squyres (jsquyres) > Sent: Thursday, April 30, 2009 6:19 AM > To: Aaron Fabbri (aafabbri) > Cc: general at lists.openfabrics.org > Subject: Re: [ofa-general] New proposal for memory management > > On Apr 30, 2009, at 1:30 AM, Aaron Fabbri (aafabbri) wrote: > > > Have you considered changing the MPI API to require applications to > > use MPI to allocate any/all buffers that may be used for > network I/O? > > That is, instead of calling malloc() et al., call a new > mpi_malloc() > > which allocates from pre-registered memory. > > > Yes, MPI_ALLOC_MEM / MPI_FREE_MEM calls have been around for > a long time (~10 years?). Using them does avoid many of the > problems that have been discussed.
Most (all?) MPIs either > support ALLOC_MEM / FREE_MEM by registering at allocation > time and unregistering at free time, or some variation of that. > Ah. Are there any problems that are not addressed by having MPI own allocation of network bufs? (BTW registering for each allocation could be improved, I think.) > But unfortunately, very few MPI apps use these calls; they use > malloc() and friends instead. Or they're written in Fortran, > where such concepts are not easily mapped (don't > underestimate how much Fortran MPI code runs on verbs!). > Indeed, in some layered scenarios, it's not easy to use these > calls (e.g., if an MPI-enabled computational library may > re-use user-provided buffers because they're so large, etc.). I understand the difficulty. A couple of possible counterpoints: 1. Make the next version of the MPI spec *require* using the mpi_alloc stuff. 2. MPI already requires recompilation of apps, right? I don't know Fortran, or what it uses for allocation, but worst case, maybe you could change the standard libraries or compilers. 3. Rip out your registration cache. Make malloc'd buffers go really slow (register in fast path) and mpi_alloc_mem() buffers go really fast. People will migrate. The hard part of this would be getting all MPIs to agree on this, I'm guessing. Aaron From gregkh at suse.de Thu Apr 30 15:16:42 2009 From: gregkh at suse.de (Greg Kroah-Hartman) Date: Thu, 30 Apr 2009 15:16:42 -0700 Subject: [ofa-general] [PATCH] infiniband: remove driver_data direct access of struct device Message-ID: <20090430221642.GA18492@kroah.com> From: Greg Kroah-Hartman In the near future, the driver core is going to not allow direct access to the driver_data pointer in struct device. Instead, the functions dev_get_drvdata() and dev_set_drvdata() should be used. These functions have been around since the beginning, so are backwards compatible with all older kernel versions.
Cc: general at lists.openfabrics.org Cc: Roland Dreier Cc: Hal Rosenstock Cc: Sean Hefty Signed-off-by: Greg Kroah-Hartman --- drivers/infiniband/core/sysfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/drivers/infiniband/core/sysfs.c +++ b/drivers/infiniband/core/sysfs.c @@ -760,9 +760,9 @@ int ib_device_register_sysfs(struct ib_d int i; class_dev->class = &ib_class; - class_dev->driver_data = device; class_dev->parent = device->dma_device; dev_set_name(class_dev, device->name); + dev_set_drvdata(class_dev, device); INIT_LIST_HEAD(&device->port_list); From Jie.Cai at cs.anu.edu.au Thu Apr 30 21:35:45 2009 From: Jie.Cai at cs.anu.edu.au (Jie Cai) Date: Fri, 01 May 2009 14:35:45 +1000 Subject: [ofa-general] uDAPL DTO completion question. In-Reply-To: <469958e00904010852ydaf5b07wccdde27dd02ca724@mail.gmail.com> References: <49D2BD00.5010002@cs.anu.edu.au> <469958e00903312040j7700d2ccr9104996c2fc29cd4@mail.gmail.com> <517c62fb0903312253w6344d62j1b8c072354b15ad2@mail.gmail.com> <49D30C7F.1050201@cs.anu.edu.au> <469958e00904010852ydaf5b07wccdde27dd02ca724@mail.gmail.com> Message-ID: <49FA7C21.1050400@cs.anu.edu.au> Thanks for the help with my understanding of IB and uDAPL. Is it possible to spin on remote memory that is the target of an RDMA atomic compare and swap (dat_ib_post_cmp_and_swap())? I have written a program in which the initiator atomically compare-and-swaps a value into remote memory. Instead of the initiator sending a message to notify the remote side that the cmp_swap has completed, the remote side spins on the memory to test for the update (e.g. while(target == 0);). However, on the remote side this while loop runs forever, even though the initiator has already received DAT_IB_DTO_EVENT. I don't really understand what's going on, or what the correct way would be to spin on memory to check for remote write updates. Any suggestions? Regards, Jie -- Jie Cai Caitlin Bestler wrote: > On Tue, Mar 31, 2009 at 11:41 PM, Jie Cai wrote: > >> Understood now. Here is a further question.
>> >> To implement a software-level acknowledgment that informs the initiator that the data >> has become available at the remote side, is it possible to use a busy loop on the >> remote >> side to detect that the last element of the transfer has appeared in memory? >> >> Or does the remote side have to wait for the recv event matching the initiator's send, and then >> send a message back to the initiator as an acknowledgment? >> >> > > There are two issues when spinning on a remote memory update. > > The first is that packets may be received and processed out of order, > especially for iWARP. Therefore the fact that the last byte has been > received and placed does not guarantee that the prior packets have > been received and placed. > > More importantly, the order in which updates become visible to a > specific software thread can make the order of updates unpredictable > to the application. > > When delivering a completion the Provider is responsible for dealing > with both of these problems. So when you reap a completion from the > CQ, the operation it represents (and all prior operations) are complete. > There are no gaps in received packets, nothing is still sitting on an > Adapter buffer waiting to be placed in host memory. > > If your application does not want to block you can consider polling > the CQ rather than enabling notifications. But polling memory locations > directly should only be done when you're willing to have bus/adapter- > specific dependencies. Your working code might stop working when > your network changes, or you install a new Adapter that has a different > strategy for optimizing its writes over the PCIe bus. >
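A note on the spin loop discussed in this thread: separate from the packet-ordering and adapter issues described in the quoted reply, a plain while(target == 0); on an ordinary C variable can also spin forever for a purely software reason, because the compiler is allowed to read target once and never reload it. The sketch below is an illustration only and rests on several assumptions: a second thread stands in for the remote writer (a real adapter write arrives by DMA, not through a C11 atomic store), and the names mailbox, writer_thread and run_handoff_demo are hypothetical. It shows just the CPU-side half of the pattern: publish the payload first, then the flag, and read the flag with an acquire load so that the payload is visible once the flag is seen.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a buffer that some other agent fills in. */
struct mailbox {
    uint64_t payload;   /* data that must be written before the flag */
    atomic_uint flag;   /* 0 = empty, nonzero = payload is ready */
};

/* Simulated writer: a local thread, NOT a real RDMA adapter. */
static void *writer_thread(void *arg)
{
    struct mailbox *mb = arg;

    mb->payload = 42;
    /* Release store: the payload write above cannot be reordered after it. */
    atomic_store_explicit(&mb->flag, 1, memory_order_release);
    return NULL;
}

/* Spin until the flag is set, then return the payload that was observed. */
uint64_t run_handoff_demo(void)
{
    struct mailbox mb;
    pthread_t tid;
    uint64_t seen;

    mb.payload = 0;
    atomic_init(&mb.flag, 0);

    if (pthread_create(&tid, NULL, writer_thread, &mb) != 0)
        return 0;   /* could not start the simulated writer */

    /* Acquire load: forces a fresh read on every iteration and orders
     * the payload read below after the flag read. */
    while (atomic_load_explicit(&mb.flag, memory_order_acquire) == 0)
        ;   /* busy-wait */

    seen = mb.payload;
    pthread_join(tid, NULL);
    return seen;
}
```

This can be built with something like cc -std=c11 -pthread and driven by calling run_handoff_demo() from main. For real RDMA traffic it does not replace reaping the completion, which per the advice quoted above remains the portable way to know the data has been placed.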